
University of Tunis El Manar

National Engineering School of Tunis

Department of Industrial Engineering

Project of 2nd Year - Modeling for Industry and Services

Machine Learning for Sentiment Analysis: Twitter Case

Authors: Anis FAKHFAKH, Iyed AMOR, Mohamed TOUZI
Supervisor: Mrs. Imen BOUDALI

University year: 2019/2020


Acknowledgements

First and foremost, praise and thanks to God, the Almighty. We would also like to express
our sincere gratitude to our advisor, Prof. Imen BOUDALI, for her continuous support of our
project study and research, and for her patience, motivation, enthusiasm and immense knowledge.
Her guidance helped us throughout the research and the writing of this report. We could not have
imagined having a better advisor and mentor for our project.

Contents

1 Introduction to Machine Learning 10


1.1 Relationships with other disciplines . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.1 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.2 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.3 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.4 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Machine Learning procedure . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.3 Need for Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.4 Types of Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Machine learning techniques 23


2.1 Supervised machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Unsupervised machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 About Deep Learning 33


3.1 The concept of ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Deep learning vs Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Classification of ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


3.3.1 Multi-Layer Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


3.3.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.3 Auto-encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4 Sentiment Analysis by using Machine Learning Techniques 43


4.1 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 What is a linear model? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 What is Logistic Regression? . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.3 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Hard-margin linear SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 Soft Margin linear SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 SVM in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.4 Non Linear SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5 Natural Language Processing 58


5.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Tokenizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Word Normalization, Lemmatization and Stemming . . . . . . . . . . . . . . . . . 59
5.4 Sentence segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6 Implementation and Simulation Results 63


6.1 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.1.1 HTML Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.1.2 Mentions and hashtags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1.3 URLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1.4 cp1252 encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1.5 Data Clean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2 Data visualisation and data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3 Train/Test process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.1 Count Vectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84


6.3.2 TFIDF Vectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93


6.4 Models test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

List of Figures

1.1 Data Science vs Big Data vs Data Mining [36] . . . . . . . . . . . . . . . . . . . . 14


1.2 Classical approach for problem solving [23] . . . . . . . . . . . . . . . . . . . . . 15
1.3 Problem Solving with machine learning approach [23] . . . . . . . . . . . . . . . 15
1.4 Data Update feature in Machine learning [23] . . . . . . . . . . . . . . . . . . . . 16
1.5 Machine Learning Procedure [30] . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Example of supervised training [23] . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.7 Mechanism of Unsupervised learning [23] . . . . . . . . . . . . . . . . . . . . . . 19
1.8 Mechanism of Semi supervised learning [23] . . . . . . . . . . . . . . . . . . . . 19
1.9 Mechanism of Reinforcement learning [23] . . . . . . . . . . . . . . . . . . . . . 20
1.10 Mechanism of Online learning [23] . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.11 Mechanism of Instance-based learning [23] . . . . . . . . . . . . . . . . . . . . . 22
1.12 Mechanism of model-based learning [23] . . . . . . . . . . . . . . . . . . . . . . 22

2.1 Classification of students based on their marks and final grades [1] . . . . . . . . 24
2.2 housing price [36] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Non clustered sequence [32] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Comparison between 2 different Clustering outputs [32] . . . . . . . . . . . . . . 30

3.1 ANNs structure [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.2 Training process [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 ML and DL Data dependency [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 How does a perceptron function [23] . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Multi-layer Perceptron structure [23] . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Initial ball position [5] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 The prediction of the next position [6] . . . . . . . . . . . . . . . . . . . . . . . . 39


3.8 Separate input networks [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


3.9 The RNN structure [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.10 The looping mechanism of a RNN [5] . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.11 Auto encoder structure [25] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1 Sigmoid function curve [21] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


4.2 Gradient descent explained in one dimension [7] . . . . . . . . . . . . . . . . . . 48
4.3 Local minimum ≠ Global minimum [7] . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Different classifiers of the data set [23] . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 margin comparison between Scaled/unscaled data [23] . . . . . . . . . . . . . . 51
4.6 Main issues regarding SVM linear regression [23] . . . . . . . . . . . . . . . . . . 53
4.7 Comparison between both cases [23] . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.8 SVM implementation on Python [23] . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.9 Effect of adding features on linear separability [23] . . . . . . . . . . . . . . . . 56
4.10 Different Kernels representation [28] . . . . . . . . . . . . . . . . . . . . . . . . . 57

6.1 Most frequent words(positive labels) using WordCloud . . . . . . . . . . . . . . . 70


6.2 Frequency of Most frequent words for positive labels . . . . . . . . . . . . . . . . 71
6.3 Most frequent words(negative labels) using WordCloud . . . . . . . . . . . . . . . 72
6.4 Frequency of Most frequent words for negative labels . . . . . . . . . . . . . . . 72
6.5 Frequency of Most frequent words . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.6 Most 500 frequent words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.7 Verification of the Near-Zipf distribution theory . . . . . . . . . . . . . . . . . . . 76
6.8 Frequency of the most 50 frequent words(positive labels) - Stop words removed . 79
6.9 Frequency of the most 50 frequent words(negative labels) - Stop words removed 80
6.10 Accuracy comparison for Logistic Regression models, uni-grams - Count Vectorizer 85
6.11 Accuracy comparison for SVM models, uni-grams - Count Vectorizer . . . . . . . . 87
6.12 Accuracy comparison for the chosen SVM and Logistic Regression models, uni-
grams - Count vectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.13 Accuracy comparison (uni-grams and bi-grams) for Logistic Regression model -
Count Vectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.14 Accuracy comparison (uni-grams, bi-grams and tri-grams) for Logistic Regression
model - Count Vectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92


6.15 Accuracy comparison (uni-grams, bi-grams and tri-grams) for Logistic Regression
model - TFIDF Vectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.16 Accuracy comparison (uni-grams, bi-grams and Trigrams) for Logistic Regression
model - Count Vectorizer and TFIDF Vectorizer . . . . . . . . . . . . . . . . . . . 95

List of Tables

2.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

6.1 Logistic Regression, Count Vectorizer, Uni-grams . . . . . . . . . . . . . . . . . . . 97


6.2 Logistic Regression, Count Vectorizer, Bi-grams . . . . . . . . . . . . . . . . . . . 98
6.3 Logistic Regression, Count Vectorizer, Tri-grams . . . . . . . . . . . . . . . . . . . 98
6.4 Logistic Regression, TFIDF, Uni-gram . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5 Logistic Regression, TFIDF, Bi-grams . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6 Logistic Regression, TFIDF, Tri-grams . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.7 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

Introduction

Twitter is an online platform that allows more than 350 million users around the world to navigate
and communicate through posts called tweets. As we browse tweets on the Internet, we find some
expressing anger, excitement or any other kind of reaction. Our mind can easily analyze these
tweets and guess what sentiments they convey.
Tweets are not the only subject of such analysis: IMDb film reviews, Amazon reviews and many
other cases have highlighted sentiment analysis. The idea emerged as Artificial Intelligence evolved
and began to handle more complicated tasks such as emotion detection, sentiment analysis and
opinion mining.
Sentiment analysis consists in classifying natural language data as positive or negative.
In this project, we work with Twitter, since it is a rich source of hundreds of millions of tweets
per day. Besides, it has been the subject of many sentiment analysis studies, such as predicting
the winner of the latest US elections or the result of the federal elections in Germany from a
large set of tweets. Twitter has therefore proved its efficiency in reflecting sentiment.
As Data Science and Machine Learning are emerging domains, we were motivated to tackle such a
problem for our second-year project. Even though it was a partially new domain for us, we decided
to dedicate ourselves to acquiring the necessary knowledge and fundamentals, essentially the
mathematics behind it, to be able to carry out this project.
The choice of sentiment analysis itself was also challenging for us, because it is not just a simple
AI problem: sentiment analysis is a sub-domain of Natural Language Processing, which is nowadays
widely used and still developing.
The large amount of tweets we can collect is of tremendous use and is increasing every day. The
aim of this project is to present the fundamentals of Machine Learning, build models and
algorithms that predict the sentiment of a tweet, and then compare their performances.
This report is organized into six chapters as follows. In the first chapter, we introduce the
discipline of machine learning: basic concepts and types. Then, an overview of the most common
machine learning techniques is presented in the second chapter. In the third chapter, we focus
on deep learning and its basic concept of Artificial Neural Networks. In the fourth chapter, we
detail the models proposed for our sentiment analysis problem: Logistic Regression and Support
Vector Machines. The fifth chapter presents the Natural Language Processing notions used to
prepare the text data. Finally, the implementation and simulation results are discussed in the
last chapter.

Chapter 1

Introduction to Machine Learning

Introduction

In this chapter, we present the basic concepts related to machine learning: definitions and the
general procedure of learning. Relationships with other disciplines are also highlighted.
Moreover, the different types of machine learning according to some criteria are detailed.

1.1 Relationships with other disciplines

Given the multidisciplinary nature of machine intelligence, we need first to introduce some
basic concepts we need in this work such as: Artificial intelligence, data science, big data.

1.1.1 Artificial Intelligence

Artificial intelligence (AI) was originally described as the way computers and devices can simulate
the behaviour and capacities of a human brain, or even exceed them when it comes to performing
mental tasks [36].
A more modern definition views AI as the way a machine is "taught" to think similarly to a
human brain, by analysing its behaviour and environment to determine how to operate in
different situations and circumstances [36].


Artificial intelligence includes many subsets. The most common subsets are the following:

• Machine Learning and Deep Learning

• Natural Language processing

• Expert System

• Robotics

• Complex problem solving

• Machine Vision

• Speech Recognition.

Given our interest in the machine learning subset for our work, we give more details about this
discipline in Section 1.2.

1.1.2 Data Science

As computers and their capacities have considerably evolved, especially over the last decade,
manipulating massive amounts of data has become a reachable objective.
Data used to be processed manually through classical statistics. These tools may seem sufficient
for a moderate amount of data. However, as soon as we start dealing with large datasets
containing multiple types of characteristics, different forms and high values, classical
statistical tools show a need for automation, since the human brain cannot carry out the process
efficiently and within an acceptable period of time [36].
Data science has therefore emerged from the need to study large volumes of data. It is a
relatively new discipline and has grown considerably over the last years with the development
and wide availability of the computing performance necessary for collecting and storing data and
drawing patterns from large amounts of it [36].


1.1.3 Data Mining

Data mining is a technique for discovering previously unknown patterns and regularities in large
datasets, without starting from a set of hypotheses [36].
Data mining may seem very close to what machine learning refers to, and they may even use
the same algorithms, considering that they are both concerned with analysing data and extracting
insights [36].
However, there is a slight difference between them: machine learning tends to improve with
experience, while data mining focuses on discovering those unrevealed patterns.
To put it more simply, machine learning exploits the results and operates in a way that leads
to a better result or accuracy, whereas data mining focuses on exploring and searching for
knowledge that is not yet discovered. It is almost the same minor difference of approach as
between a statistician and a machine learning engineer [36].

1.1.4 Big Data

When a dataset is characterized by its volume, variety and velocity, it requires advanced ways of
processing. Big data describes how much datasets rely on technology for their analysis and
management. In other words, it is a collection of data that would only make sense to a human
with computer assistance.
The term is not about how many rows or columns a dataset has. As data increases exponentially
each day, Big Data refers to the "power of generating newer and much larger sets of data to
explore undiscovered information" [36].

1.2 Basic concepts

Machine Learning (ML) may seem like a futuristic concept or a freshly emerging discipline, but it
has existed for decades. Its best-known application may be the spam filter [35].
Today, its application fields are countless: voice recognition, face recognition, achieving
customer satisfaction through data analysis, chatbots that learn through conversations,
driverless cars, etc.
To first define the word "learning", let us take a simple experiment called "bait shyness". It is a
study of rats' behaviour when encountering food or bait they have previously tasted.


When rats encounter food with an unusual look or smell, they first eat only small amounts as a
precaution. Their next behaviour depends on the effect of that food after the first experience:
if the food caused an illness or discomfort afterwards, it is classified as poisonous or
associated with illness.
Subsequently, the rat will not eat the same food if it is ever encountered again. This shows
a simple learning mechanism: a rat using its past experience to acquire expertise in
detecting whether food is safe [32].
A good learner is able to produce a generalisation pattern that handles not only the samples
already faced but also newer and more challenging ones.
In this example, a simple rat or animal can accomplish such a task without a human brain. So,
can a machine, which is programmed with rules and patterns interpreted by the human brain,
do the same?
Can we really say that a machine learns?

1.2.1 Definition

Machine learning is at the same time a science and an art: a field of study that gives
computers the ability to learn from data without being explicitly programmed [35].
From a more engineering-oriented point of view, it is defined as follows:
A computer program learns from experience E with respect to some task T and some performance
measure P, if its performance on T, as measured by P, improves with experience E [35].
Machine learning uses statistical methods and self-improving algorithms to learn, just like
humans learn from their previous experiences, but generally more efficiently.
Sometimes, interpreting patterns or extracting information from data is not an obvious task.
With the rising availability of datasets, the demand for machine learning algorithms able to
deal with these complexities is becoming more and more important [22].


Figure 1.1 shows an overall scheme of the relationships between the introduced concepts:

Figure 1.1: Data Science vs Big Data vs Data Mining [36]

As we have seen in previous sections, machine learning is an interdisciplinary field sharing
common threads with the mathematical fields of statistics, information theory, game theory,
and optimization [32].
As a subfield of computer science, Machine learning aims to program machines to learn from
experiences. In a sense, machine learning can be viewed as a branch of AI, since it is designed
to search for the meaningful patterns in complex sensory data [32].
However, Machine Learning is not about making an automated program imitating intelligent
behaviour. Instead, it takes benefit of the special abilities of machines in processing and ma-
nipulating huge datasets to discover hidden patterns that are hard to find by human perception
[32].
We know that machine learning is about training on randomly generated data in order to
draw conclusions about the environment. This process refers to statistical approaches and
methods [32]. The two disciplines share many elements, but also show some significant
differences of emphasis.
Let us suppose that a doctor comes up with the hypothesis that there is a correlation between
smoking and heart disease.
While the statistician's role is to use the data to check the validity of the hypothesis,
machine learning aims to derive a description of the pattern linking smoking and heart
disease [32].


1.2.2 Machine Learning procedure

The classical approach for solving problems is illustrated in figure 1.2:

Figure 1.2: Classical approach for problem solving [23]

Using the machine learning approach, an algorithm is able to learn on its own, without all the
rules having to be set explicitly. Figure 1.3 shows this approach:

Figure 1.3: Problem Solving with machine learning approach [23]


Notice that the data is likely to be updated regularly, as the system learns from the user (for
example, from words repeatedly used in e-mails). A data update feature is then added to enhance
the model (see figure 1.4):

Figure 1.4: Data Update feature in Machine learning [23]

After presenting the main features of machine learning approach, we can conclude its general
procedure as illustrated in the diagram of figure 1.5:

Figure 1.5: Machine Learning Procedure [30]

As the model is fed with training data, the chosen machine learning algorithm builds its
predictive model by choosing the best parameters. This phase is called parameter tuning. Then,
the built model is used to predict new instances.
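As a very rough illustration of this procedure, the following sketch trains, tunes and uses a simple classifier with the scikit-learn library. The dataset is synthetic and the parameter grid is only an assumed example; it is not the pipeline used later in this project.

```python
# Minimal sketch of the procedure of figure 1.5: feed training data, tune
# parameters, then predict new instances (synthetic data, assumed parameter grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Synthetic labelled dataset standing in for real training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Parameter tuning: pick the best regularization strength C by cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# The tuned model is then used to predict new instances.
print(search.best_params_)
print(search.score(X_test, y_test))
```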


1.2.3 Need for Machine learning

In order to highlight the need for machine learning, the following question has to be asked:
when is explicitly programming our computer not sufficient to solve a problem [32]?

• Task complexity: Some tasks are too complex to be addressed with classical rules. Some
routine tasks may seem quite simple at first; however, our introspection about how we perform
them cannot be turned into an explicit algorithm. Such tasks include driving, speech
recognition, and image understanding.
Machine learning shows its importance here, as these tasks are performed well when a
procedure that learns from previous experience is employed; the results are satisfying
provided a sufficient amount of data representing that experience [32].
Other tasks are even beyond human capabilities. In fact, with the fast-increasing amount
of recorded data, it becomes too hard for the human brain to analyse, predict and process it.
Such tasks include weather prediction, processing astronomical data, analysis of genomic
data, etc.
Learning how to extract benefit and meaningful patterns from such large data is a promising
domain, especially with the exponential growth of the memory capacity and processing
speed of computers [32].

• Adaptivity: Classically programmed tools are often rigid and have difficulty adapting to
evolving situations.
Machine learning algorithms adapt to their input data and change from one user to another.
As they also interact with their environment, they are capable of adapting to external
changes.
This is an important advantage, especially in applications such as decoding handwritten
text, spam detection filters, speech recognition, etc. [32].

1.2.4 Types of Machine learning

Machine learning problems are classified according to several criteria in order to choose the
convenient algorithm and solving approach. The most common criteria are:

• Supervision: supervised, unsupervised, semi-supervised or reinforcement learning

• Incoming data form: batch or online learning

• Generalization method: instance-based or model-based learning


Supervision category

It indicates whether the learning is supervised, unsupervised, semi-supervised or reinforced.


Since learning involves an interaction between the learner and the environment, learning tasks
can be classified according to the nature of that interaction: the type and amount of supervision
they get during training [32].

• Supervised training: it refers to learning guided by human observation and feedback, with
known outcomes [36]. In supervised learning, the dataset is a collection of labelled
examples {(x_i, y_i), 1 ≤ i ≤ N}, which means the outcomes are already known.
Each element x_i among the N examples is called a feature vector. A feature vector is a vector
in which each dimension j = 1, ..., D contains a value that describes the example somehow.
That value is called a feature and is denoted x^(j) [21].
Let us take a simple example of a model predicting whether a person is obese, skinny or
healthy depending on their age, weight and height, and suppose we have a dataset of
1000 samples. While x^(1), x^(2) and x^(3) are respectively the age, weight and height, y is the
output (it can be divided into 3 classes: 0 for skinny, 1 for healthy and 2 for obese).
x_j^(i) denotes feature i of sample j; for example, x_100^(2) is the weight of the 100th person in
our dataset.
Supervised learning uses the dataset to produce a model that takes the feature vector
x as input and outputs the information needed to deduce the corresponding label for that
feature vector [21].
So, the input is no longer only what we used to define as input in the classical resolution
approach: even the true output values are considered as "input" for building the model here.
Supervised training can be recapitulated in figure 1.6. The example shows classified e-
mails with labels: spam or not.

Figure 1.6: Example of supervised training [23]

Supervised algorithms, which work with labelled data, include Linear Regression, Logistic
Regression, Neural Networks, Support Vector Machines, Decision Trees, Naïve Bayes,
k-Nearest Neighbours, Random Forests, etc. [36].


• Unsupervised training: When the algorithm does not need the intervention of integrated
feedback (it uses unlabelled data), it is classified as unsupervised.
In this case, the algorithm deals with the input data x and tries to cluster it in order to
make the data more practically usable.
Unsupervised learning algorithms let you discover hidden patterns within the data that
were not totally evident or which you were not aware of. Notice that clustering means
grouping together data points that possess similar features [36]. The dataset is a collection
of unlabelled examples {x_i, 1 ≤ i ≤ N}.

Figure 1.7: Mechanism of Unsupervised learning [23]

• Semi-supervised training: The dataset contains a small amount of labelled examples
and a much larger amount of unlabelled ones. The algorithm uses the unlabelled data to
compute a better model [21].
To make the concept clearer, consider photo-hosting services such as Google Photos:
when you upload group photos, the algorithm recognizes the presence of the same
person in photos 1, 8 and 21, for example. That is the unsupervised part of the algorithm,
called clustering. The supervised part consists in giving a name to each class (person), one
label per person, so that the algorithm is able to recognize all the faces in the photos [23].
Figure 1.8 illustrates this mechanism.

Figure 1.8: Mechanism of Semi supervised learning [23]


• Reinforcement learning: This type of learning shows the interaction between the agent
and the environment as shown in figure 1.9. The environment sends states to the agent
which observes them then selects and performs actions. The agent gets rewards based
on the given feedback. So, the agent learns by itself the best strategy called policy to
constantly get the most reward over time.
Reinforcement learning is used in many robots such as walking robots and DeepMind’s
AlphaGo [23].

Figure 1.9: Mechanism of Reinforcement learning [23]

Incoming data form

We can also classify a machine learning system according to whether or not it learns incrementally
from incoming data: batch learning or online learning [23].

• Batch learning: The system is incapable of learning incrementally. All the available data
is used for training at once, which takes significant processing time; the algorithm then
simply applies what it has learned. This is also called offline learning [23]. Any new data
needs to be merged with the already processed data, and the whole new dataset has to be
trained again.
However, updating the dataset and retraining from scratch every time is not practical and
consumes too much processing time and computing resources, especially when dealing with
systems whose data needs to be constantly updated and fed.

• Online learning: While batch learning does the training once and for all, online learning
is about sequentially feeding the system with data instances (mini-batches). Each learning
step is fast and cheap, allowing the system to be updated and learn about new data on
the fly, as it arrives. This type of learning is used in systems that require continuous
and relatively small data support flows in order to quickly adapt the model to the new
updates. This type of learning saves us data storage memory: Once a new data flow is
used to update the algorithm, it is possible to discard it because it is not needed anymore
[23].


The corresponding mechanism is shown in figure 1.10

Figure 1.10: Mechanism of Online learning [23]
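As an illustration of this incremental setting, the sketch below feeds mini-batches to scikit-learn's SGDClassifier through partial_fit. The mini-batches are simulated with random numbers purely for the example.

```python
# Minimal sketch of online learning: the model is updated batch by batch
# and each processed batch can then be discarded (simulated data).
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()                  # linear classifier trained by stochastic gradient descent
classes = np.array([0, 1])               # all classes must be declared on the first call

rng = np.random.default_rng(0)
for _ in range(10):                      # 10 incoming mini-batches
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))   # predictions for newly arriving data
```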

Instance or model based

We can also classify ML algorithms based on how they generalize. A machine learning algorithm
predicts a result based on a model it has built from the training set; it can be instance-based or
model-based.

• Instance-based learning: This is a lazy type of learning. The learner learns a particular
type of pattern and applies it to newly fed data (see figure 1.11). The algorithm is
based on similarities and acts on the training set by comparison; however, its complexity
increases as the training set size increases [22]. The most common example is the
spam filter, which classifies e-mails based on a list of e-mails already flagged by the user.


This requires a measure of similarity between two e-mails [23].

Figure 1.11: Mechanism of Instance-based learning [23]

• Model-based learning: This type of learning does not learn the instances by heart, but
instead generalizes from the set of examples in order to build a model and use it to
make predictions for new instances (see figure 1.12).

Figure 1.12: Mechanism of model-based learning [23]

Conclusion

In this chapter, we presented an introduction to machine learning fundamentals. Given the
discussed types of learning, we will focus in the next chapter on the most common techniques
in the literature.

Chapter 2

Machine learning techniques

Introduction

In this chapter, we briefly present the most important ML techniques according to their type:
supervised and unsupervised learning. We also discuss the approach to model generalization
and the performance measures for these techniques.

2.1 Supervised machine learning

Supervised learning is applied when we have labeled data whose output we want to predict or
explain. It includes classification and regression methods, as detailed in this section.

2.1.1 Classification

Classification is a well-known machine learning task in which the output is discrete. The best-known
algorithms are: Support Vector Machines, Logistic Regression, Naïve Bayes classifiers,
Nearest Neighbours...

Binary classification

Binary classification is the most frequently studied kind of problem in ML and, due to its wide
range of applications, it has led to important algorithmic and theoretical developments [35].
It is possible to model the output by a "True" class (1) and a "False" class (−1), so that
Y ∈ {−1, 1}.
It is generally summed up by a simple question: given data X labeled with a binary class Y, to
which class should a new unlabeled sample be assigned [35]?


In order to understand this task, let us imagine a situation where the goal is to decide whether
to accept or reject a student based on his or her grades. To make a decision, we need some
inputs from the students; in this case we take their test grades and their final-year exam
grades [1].
Here we are trying to predict, based on the data of students A and B, whether student C will
be accepted or not. We can model this situation by assigning −1 to the class Rejected and 1 to
the class Admitted. To make a decision, we can plot these inputs on a graph as shown in figure 2.1:

Figure 2.1: Classification of students based on their marks and final grades [1]

In this figure, the first thing we can observe is that the data is separated into 2 classes: admitted
students and rejected students. The line represents the considered model. The blue dots are the
data points of students who were accepted, while the red ones refer to those who were rejected.
Even if there are some classification errors, it is usual to have uncertainty in any model: it is a
probabilistic approach, not a deterministic one [1].
Based on this model, if a new student is represented above the line, he is classified as Admitted;
otherwise he is Rejected.
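A tiny sketch of such a binary classifier is given below. The grades are invented for illustration; only the idea of fitting a linear model on two grade features and classifying a new student is taken from the example above.

```python
# Toy version of the admission example: two grade features, labels 1 (Admitted)
# and -1 (Rejected). All numbers are invented for illustration.
from sklearn.linear_model import LogisticRegression

X = [[9, 10], [14, 12], [8, 7], [16, 15], [11, 13], [6, 9], [15, 11], [7, 6]]
y = [-1, 1, -1, 1, 1, -1, 1, -1]

clf = LogisticRegression().fit(X, y)

# A new student C with grades (12, 11): which side of the line does he fall on?
print(clf.predict([[12, 11]]))
```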

Multiclass classification

Multiclass classification is basically the extension of binary classification; the only difference
is the number of classes the output might take [35].
However, it is no longer about True or False. Since we may encounter more than 2 classes,
each class is opposed to the union of all the other classes. For instance, it is possible
to classify an article according to the language it was written in (English, French, German...)
based on certain input features. A transition from multiclass classification to binary
classification is possible.


From Multi-class to binary classification

Some algorithms (Random Forest, Naïve Bayes...) are capable of handling multi-class
classification (more than 2 classes) directly. Others, such as Support Vector Machines or
linear classifiers, are strictly binary classifiers. However, two strategies exist to cope
with this issue [23].

• One-versus-all classifier (OvA): Given K classes, we build K classifiers (C_k)_{1≤k≤K}. Each
classifier C_k is a binary one whose output equals True (1) if its prediction is class k, and
False (0) if its prediction is another class [23].
To classify a sample, we simply compute the score of that sample with each classifier; the
classifier with the highest score assigns its class to that sample [23].

• One-versus-one classifier (OvO): This strategy associates a classifier C_{i,j} with each pair of
distinct classes (class i and class j), giving a total of K(K−1)/2 binary classifiers,
where the classifier C_{i,j} outputs True if class i is predicted and False if class j is
predicted [23].

We notice that each classifier C_{i,j} is trained only on the samples with labels i and j, so C_{i,j}
and C_{j,i} share the same "mini training dataset".
The class with the highest number of votes (the one that got predicted the most) becomes the
predicted class of the combined classifier [23].
This strategy is somewhat costly in computing time and resources, since the number of trained
classifiers increases considerably with the number of classes.
In most classification problems, the OvA classifier is preferred. Both strategies are illustrated
in the sketch below.
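A short sketch of the two strategies with scikit-learn's meta-estimators is given below; the iris dataset (3 classes) is only used here as a convenient example, not as data from this project.

```python
# OvA vs OvO around a strictly binary classifier (a linear SVM).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)          # K = 3 classes

ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # K binary classifiers
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)    # K(K-1)/2 binary classifiers

print(len(ova.estimators_), len(ovo.estimators_))  # 3 and 3 here, since 3*2/2 = 3
```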

Performance

Evaluation is very important in order to obtain the optimal classifier for a dataset. The simplest
way of evaluating the performance of a classification model is to calculate the percentage
of correctly classified instances.
Although this measure can tell us whether a model is roughly a good or a bad estimator, it has
several drawbacks, such as relatively low distinctiveness, low discriminability, low bias and
low informativeness with respect to an important part of the data [24]. Other metrics are used to
tell us more about how our model is performing while avoiding ambiguities. In fact, an ambiguous
case may be very simple: imagine a binary classifier that assigns every data sample to
class 1. Then, all the test data labeled with that class will be considered correctly
predicted. If, for example, at least half of the data is labeled with class 1, we will get an
accuracy of more than 0.5!


For such a dull classifier, that is a very good result and may even be better than some imprecise
classifiers. Yet this classifier classifies the instances of a single class correctly, while the
whole other class is misclassified. Imagine a testing device that predicts whether a patient is
affected (1) or not (0) by the COVID-19 virus. Doctors and authorities would prefer a classifier
with a higher percentage of correctly classified positive cases (1) and a lower overall accuracy
score over one with a higher accuracy score but a lower proportion of correctly predicted
positive cases. Because misclassifying truly positive cases can be dangerous, it is more logical
to favour the proportion of true positive cases over a better accuracy score.
The different metrics highlight different characteristics of each classifier. There are different
types of metrics for error and performance calculation; the most common ones are the threshold
discriminator metrics [24]. First, to get an idea of what true/false positives/negatives are, we
define the confusion matrix.
The confusion matrix is an n × n table, where n ≥ 2; for binary classification, n = 2. The idea of
the confusion matrix is to count the occurrences of each class, for both the predicted and the
true outputs, and then to compare the predictions to the actual targets [23].
Let us suppose we have 2 classes Positive(1) and Negative(0). Therefore, we have 4 values:

• True Positives (TP): the number of actual positives that have been correctly classified
(Y = 1 and Y_pred = 1)

• True Negatives (TN): the number of actual negatives that have been correctly classified
(Y = 0 and Y_pred = 0)

• False Positives (FP): the number of actual negatives that have been misclassified as
positives (Y = 0 and Y_pred = 1)

• False Negatives (FN): the number of actual positives that have been misclassified as
negatives (Y = 1 and Y_pred = 0)

While each row in this matrix represents an actual class, each column represents a predicted
class. Therefore, the matrix looks like table 2.1 below:

                   Predicted Positives   Predicted Negatives
Actual Positives           TP                    FN
Actual Negatives           FP                    TN

Table 2.1: Confusion Matrix


After defining each term in the table, we are able to define the following metrics:

• Accuracy (ACC): the ratio of correctly classified instances over the total number:
  ACC = (TP + TN) / (TP + TN + FP + FN)

• Error Rate (ERR): the ratio of incorrectly classified instances over the total number:
  ERR = 1 − ACC = (FP + FN) / (TP + TN + FP + FN)

• Specificity (SP): the ratio of correctly classified negative instances over the total number
  of negatives:
  SP = TN / (TN + FP)

• Precision (P): the ratio of correctly classified positives over all the predicted positives:
  P = TP / (TP + FP)

• Recall (R): the fraction of correctly classified positives over all the actual positives:
  R = TP / (TP + FN)

• F-Measure (FM): this measure seeks a balance between precision and recall, especially for
  data with a large number of actual negatives:
  FM = 2 × (P × R) / (P + R)
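These metrics can be computed directly from the four confusion-matrix counts, as in the small sketch below (the counts themselves are invented for illustration).

```python
# Computing the classification metrics above from hypothetical confusion-matrix counts.
TP, TN, FP, FN = 80, 50, 10, 20

accuracy    = (TP + TN) / (TP + TN + FP + FN)
error_rate  = (FP + FN) / (TP + TN + FP + FN)
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f_measure   = 2 * precision * recall / (precision + recall)

print(accuracy, error_rate, specificity, precision, recall, f_measure)
```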

2.1.2 Regression

Regression is another important supervised learning task, in which the predicted output is a
continuous numerical variable learned from a training set.
Regression is a statistical method that takes random variables X as input in order to build
a mathematical relationship between them and the output variable Y. It is commonly used in
various disciplines such as finance, business and investing [36].
The best-known algorithms are: linear models, multiple regression, nearest neighbours, ridge
regression...

Regression is about, given a training set, learning a function h : X → Y such that h(x) makes a
good prediction of the corresponding output y [1]; h is called the model.
The best-known and most commonly used algorithm is linear regression. For example, predicting
the housing price from the square footage can be modeled by linear regression, as illustrated
in figure 2.2.

Figure 2.2: housing price [36]

The output of any new data point is then predicted by this model (the linear equation in this
case).
In fact, linear regression can involve more than a single input: the price can depend linearly
on the square footage, the distance from downtown, etc. But linear models are certainly not the
only existing models; there are far more complex and varied models to fit the data, and it is
often necessary to go beyond linear models depending on the problem and the variables.
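As a minimal sketch of this idea, the snippet below fits a one-feature linear regression on invented square-footage/price pairs; the fitted line plays the role of the model h.

```python
# One-feature linear regression on toy housing data (all values are invented).
from sklearn.linear_model import LinearRegression

square_footage = [[70], [90], [110], [130], [150]]        # inputs, in m^2
price = [120_000, 150_000, 185_000, 210_000, 245_000]     # corresponding prices

h = LinearRegression().fit(square_footage, price)         # the model h

print(h.predict([[100]]))   # predicted price for a new 100 m^2 house
```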

Performance

Regression is applied to predict continuous output variables Y, such as stock market movements.
Due to their nature, such predictions are very hard and it is impossible to obtain an exact
estimate (zero error) for a new test instance [31].
This leads us to the question: when is a model accurate and how is that performance measured?
Given the diversity of regression models, is it possible to measure the performance of all the
different models in the same way?


What is needed is a performance analysis of different ML algorithms in order to find the one
giving the most precise prediction by minimizing the loss [31]. As good models tend to predict
values that are close to the real values, performance measures rely on the error |Y_predicted − Y|.
The Root Mean Square Error (RMSE) is the standard statistical measure used in regression. Other
measures include the Mean Square Error (MSE), the Mean Absolute Error (MAE), the Relative Root
Mean Square Error (RRMSE), the Normalized Root Mean Square Error (NRMSE) and the coefficient of
determination (R) [27]...
Given a test dataset (X, Y) of n samples and the predicted outputs Ŷ, the most common error
metrics are [27]:

• Mean squared error (MSE): determines how close the points are to the fitted regression line,
  by averaging the squared differences:
  MSE = (1/n) × Σ_i (Ŷ_i − Y_i)²

• Root mean squared error (RMSE): the square root of the MSE:
  RMSE = √( (1/n) × Σ_i (Ŷ_i − Y_i)² )

• Relative root mean squared error (RRMSE): the square root of the sum of squared errors
  divided by the mean of the observed data:
  RRMSE = √( Σ_i (Ŷ_i − Y_i)² ) / ( √n × mean(Y) )

• Mean absolute error (MAE): instead of summing squared distances, it averages the absolute
  errors of the samples:
  MAE = ( Σ_i |Ŷ_i − Y_i| ) / n

Depending on which regression algorithm is chosen, some of these errors are more relevant than
others for evaluating the performance.
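The metrics above translate directly into a few lines of NumPy, as in the sketch below (the true and predicted values are invented for illustration).

```python
# Regression error metrics computed on a toy prediction vector.
import numpy as np

y      = np.array([3.0, 5.0, 2.5, 7.0])    # true outputs Y
y_pred = np.array([2.8, 5.4, 2.0, 6.5])    # predicted outputs Y_hat

mse   = np.mean((y_pred - y) ** 2)
rmse  = np.sqrt(mse)
rrmse = np.sqrt(np.sum((y_pred - y) ** 2)) / (np.sqrt(len(y)) * np.mean(y))
mae   = np.mean(np.abs(y_pred - y))

print(mse, rmse, rrmse, mae)
```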


2.2 Unsupervised machine learning

Unlike supervised learning, unsupervised learning does not rely on labelled feedback from a user.
There is no distinction between test and training data, since all of it is unlabeled [32]. In
general, there are two forms of unsupervised learning: clustering and dimensionality reduction
algorithms.

2.2.1 Clustering

Clustering is a widely used exploratory data analysis technique. It consists in grouping a set of
objects, based on their similarities, into different groups called clusters [32].
However, the main challenge here is the absence of labeled data, so it is difficult to train a
model to predict new data: there is neither labelled training data nor test data.
Another ambiguity lies in the "similarity" and clustering criteria. Similarity is not
mathematically a transitive property: if x1 and x2 are similar, and x2 and x3 are similar, that
does not necessarily mean that x1 and x3 are similar and belong to the same cluster. Otherwise,
we could end up with one cluster containing the whole dataset [32].
So how is similarity measured and which criteria are used?
Considering the set of points represented in figure 2.3, how should the data be separated and
what clusters should be observed?

Figure 2.3: Non clustered sequence [32]

It is not really obvious how the clustering should behave, even if our intuition suggests one
cluster for the upper points and one for the lower points.
Indeed, figure 2.4 presents two completely different clustering results obtained with two
different techniques:

Figure 2.4: Comparison between 2 different Clustering outputs [32]


The explanation here is that the first algorithm emphasized not separating close-by points, which
led to a horizontal separation (an upper cluster and a lower cluster). The other one focused on
not letting two points that are far from each other share the same cluster, which led to a right
cluster and a left one [32].
Even if the clustering produces a satisfactory separation, it does not really characterize what
each cluster is, since there are no labels for the data. To briefly summarize what clustering is:

• Input: let X = {x_1, x_2, ..., x_n} be the data, together with either a distance function
d : X × X → R+ over it, such that d(x, x) = 0, or a symmetric similarity function
s : X × X → [0, 1], such that s(x, x) = 1.

• Output: given that k classes are required, we obtain a partition into k subsets (C_i)_{1≤i≤k};
each subset is a cluster.

Clustering is used in various disciplines (social sciences, biology, computer science) to get a
first intuition about the data by identifying meaningful patterns and groups; for example,
companies cluster their customers based on their similarities for marketing purposes [32].
One of the most common clustering techniques is k-means, sketched below.
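A minimal k-means run with scikit-learn is shown below; the six 2-D points are toy values chosen so that two clusters are obvious.

```python
# Minimal k-means clustering sketch (toy 2-D points, k = 2 clusters).
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]   # unlabelled points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # the two cluster centres
```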

2.2.2 Dimensionality Reduction

Some data have a relatively high dimension in terms of features, such as pictures, which are
represented as pixels where each pixel is a feature on its own [28].
For example, a 640x360-pixel picture requires 640 × 360 = 230,400 features, where each feature
is scaled, for instance, from 0 to 255 to describe its intensity.
You may think that is a lot, but this is only for a black-and-white picture!
If we wanted to describe a coloured picture, each colour would be characterized by its RGB
representation (3 scales: Red, Green and Blue). That makes 3 features per pixel, i.e.
3 × 640 × 360 = 691,200 features!
It is a huge number to handle.
So, working directly with such data, and visualizing, analysing and storing it, is very
challenging in terms of computational performance [28].
With dimensionality reduction, it is possible to map the high-dimensional data into a
low-dimensional structure, providing a more compact representation without losing much
information. It is a sort of compression technique, like compressed multimedia formats
(mp3, jpeg, ...) [28]. Generally, dimensionality reduction is performed through a linear
transformation of the data, which means going from R^d to R^n; that amounts to finding the
transformation matrix W ∈ R^(n×d) [32].
However, it is nearly impossible to recover the original data once the transformation has been
applied. Thus, the old data is unrecoverable.
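Principal Component Analysis (PCA) is one common way of finding such a linear transformation; a minimal sketch with random data is given below (PCA is only used here as an example of dimensionality reduction, it is not part of this project's pipeline).

```python
# Projecting data from d = 50 dimensions down to n = 2 with PCA (random toy data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))       # 100 samples, 50 features each

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)         # compact 2-D representation

print(X_low.shape)                   # (100, 2)
# inverse_transform only approximates the original data: the exact original
# is generally not recoverable, as stated above.
```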


Conclusion

In this chapter, we presented some basic machine learning techniques, focusing on supervised
and unsupervised learning. The next chapter will be dedicated to the progression from machine
learning to deep learning.

Chapter 3

About Deep Learning

Introduction

We cannot define the term deep learning without knowing what neural networks are. Looking
at the brain's architecture, we notice its very high complexity. But how can it inspire
us to build an "intelligent" machine?
That is what introduced the concept of Artificial Neural Networks (ANNs).
ANNs analyse data through layers of neurons, in a concept inspired by the human brain [36].
Being a set of interconnected neurons, an ANN gives each connection a numeric weight. The
structure resembles building a pyramid or a house of cards, as the neurons are progressively
stacked on top of each other [36].
Starting from the raw data at the bottom, each layer of neurons receives information from the
previous one and forms a new set of features, in such a way that the data becomes less abstract
and more specific as the layer level increases [36].
Thanks to their structure, ANNs are able to handle complex ML tasks such as classifying billions
of images (Google Images for example), powering performant speech recognition services such
as Siri, or AlphaGo learning a strategy to beat the Go world champion by simulating
millions of games against itself [23].
We will not describe the ways in which artificial neural networks are similar to human neurons;
we will just focus on the way these artificial neurons function.


3.1 The concept of ANNs

Let us describe a neural network as a mathematical function, with the following components [2]:

• An input layer x

• An arbitrary amount of hidden layers

• An output layer ŷ

• A set of weights W and biases between each layer b

• A choice of activation function for each hidden layer gi

Figure 3.1 shows how a 2-layer neural network works:

Figure 3.1: ANNs structure [2]

We can model the neural network (NN) by a function defined as ŷ = f_NN(x).
This function takes observations as input and produces a decision as output.
For example, a 3-layer neural network corresponds to the mathematical function
ŷ = f_NN(x) = f_3(f_2(f_1(x))), where f_2 and f_3 are vector functions of the following form:

f_j(z) = g_j(W_j z + b_j), j = 2, 3

where W_j is a matrix and b_j is a vector. The activation function g_j is fixed, usually nonlinear,
and chosen by the data analyst in advance, before the learning. The matrix W_j and the vector b_j
of each layer are learned using the gradient descent algorithm [21].


The output of each unit in a layer is a scalar. Let us suppose, for example, that there are 5
neurons in layer N and 2 neurons in layer N+1. Layer N+1 then takes the values:

y_{N+1} = [ g_{N+1}((W_{N+1})_1 · z + (b_{N+1})_1), g_{N+1}((W_{N+1})_2 · z + (b_{N+1})_2) ]^T

where:

• z is the 5-dimensional input vector

• y_{N+1} is the 2-dimensional output vector

• W_{N+1} is a 2 × 5 matrix, (W_{N+1})_1 being its 1st row and (W_{N+1})_2 its 2nd row

• b_{N+1} is a 2-dimensional vector, (b_{N+1})_1 being its 1st component and (b_{N+1})_2 its 2nd.

The final activation function outputs ŷ.
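The sketch below runs a toy forward pass through such a network with NumPy, just to make the shapes of W, b and the layer outputs concrete; the weights are random and the sigmoid is only one possible choice of activation g.

```python
# Toy forward pass matching the notation above (random weights, sigmoid activation).
import numpy as np

def g(z):                          # activation function (sigmoid chosen as an example)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z  = rng.normal(size=5)            # 5-dimensional input coming from layer N
W1 = rng.normal(size=(2, 5))       # layer N+1 has 2 neurons: W_{N+1} is 2x5
b1 = rng.normal(size=2)            # b_{N+1} is 2-dimensional
W2 = rng.normal(size=(1, 2))       # a final output layer with a single neuron
b2 = rng.normal(size=1)

y_next = g(W1 @ z + b1)            # the 2-dimensional vector y_{N+1}
y_hat  = g(W2 @ y_next + b2)       # the final activation outputs y_hat
print(y_next, y_hat)
```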


Tuning the weight and bias parameters is what we call neural network training.
Each iteration of the training process consists of the following steps, as shown in figure 3.2:

Feedforward: Calculate the predicted output ŷ using the current parameters.

Backpropagation: Update the weights and biases while minimizing the loss function.

Figure 3.2: Training process [2]

We talk about deep learning when a neural network has more than two non-output layers.
Large ANNs with a growing number of layers used to be difficult and challenging to train. But as
computing and processing resources have grown rapidly over the last years, deep learning has
taken steps forward and it has become possible to train more and more layers on huge
datasets [21].


3.2 Deep learning vs Machine learning

Given various complex data such as images, videos, etc., deep learning (DL) algorithms are able
to automatically determine and learn representations without the need for human knowledge
or rules. The algorithm only needs raw data and, thanks to the flexibility of its structure, it can
learn from the data and increase its predictive accuracy as more data is provided [3].
So, deep learning is more about teaching a computer how to think and train itself to solve the
problem than about telling it directly how to do it. In this section, we discuss the main differences
between ML and deep learning along several axes: functioning, data dependency, computation
requirements, training and inference time, etc.

• Functioning: While Machine learning receives data as input, analyses it and tries to make
sense of it based on its learning experience, deep learning relies on layer-wise structure
to make intuitive and intelligent decisions using an artificial neural network [4].

• Extraction: The deep learning structure is well suited to extracting meaningful patterns
and features from the data. Its layer-wise network works hierarchically to extract the
important features, resulting in a progressively more abstract data representation. Machine
learning, in contrast, is not efficient at extracting features on its own; instead, it relies on
manually and wisely chosen features, with which it performs well [4].

• Data Dependency: Choosing between ML and DL depends on the data. In fact, there is no
significant difference in performance between the two when dealing with small datasets.
But as the data gets larger, ANN structures take advantage of their network depth to handle
the large amount; deep learning is said to be "data hungry" [4].
Figure 3.3 illustrates this dependency:

Figure 3.3: ML and DL Data dependency [4]

• Computation Power: A traditional ML algorithm only requires a machine with a fairly decent
CPU. Deep learning, however, depends not only on the data size but also on the depth of
the network; a GPU (graphics processing unit) offers many more cores and thus more
computation power [4].


• Training and Inference Time: The advantage of the layered structure does not come without
cost. In fact, training a deep learning algorithm can take hours, days or even months.
As the data size increases, not only the training but also the testing takes more time. The
increasing number of layers also increases the number of parameters in each layer: the
weights. Machine learning, on the other hand, takes relatively less time, but it still has
limitations in terms of the data size it can support [4].

• Problem-solving technique: The machine learning approach must go through different
stages: object recognition, object analysis, etc., before finally applying the algorithm to
the relevant features. However, deep learning just takes the input data along with the
labels and does the rest of the job on its own [4].

• Industry Ready: ML-based products are quite easy to decode and are interpretable in
terms of the parameters used. Deep learning algorithms, however, are a "black box"; even
if they are sometimes capable of surpassing human performance, they are not always
industry-ready in terms of reliability. ML algorithms such as decision trees and random
forests are more likely to be used in banking systems, for example [4].

• The output: The outputs of Machine Learning systems are usually numerical (a score, a
classification), whereas Deep Learning systems can also produce text, speech, etc. [4].

3.3 Classification of ANNs

Since deep learning is based on complex ANNs, it is necessary to study and classify the neural
networks according to their structure and layers.

3.3.1 Multi-Layer Perceptrons

A perceptron is a simple ANN structure based on an artificial neuron called threshold logic unit
(TLU), or a linear threshold unit (LTU) [23].
The inputs and output are numbers, and each input connection is associated with a weight to
form a linear combination:

z = w1 · x1 + w2 · x2 + . . . + wn · xn = xT · w

Then, a step function such as the Heaviside function is applied to z [23].
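As an illustration, here is a minimal NumPy sketch of a TLU; the weights and inputs below are arbitrary values chosen only for the example, not taken from the project.

```python
import numpy as np

def tlu(x, w, b=0.0):
    """Threshold logic unit: linear combination followed by a Heaviside step."""
    z = np.dot(w, x) + b          # z = w1*x1 + ... + wn*xn (+ optional bias)
    return 1 if z >= 0 else 0     # Heaviside step function

# Illustrative values
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.3, -0.1])
print(tlu(x, w))                  # -> 0 or 1 depending on the sign of z
```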


The output is given by the figure 3.4 below:

Figure 3.4: How does a perceptron function [23]

This simple TLU can be used for classification: binary or multiple.


For example, we can classify an article as cheap, moderate or expensive depending on the inputs:
the value of its price (X1), its lifetime (X2) and its quality (X3) (all inputs are real numbers). A
Multi-Layer Perceptron (MLP) is an ANN structure containing [23]:

• one input layer

• one or more TLUs layers (the hidden layers)

• one final layer of TLUs (the output layer)

Figure 3.5: Multi-layer Perceptron structure [23]
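To make the MLP structure described above concrete, the sketch below trains a small MLP on the article-price example with scikit-learn. The samples, class encoding and network size are invented for illustration only; they are not data from the project.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical samples: [price, lifetime, quality] -> 0 = cheap, 1 = moderate, 2 = expensive
X = np.array([[10, 2, 3], [45, 5, 6], [120, 8, 9], [15, 3, 4], [150, 9, 9]])
y = np.array([0, 1, 2, 0, 2])

# One input layer, one hidden layer of 8 units, one output layer (multi-class)
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict([[50, 5, 7]]))   # predicted price class of a new article
```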


3.3.2 Recurrent Neural Networks

RNNs make a good model for problems related to sequence data such as speech recognition
[5].
In order to realize the importance of RNNs, we consider the example of a ball moving in time.
The ball in the initial position is shown in figure 3.6:

Figure 3.6: Initial ball position [5]

With information about the current position only, guessing which way the ball will move is
essentially random. If we consider a sequence of previous positions, we can predict the next
position much more easily, as shown in figure 3.7.

Figure 3.7: the prediction of the next position [6]

Imagine a series of input data. If every input item (each position in the previous example) were
treated by its own neural network, independently from the other items, we would miss the real
purpose of sequential data. So, capturing the relationships across inputs is highly required [6].


In figure 3.8, the separate processing of each input is highlighted:

Figure 3.8: Separate input networks [6]

Recurrent Neural Networks rely on memory, storing what they have learned from previous
inputs. The training includes the currently fed input but also the experience from prior inputs. In
addition to the weights applied to the input, a "hidden" state vector representing the context built
from prior input(s)/output(s) is included in RNNs. The hidden state itself is updated each time a
new input is fed.
Hence, RNNs can learn in a similar way while training and can remember previously learned
experiences while generating outputs [6].
Figure 3.9 below shows how an RNN works on inputs (xi)i by using hidden state vectors to
generate outputs (yi)i:

Figure 3.9: The RNN structure [6]


Consequently, an RNN uses a looping mechanism that acts as a highway allowing information
to flow from one step to the next with the help of a hidden state, which is a representation of the
previous states [5].
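A minimal NumPy sketch of a single recurrent step is given below; the tanh activation, the dimensions and the random weights are illustrative assumptions, not the parameters of a trained network.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: the new hidden state mixes the current input
    with the previous hidden state (the network's 'memory')."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Illustrative dimensions: 3 input features, hidden state of size 4
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                        # initial hidden state
for x_t in rng.normal(size=(5, 3)):    # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)  # h carries context from the previous steps
print(h)
```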

Figure 3.10: The looping mechanism of a RNN [5]

3.3.3 Auto-encoders

In order to reduce the data dimension while preserving the saliency of its features, auto-encoders
are used. Auto-encoders consist in embedding the data {xi}1≤i≤n ⊂ Rd into a lower-dimensional
space via a linear function f : Rd → Rk, with k ≤ d, and another linear function g : Rk → Rd, such
that the difference between g(f(xi)) and xi is minimized:

f(x) = Wf x and g(h) = Wg h, with Wf ∈ Rk×d and Wg ∈ Rd×k

f is called the encoder, while g is called the decoder (see figure 3.11):

Figure 3.11: Auto encoder structure [25]
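The sketch below trains such a linear auto-encoder by plain gradient descent on random data; the dimensions, learning rate and number of iterations are arbitrary choices made only to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 200                  # original dim, reduced dim, number of samples
X = rng.normal(size=(n, d))

W_f = rng.normal(size=(d, k)) * 0.1   # encoder weights,  f: R^d -> R^k
W_g = rng.normal(size=(k, d)) * 0.1   # decoder weights,  g: R^k -> R^d

lr = 0.01
for _ in range(500):                  # minimize ||g(f(x)) - x||^2 by gradient descent
    H = X @ W_f                       # encoded (compressed) representation
    E = H @ W_g - X                   # reconstruction error g(f(x)) - x
    grad_Wg = (H.T @ E) / n
    grad_Wf = (X.T @ (E @ W_g.T)) / n
    W_g -= lr * grad_Wg
    W_f -= lr * grad_Wf

print(np.mean((X @ W_f @ W_g - X) ** 2))   # mean reconstruction error after training
```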


Conclusion

In this chapter, we presented the main concepts behind the construction of an Artificial Neural
Network, which is the structure underlying deep learning techniques. We also highlighted the
most important differences between the ML and DL approaches. Besides, we explained several
structures of neural networks.
The next chapter is dedicated to the learning methods that we chose for dealing with sentiment
analysis.

Chapter 4

Sentiment Analysis by using Machine Learning Techniques

Introduction

In this chapter, we present in detail two well-known Machine Learning techniques that we
will adopt in practice for our problem: sentiment analysis. These techniques are Logistic
Regression and Support Vector Machine.

4.1 Sentiment Analysis

Sentiment Analysis consists in classifying natural language data as positive or negative. In
this particular project, we chose to work with Twitter, since it is a rich source of hundreds of
millions of tweets per day. Besides, it has been the subject of many sentiment analysis studies,
such as predicting the winner of the latest USA elections or the result of the federal elections in
Germany based on a large set of tweets. Twitter has therefore proved its efficiency in reflecting
sentiment. In order to achieve this, we will be using techniques such as Logistic Regression and
Support Vector Machine.

4.2 Logistic Regression

It is important to first clarify that Logistic Regression is a generalized linear classifier.
Linear predictors are vastly used in learning problems due to their ability to learn efficiently in
various cases and their tendency to fit the data in many natural learning problems [32].
In this section, we will introduce the basic concepts of the linear model and of logistic regression,
the types of classification, as well as the parameter tuning for such a model.


4.2.1 What is a linear model?

Given a D-dimensional vector x of input variables and an output variable Y, a linear model is
represented as follows:

Y = fw,b(x) = ⟨w, x⟩ + b

where w is a D-dimensional vector of parameters and b is a real number: the model is
characterized by w and b.
Two different values of the couple (w, b) would give different outcomes for the same input
sample. Considering the nature of linear models and the main aim of machine learning, the
desired model produces the optimal values (w*, b*) that make the best predictions [21].

4.2.2 What is Logistic Regression?

Logistic regression is not exactly a linear model given the definition above. It rather belongs to a
bigger family of models: generalized linear models. Our problem does not require a continuous
output variable: we want to predict whether a tweet's sentiment is positive, negative or neutral,
which is a multi-class classification [21].
Logistic regression is not really a regression algorithm but rather a binary classification model.
Although the desired model requires 3 output classes, we can still rely on logistic regression
[21].
The easiest way to model a logistic regression problem is to give binary values: 0 to negative
and 1 to positive. Let f be a continuous function, applied to the linear model, whose co-domain
is (0, 1) [21].
Since the output variable is binary: if the value returned by the model for an input x is closer
to 0, then we assign a negative label to x; otherwise, the example is labelled as positive (f(xi)
can be read as the probability of having yi = 1). The function f is the standard logistic function,
also known as the sigmoid function, defined as:

f(x) = 1 / (1 + exp(−x))

So, our desired output is given by:

y = gw,b(x) = f(fw,b(x)) = 1 / (1 + exp(−(⟨w, x⟩ + b)))


In figure 4.1, the corresponding curve is illustrated:

Figure 4.1: Sigmoid function curve [21]

Given a good optimization of the (w, b) parameters, the output f(⟨w, xi⟩ + b) can be interpreted
as the probability of xi being positive, and thus gives a suitable binary classification for the
input xi. The most common choice is the threshold t = 0.5 for classification.
Notice that xi is labelled positive if yi = 1, otherwise xi is labelled negative.
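As a small illustration, here is a hedged sketch of this decision rule in Python; the parameter values are arbitrary and not the ones learned in the project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    """g_{w,b}(x) is read as P(y = 1 | x); classify with the usual 0.5 threshold."""
    p = sigmoid(np.dot(w, x) + b)
    return (1 if p >= threshold else 0), p

# Illustrative parameters (not the tuned ones from the project)
w, b = np.array([1.2, -0.7]), 0.1
print(predict(np.array([0.5, 0.2]), w, b))   # -> (label, probability)
```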

4.2.3 Parameter Tuning

The aim now is to determine the optimal values (w*, b*) giving the best results. For a linear
model (y = ⟨w, x⟩ + b), we obtain the best values (w*, b*) by minimizing the mean squared error
(MSE) [21]:

MSE = (1/N) × ∑i=1..N (yi − (⟨w, xi⟩ + b))²

For the logistic regression model, we will instead use the maximization of the likelihood.
Let X = (x1, x2, . . . , xn) be the chosen sample and Y = (y1, y2, . . . , yn). The likelihood function of
our model given the outcome (X, Y) with N independent observations is:

L(w, b) = ∏i=1..N P(w,b)(Yi = yi) = ∏i=1..N (gw,b(xi))^yi × (1 − gw,b(xi))^(1−yi)


In practice, it is more convenient to maximize the log-likelihood instead of the likelihood. The
log-likelihood is defined as:

L = ln(L(w, b) | (X, Y))
  = ∑i=1..N [yi × ln(gw,b(xi)) + (1 − yi) × ln(1 − gw,b(xi))]
  = ∑i=1..N [yi × ln(gw,b(xi) / (1 − gw,b(xi))) + ln(1 − gw,b(xi))]
  = ∑i=1..N [yi × (b + ⟨w, xi⟩) − ln(1 + exp(⟨w, xi⟩ + b))]

Let us calculate the partial derivative of the log-likelihood with respect to each variable:

• dwj(w, b) = ∂L/∂wj (w, b) = ∑i=1..N [yi × (xi)j − (xi)j × exp(⟨w, xi⟩ + b) / (1 + exp(⟨w, xi⟩ + b))],
  where (xi)j is the j-th variable of the sample i, j = 1, ..., D

• db(w, b) = ∂L/∂b (w, b) = ∑i=1..N [yi − exp(⟨w, xi⟩ + b) / (1 + exp(⟨w, xi⟩ + b))]

Given the maximum likelihood method:

dwj(w*, b*) = 0, j = 1, ..., D, and db(w*, b*) = 0

To simplify our notation, we will consider w as a (D+1)-dimensional vector instead, where
w = (w0, w1, . . . , wD) and w0 = b. Then, each data sample x becomes a (D+1)-dimensional vector
x = (1, x1, . . . , xD). With these new notations:

y = 1 / (1 + exp(−⟨w, x⟩))

L = ∑i=1..N [yi × ⟨w, xi⟩ − ln(1 + exp(⟨w, xi⟩))]

dw(w) = ∑i=1..N [xi × (yi − gw(xi))]   (*)

(each xi is a (D+1)-dimensional vector)

Setting dw(w) = 0 in (*), as required by maximum likelihood, yields a system of D+1 equations
that are nonlinear in w. Since the first component of each sample xi is always 1, the first score
equation specifies that [37]:

∑i=1..N yi = ∑i=1..N gw(xi)


This result is equivalent to saying that the number of predicted ones matches the number of
observed ones (and the same holds for the class of zeros) [37]. The other D equations are [21]:

∑i=1..N (xi)j × yi = ∑i=1..N (xi)j × gw(xi), j = 1, . . . , D

It is not possible to solve this optimization problem analytically. Numerical methods such as
Newton-Raphson and gradient descent are required to solve it.

Gradient descendent

Gradient descent, also known as steepest descent, is an optimization technique to minimize
multidimensional smooth convex loss functions (cost functions) of the form [35]:

J : Rn → R, x ↦ J(x)

Given an initial location x0 ∈ Rn, we iterate on the variable x, each iteration t ≥ 0 performing:

xt+1 = xt − αt ∇J(xt)

αt is called the scalar step size for iteration t.
However, it is almost impossible to reach an exact point xt where ∇J(xt) = 0. So, a sufficiently
small value of ||∇J(xt)|| is a fair criterion for stopping the loop. The general method is
implemented by the algorithm below:

Algorithm 1: Gradient descent algorithm

Input: initial point x0, gradient norm tolerance ε
1  Set t = 0
2  while ||∇J(xt)|| > ε do
3      xt+1 = xt − αt × ∇J(xt);
4      t = t + 1;
5  end
6  return: xt
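A direct Python transcription of this algorithm, with a constant step size and a simple convex example, might look like the following sketch (the test function and the values of alpha and eps are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, eps=1e-6, max_iter=10_000):
    """Minimal gradient descent: iterate x <- x - alpha * grad(x)
    until the gradient norm falls below the tolerance eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        x = x - alpha * g
    return x

# Example: J(x) = ||x - c||^2 has gradient 2*(x - c) and its minimum at c
c = np.array([3.0, -1.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0]))   # ~ [3, -1]
```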


We can see how the process evolves on a simple one-dimensional example (n = 1) in figure 4.2:

Figure 4.2: Gradient descent explained in one dimension [7]

However, J is not always a convex function. Minimizing an arbitrary function (not necessarily
convex) might lead us to one of many local minima, unlike the case of a convex function where
a null gradient means a global minimum. A simple example in the figure below clearly
illustrates this case:

Figure 4.3: Local minimum ≠ Global minimum [7]

Back to the step size αt, there are several methods to compute it, which lead to different
convergence behaviours. However, we will not look further into the details.

• Constant step size [35]: αt = α. It may not seem very precise, but it is simple and fast to
compute.

• Exact Line Search [35]: choosing αt such that αt = argminα J(xt − α × ∇J(xt)).

• Decaying step size [35]: instead of performing a line search at each iteration, it is possible
to choose a step size that decreases over the iterations, for example αt = 1/√t.

• Many other criteria exist.


Newton-Raphson method

Gradient descent relies on first-order information only, and its convergence rate is slow. The
main idea of Newton's method is to use second-order information (the Hessian matrix) in
addition to the first-order information (the gradient vector) to solve the minimization of the
objective function [33]. The process is repeated until the updated variable converges.
Let:

J : Rn → R

be a twice-differentiable function.
As for the gradient descent method, Newton's method is iterative and the algorithm stops under
a certain condition.
Let x* be the minimum. In dimension one, an iteration t reads:

xt+1 = xt − J′(xt) / J′′(xt)

More generally:

xt+1 = xt − (∇²J(xt))⁻¹ · ∇J(xt)

We can also introduce a learning rate αt:

xt+1 = xt − αt × (∇²J(xt))⁻¹ · ∇J(xt)

αt , as introduced in the gradient descent method, can be computed by several methods.


Our algorithm is then implemented as shown :

Algorithm 2: Newton Method algorithm

Input: initial point x0, gradient norm tolerance ε
1  Set t = 0
2  while ||∇J(xt)|| > ε do
3      Compute pt := −(∇²J(xt))⁻¹ · ∇J(xt);
4      Compute αt = argminα J(xt + α pt);
5      xt+1 = xt + αt pt;
6      t = t + 1;
7  end
8  return: xt

This method can be expensive in terms of computing time and resources, especially in high
dimensions, as it computes the Hessian matrix at every iteration [33].
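For completeness, here is a minimal Python sketch of Algorithm 2 on a simple quadratic function; the gradient, Hessian and starting point are illustrative, and a fixed unit step is used instead of the line search.

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-8, max_iter=100):
    """Newton-Raphson: rescale the gradient step with the inverse Hessian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        p = np.linalg.solve(hess(x), g)   # solve H p = grad instead of inverting H
        x = x - p
    return x

# Example: J(x) = x1^2 + 5*x2^2, minimized at the origin (one Newton step suffices)
grad = lambda x: np.array([2 * x[0], 10 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 10.0]])
print(newton(grad, hess, x0=[4.0, -2.0]))   # ~ [0, 0]
```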


4.3 Support Vector Machine

Support Vector Machine (SVM) is a machine learning model used to perform linear and
nonlinear classification, regression, outlier detection, etc. SVMs are particularly well suited to
the classification of complex, small or medium-sized datasets.

4.3.1 Hard-margin linear SVM

In order to fully understand the fundamental idea behind SVM, let us take a look at some pic-
tures:

Figure 4.4: Different classifiers of the data set [23]

Figure 4.4 showcases a dataset of iris flowers, where each element of the dataset is represented
by its petal length and petal width.
We are looking for a classifier able to separate the 2 classes of flowers: Iris-Versicolor and Iris-
Setosa. We can clearly observe that the two classes can easily be separated with a straight line
(they are linearly separable). The figure on the left shows the decision boundaries of three
possible linear classifiers. The model whose decision boundary is represented by the dashed
green line is not suitable, since it does not even separate the two classes properly. The other
models work perfectly on this training set, but their decision boundaries are so close to the
instances that they might not perform as well on new instances [23].
However, the figure on the right is another case: the continuous line in the plot represents the
decision boundary of an SVM classifier.
This solid line gives us 2 advantages: a clear visual separation of the two classes, and a fairly
large distance from the closest training instances of the two classes (as far as possible). It is as if
our algorithm were trying to fit the widest possible street between the classes. This method is
called large margin classification. Adding more training instances off the street does not affect
the decision boundary at all: it is fully determined (supported) by the instances circled in the
figure, called support vectors [23].


SVMs are extremely sensitive to feature scaling. We can observe in the left plot of the figure
below how large the vertical scale is compared to the horizontal one; thus, the widest possible
margin is almost horizontal. If we normalize our data by feature scaling, the decision boundary
becomes much better.

Figure 4.5: margin comparison between Scaled/unscaled data [23]

But first, we must understand the theory behind SVM. What is this idea of the margin we were
talking about?
Like Logistic Regression, SVM is a binary classification algorithm. Support Vector Machine is
based on a geometrical approach to classifying the data, relying on concepts like inner products
and projections. This model does not propose a probabilistic view of the data distribution; it is
instead derived from an optimization problem, designing a particular function to optimize
during training, based on geometric intuitions [28].
As encountered in the iris data, the data are separable by a hyperplane (for a vector space of
dimension D, a hyperplane is an affine subspace of dimension D−1 [28]). To put it more simply,
if the dataset has 1 input feature, the hyperplane is a point; if we have 2 features, the hyperplane
is a line; if there are 3 features, the hyperplane is nothing but a plane, and so on. That hyperplane
then separates the 2 classes.
First, we know that a hyperplane is defined as {x : ⟨x, β⟩ + β0 = 0}, where β is a vector normal
to the hyperplane and β0 is a scalar [28].
A sample can be classified as positive or negative depending on its side of the hyperplane. Let:

f : RD → R, x ↦ ⟨x, β⟩ + β0

Our hyperplane is then defined as {x : f(x) = 0}. To classify a sample x, we simply compute
f(x). If f(x) > 0, the sample is classified as +1, otherwise as −1. This is the same as saying that
the point is "above" or "below" the hyperplane [28].
Let us consider a dataset (x, y), where x is the input and y ∈ {−1, 1} is the output, and let
g(x) = f(x) × y for each sample (x, y):

g(xi) = (⟨xi, β⟩ + β0) × yi, i ∈ [1, N]

If g(xi ) > 0, then our prediction is correct [29]. In a linearly separable dataset, a margin r is the
distance of the separating hyperplane to the closest examples in the dataset [28].


Given a training set {(xi, yi), i = 1, ..., N}, we can define the margin of (β, β0) with respect to the
dataset as the minimum of the margins of the training samples [29]:

r = min i=1,..,N (⟨xi, β/||β||⟩ + β0/||β||) × yi, where ||β|| = ||β||2

Besides, our objective is to have all training points further than r from the hyperplane, each
point in the direction corresponding to its label (positive or negative), which means for every
i ∈ [1, . . . , N] [28]:

(⟨xi, β⟩ + β0) × yi ≥ r, i = 1, ..., N

If we consider (β′, β0′) = (αβ, αβ0), then only the magnitude of the margin value changes, but
this does not affect the decision of the classifier in any way (g(x) keeps the same sign).
Intuitively, resorting to a normalisation such as ||β||2 = 1, we might consider [29]:

β ← β/||β||,   β0 ← β0/||β||
Reconsidering all these requirements, we obtain a single optimization problem [28]:

max r,β,β0  r    (1)

subject to (⟨xi, β⟩ + β0) × yi ≥ r, i = 1, ..., N
           ||β|| = 1
This means that we want to maximize the margin r while keeping the data on the correct side of
the hyperplane [28].
A more convenient choice than imposing ||β|| = 1 is to fix the scale such that
(⟨xc, β⟩ + β0) × yc = 1 [28], where (xc, yc) is the sample closest to the hyperplane, on either side
(the distance is the same).
This condition leads to r = 1/||β||. The optimization problem is then described as follows:

max β,β0  1/||β||    (2)

subject to (⟨xi, β⟩ + β0) × yi ≥ 1, i = 1, ..., N


Since maximizing 1/||β|| is the same as minimizing (1/2)||β||², we can transform our problem
into an efficiently solvable one: a maximum-margin optimization problem with a convex
quadratic objective and linear constraints [29]:

min β,β0  (1/2)||β||²    (3)

subject to (⟨xi, β⟩ + β0) × yi ≥ 1, i = 1, ..., N
Optimisation problems (1) and (3) can be proven to be equivalent.

4.3.2 Soft Margin linear SVM

The example we have encountered is definitely the simplest case, the perfectly linearly
separable one. But it is not very realistic, since datasets are generally large and contain some
imperfections. In fact, we may face two main issues, illustrated in figure 4.6 below, regarding
the "street" which is our margin:

Figure 4.6: Main issues regarding linear SVM classification [23]

• We may have at least one extra outlier on the wrong side of the margin.

• The decision boundary may end up very close to the blue dots because of a yellow dot
(probably misclassified) which is much closer to the blue dots. The "street" would then be
so close to the blue dots that a new blue dot would probably be misclassified as yellow,
even if it is much closer to the blue class.

In order to avoid these issues, it is advisable to resort to more flexible models. The objective is
to find the right balance between keeping the "street" as large as possible and limiting the
margin violations (misclassifications), whether they are instances in the middle of the street or
on the wrong side [23].


We can see in this figure 4.7 [23] two extreme cases, one that privileges a wider street, while
the other goes for minimizing the number of misclassifications [23] (C is a parameter in the
SVM algorithm to determine to which extent one of the constraints is taken into consideration):

Figure 4.7: Comparison between both cases [23]

The main challenge is finding a compromise between the 2 constraints. In this situation, the
first classifier seems to generalize better, since it makes fewer misclassifications (most of the
margin violations end up on the correct side of the decision boundary; a thinner street is not as
concerning an issue as a large number of prediction errors) [23].
Theoretically speaking, how can we search for this compromise?
Our goal is still to maximize the margin, but under different circumstances, since some points
are now allowed to lie inside the margin or on its wrong side [37].
We first define the slack variables ξ = (ξ1, ξ2, ..., ξN), which represent the amount of error, if
any, for each instance. The new quadratic problem is described as:
min β,β0  (1/2) × ||β||² + C × ∑ ξi    (4)

subject to (⟨xi, β⟩ + β0) × yi ≥ 1 − ξi, i = 1, ..., N
           ξi ≥ 0
The parameter C > 0 trades off the margin size against how much the slack variables influence
the optimization; C is called the regularization parameter or the cost parameter [28].
The term ||β||² is called the "regularizer". A large value of C means low regularization, giving
the slack variables more weight, which means giving priority to avoiding misclassified instances
[28].
To solve this optimization problem, it is necessary to build the Lagrange (primal) function [37]:

Lp = (1/2) × ||β||² + C × ∑ ξi − ∑ αi × ((⟨xi, β⟩ + β0) × yi − (1 − ξi)) − ∑ µi × ξi


We minimize Lp with respect to β, β0 and ξi; by setting the respective derivatives to 0, we obtain:

β = ∑ (αi × yi) × xi
∑ αi × yi = 0
αi = C − µi, i = 1, ..., N

From Lp, the dual objective function LD is derived; it is maximized over α [37]:

max α  LD = ∑ αi − (1/2) × ∑∑ (αi × αj × yi × yj) × ⟨xi, xj⟩,   α = (α1, ..., αN)    (5)

subject to: ∑ αi × yi = 0
            0 ≤ αi ≤ C, i = 1, ..., N

After obtaining the dual parameters α*, the values of β* and β0* are computed as follows [28]:

β* = ∑ αi* × yi × xi
β0* = yi − ⟨β*, xi⟩ (for any support vector xi such that 0 < αi* < C)

4.3.3 SVM in Python

Figure 4.8 below shows an example of SVM on the iris dataset. In Python's scikit-learn library,
Support Vector Machine classifiers are called SVC [8].

Figure 4.8: SVM implementation on Python [23]

After loading the data, it is scaled (normalized) and a linear SVC classifier is fitted to it. Then, a
new sample is fed to the model to predict its class.
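Since the code of figure 4.8 appears only as an image, here is a hedged sketch of what such a script typically looks like (scaling followed by a linear SVC on two petal features); it follows the same steps but is not necessarily identical to the figure.

```python
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
mask = iris["target"] != 2                     # keep only Iris-Setosa and Iris-Versicolor
X = iris["data"][mask][:, (2, 3)]              # petal length, petal width
y = iris["target"][mask]

svm_clf = Pipeline([
    ("scaler", StandardScaler()),              # SVMs are sensitive to feature scaling
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
print(svm_clf.predict([[2.5, 0.8]]))           # predicted class of a new flower
```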


4.3.4 Non Linear SVM

Linear SVM classifiers are efficient in many cases. However, datasets are not always linearly
separable; sometimes it is simply impossible to find a hyperplane that produces satisfying
results. It is possible to handle non-linearly separable datasets by adding more features, such as
polynomial features [23].
In the example of figure 4.9 below, our 1-dimensional dataset is not separable (we cannot find a
single point (threshold) capable of separating the 2 classes).
Briefly, we add a new feature x2 = (x1)². Now that we have 2 features, the plot of the new data
looks perfectly linearly separable.

Figure 4.9: Effect of adding features on linear separability [23]

Considering a set of new features φ(x) (the functions φ can be nonlinear) gives us the flexibility
to construct SVM classifiers that are nonlinear in the (xi)1≤i≤N [28]. This method lets us achieve
better training-class separation and obtain nonlinear boundaries in the original space [37]. Let

φ(x) = [h1(x), ..., hM(x)], with x ∈ RD initially.


Rather than manipulating this nonlinear feature map φ(·) explicitly, a similarity function k(·, ·) is
defined as [28]:

k(xi, xj) = ⟨φ(xi), φ(xj)⟩

The functions k are called kernels, and the matrix K resulting from their inner products is called
the Gram matrix or kernel matrix. The new prediction function then becomes:

f(x) = ⟨φ(x), β⟩ + β0, β ∈ RM

There are various types of kernels. Some of them are represented in figure 4.10 with their
separation boundaries:

Figure 4.10: Different kernels representation [28]
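As an illustration of the kernel idea, the sketch below fits an RBF-kernel SVM on a toy two-moons dataset with scikit-learn; the dataset, kernel and hyperparameter values (gamma, C) are arbitrary choices for the example, not the settings used later in the project.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A small dataset that no straight line can separate
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# The RBF (Gaussian) kernel computes k(xi, xj) implicitly, without building phi(x)
rbf_svm = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", gamma=2, C=1)),
])
rbf_svm.fit(X, y)
print(rbf_svm.score(X, y))   # training accuracy with a nonlinear boundary
```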

The mathematical details about kernels and nonlinear SVMs are a bit involved. Interested
readers are invited to consult more details about them, and about the previous optimization
results, in the references [35], [28], [37] and [29].

Conclusion

Each machine learning model is obviously complex to implement in full detail. But this chapter
should give an overview of the fundamentals of how each Machine Learning algorithm we use
works.

Chapter 5

Natural Language Processing

Introduction

As mentioned in the previous chapter, our problem relies on sentiment data provided by the
social network Twitter. Thus, our inputs are in natural language and require specific processing
through different tasks: tokenizing, word normalization, lemmatization and stemming, and
sentence segmentation.

5.1 Natural Language Processing

It is clear that Machine Learning has become widely used in real-life problems, and its use has
grown with the increasing amount of data and computing resources. One of its most popular
and distinctive application areas is Natural Language Processing (NLP).
Natural language refers to the language of everyday use, the one we use for communication:
Arabic, English, French and literally any spoken or written language. Through the years, natural
language has evolved and, unlike programming languages and mathematical notations, it is
hard to pin down with explicit rules [7].
In this context, Natural Language Processing has emerged as the technology that helps
computers understand human language. NLP is basically a branch of AI dealing with the
interaction between computers and humans in order to make sense of natural language. NLP
techniques rely mostly on machine learning [33].
These techniques are applied at very different levels. On one hand, it can be as simple as
counting words or frequencies. On the other hand, NLP is capable of understanding human
sentences and utterances, and of giving useful answers to highly complex structured phrases [7].


So, Natural Language Processing is a set of algorithms to identify and extract the rules of human
language and convert text into a form understandable by a computer [33]. Text normalization
relies mainly on a few basic tasks: tokenizing, word normalization, lemmatization and stemming,
and sentence segmentation. These tasks are detailed in the following sections.

5.2 Tokenizing

Tokenizing is the task of segmenting running text into words/terms. The set of distinct words
(types) of the text (tweets) is noted V, and the vocabulary size |V| refers to the number of unique
words; tokens are the total number N of running words, including repetitions [37]. Punctuation,
numbers, web links, special characters, etc. are usually removed, but they can also be kept for
the processing; it highly depends on the context and on the philosophy of the algorithm.
Tokenization is therefore essential to facilitate the use of deterministic algorithms. Keeping and
extracting the relevant parts of the text is usually based on regular expressions [37]. Python
provides a package for regular expressions that is very helpful for dealing with text and keeping
the needed parts of it.
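A tiny sketch of regex-based cleaning and tokenization with Python's re module is given below; the patterns and the sample tweet are purely illustrative, not the exact rules used later in the project.

```python
import re

tweet = "OMG I love this song!!! check it http://t.co/xyz #music @friend"

# Illustrative pattern: drop links and mentions, keep word characters and hashtags
tweet = re.sub(r"https?://\S+|@\w+", "", tweet)    # remove URLs and @mentions
tokens = re.findall(r"#?\w+", tweet.lower())       # split the rest into tokens

print(tokens)   # ['omg', 'i', 'love', 'this', 'song', 'check', 'it', '#music']
```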

5.3 Word Normalization, Lemmatization and Stemming

Word normalization consists in grouping words/tokens that have different forms into a
normalized format, such as USA/US or uh-huh/uhhuh. Normalization loses track of spelling
variants, but it can be helpful in obtaining a normalized and smaller set of tokens [37]. Another
form of normalization is to lower-case all letters, so that the same word at the beginning or in
any other part of a sentence is represented identically. However, this is sometimes not such a
good idea, because it causes an important information loss, such as the difference between the
pronoun us and US referring to the United States [37].
Lemmatization and stemming are about assigning the same root to similar words. For example,
am, are, is, be, . . . or table, tables, . . . belong to the same set.
Both processes are often confused and merged into one, but they differ in how they transform
the original token/word. Stemming chops off the end of the word (the derivational affixes) in the
hope of obtaining the desired root. Lemmatization handles things more properly, by analysing
the vocabulary and the morphology of the words and returning the base or dictionary form of the
word, called the lemma.
Example: stemming may transform "saw" into "s", whereas lemmatization transforms it into
"see" [34].
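A quick way to see the difference in Python is with nltk's Porter stemmer and WordNet lemmatizer, as in the sketch below; the word list is illustrative and the WordNet corpus must have been downloaded once with nltk.download('wordnet').

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["tables", "studies", "saw"]:
    print(word,
          "-> stem:", stemmer.stem(word),                    # crude suffix chopping
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))   # dictionary form (as a verb)
```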


5.4 Sentence segmentation

Sentence segmentation is the process of dividing a text into individual sentences. Punctuation
is often the key to determining where a sentence ends. Generally, a sentence-ending punctuation
mark decides the end of a sentence, and therefore it must not be grouped with a sentence token
(for example, the period "." in Mr. or U.S.A. is not a sentence-ending punctuation mark). Such
marks must be chosen carefully.
However, these steps alone do not let us directly apply machine learning algorithms: it is
important to vectorize the text after these steps.
In this project, we will use the Bag-of-Words (BoW) model. It is the simplest representation for
text feature extraction. BoW comes from the intuition that documents are similar when they have
similar content. It extracts features from the text data by describing the occurrence/frequency
scores of words/terms. The term bag refers to the lack of interest in the order or structure of the
words; grammar does not intervene [9].
For each vectorizer, a maximum number of features (terms) is assigned. For example, if the
maximum number of terms is K, the vectorizer only considers the K most used terms in the
dataset. The other ones are ignored and are not considered during training or testing.
We keep using "terms" instead of "words" because one approach consists in considering a
vocabulary of word groups called N-grams. An N-gram is a sequence of N adjacent words of
the text [9]. Logically, working with 1-grams simply assigns one term to each word.
Then, a matrix is created from those tokens and the dataset. It has n rows (the number of
instances) and K columns (the number of tokens). The position (i, j) of the matrix (1 ≤ i ≤ n,
1 ≤ j ≤ K) holds what the vectorizer assigns to term j for sentence i; in other words, it is filled
with the frequency, or a customized score, of that term in that sentence. We will see the
difference between the scores of the two vectorizers. Each instance is therefore represented as a
K-sized row vector.
So, we are able to extract a numerical representation of the text data. The 2 most used
vectorizers, which will be considered in this project, are the Count Vectorizer and the TFIDF
Vectorizer.

Count Vectorizer

The score calculated by the Count Vectorizer is simply the frequency of each term in each
instance. A simple example clarifies the idea.
Let us consider these 3 sentences representing our data: ["I love my dog who escaped", "My
dog is black", "My dog escaped"]


The uni-grams list is simply the list of all existing unique words: ["my", "dog", "I", "escaped",
"who", "love", "is", "black"].
It is the standard version of BoW.
The bi-grams are ["I love", "love my", "my dog", "dog who", "who escaped", "dog is", "is black",
"dog escaped"]
The tri-grams are ["I love my", "love my dog", "my dog who", "dog who escaped", "My dog is",
"dog is black", "My dog escaped"]
The matrix representing the data for uni-grams is:

[ 1 1 1 1 1 1 0 0 ]
[ 1 1 0 0 0 0 1 1 ]
[ 1 1 0 1 0 0 0 0 ]

The same procedure goes with bi-grams and tri-grams by changing the number of columns ac-
cording to the number of terms and calculating the occurrences again.
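The same construction can be reproduced with scikit-learn's CountVectorizer, as in the sketch below; note that its default tokenizer ignores one-letter words such as "I", so the resulting vocabulary and matrix differ slightly from the hand-built example above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love my dog who escaped", "My dog is black", "My dog escaped"]

# Uni-grams (standard BoW); ngram_range=(2, 2) would produce the bi-grams instead
vectorizer = CountVectorizer(ngram_range=(1, 1), lowercase=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one row of counts per sentence
```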

TFIDF Vectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is in fact the product of 2
terms, TF and IDF, such that:

• TFt,d = count(t, d)
  TF is the count of the term t in the document (tweet) d. This term can be damped to avoid
  large numbers by applying the base-10 logarithm to it. To avoid applying log10 to 0, we
  can consider the following formula:
  TFt,d = log10(count(t, d) + 1) [26].

• IDFt = log10(N / dft)
  N is the total number of documents and dft is the number of documents in which the term
  t occurs. This term gives more importance to terms that are not frequently used, because
  they tend to differentiate documents from each other and can thus matter more when
  classifying the corresponding documents [26].

To get a clearer idea of this process, let us consider this example of a set of sentences using
uni-grams again:
{1:"I ate my cake and I slept", 2:"My mother cooked this cake", 3:"You ate the cake I cooked"}
Our words are: {1: "I", 2: "ate", 3: "my", 4: "cake", 5: "mother", 6: "cooked", 7: "this", 8: "you",
9: "the", 10: "and", 11: "slept"}


Let us suppose we do not use the logarithm for TF, because our example data is quite small.
Below is the computation of 2 positions chosen randomly:

TFIDF1,1 = TF1,1 × IDF1 = 2 × log10(3/2) ≈ 0.352
TFIDF2,3 = TF2,3 × IDF2 = 1 × log10(3/2) ≈ 0.176

The same computation is done for each TFIDFt,d, 1 ≤ t ≤ 11 and 1 ≤ d ≤ 3, in order to obtain the
representation matrix (TFIDFt,d)t,d.
If N = dft, then IDFt = log10(1) = 0, which means that the term does not really provide
supplementary information to differentiate the documents.
The rest of the process is the same as for the Count Vectorizer: the data can be tokenized with
uni-grams, bi-grams, tri-grams, etc. The only difference is the value given for each term
(calculated by TF-IDF instead of the simple frequency).
Note that for larger datasets, the number of terms gets a lot larger. It would not be interesting to
take all the available terms into consideration; it is more convenient to set a maximum number
of features.
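The same toy corpus can be vectorized with scikit-learn's TfidfVectorizer, as sketched below; scikit-learn uses a smoothed IDF and L2-normalizes each row, so the values will not match the hand computation above exactly, but the principle is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I ate my cake and I slept",
        "My mother cooked this cake",
        "You ate the cake I cooked"]

vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=1000)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the retained terms
print(X.toarray().round(3))                 # TF-IDF score of each term in each sentence
```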

Conclusion

We have covered in this chapter several concepts and applications of Natural Language
Processing. Although many resources are only available for other languages, we still found
some interesting tools for our project.

Chapter 6

Implementation and Simulation Results

Introduction

After introducing the general Machine Learning fundamentals and the most common
techniques, and after highlighting the principles of Natural Language Processing, we can start
explaining the process used to solve our problem. We will present the whole procedure of a
Machine Learning project, from getting the data to analyzing and interpreting the results.
In this chapter, the whole procedure was coded with the multi-platform and multi-paradigm
programming language Python, version 3.7, on Jupyter Notebook. The different code blocks are
illustrated and followed by a description of each step. Each of the following sections highlights
a fundamental step of the developed learning procedure. Thus, the order of these steps is crucial,
as explained in these sections.

6.1 Data Pre-processing

The first step is to analyze the structure of our Dataset and the different features. The training
dataset contains :

1. Sentiment : 0 for negative sentiment, 4 for positive sentiment

2. The tweet’s ID

3. The date and time of the Tweet

4. The presence or absence of a query

5. The nickname of the user who tweeted

6. The content(text) of the tweet


To import our database, we use the Pandas library whose documentation is provided in [10].
Then, we read the file "training.1600000.processed.noemoticon.csv" in the work directory.
The dataset contains 1 600 000 samples. The first 50% of the training dataset is labeled as
negative, while the other 50% is labeled as positive. We can assign to each feature (column) its
corresponding name. Then, we can check the first 5 rows of our data.

In the following code block, we can clearly notice that only the sentiment label and the text
label are essential. We can get rid of the rest of the features. Then, to simplify the dataset, we
will replace the positive sentiment label 4 by 1.

6.1.1 HTML Encoding

We notice that some tweets contain a few odd character sequences such as "amp" and "quot"...
These are leftovers of the HTML encoding of the imported tweets and need to be removed. We
will use the BeautifulSoup class from the bs4 library to clean the data from this encoding.
Beautiful Soup is a library that allows us to deal with information from web pages [11].


In the following code block, we show an example of the original version of one tweet, and the
new treated version of it.

6.1.2 Mentions and hashtags

Some tweets contain mentions of other Twitter users (@user) or hashtags. We will delete the
mention signs and the names of the mentioned users, as they have no importance or impact on
the sentiment analysis process, using the re module.
However, the content of hashtags can carry important information, so we will keep it and just
get rid of the hashtag symbol.
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. It can
be used to check whether a string contains the specified search pattern or not [12].


6.1.3 URLS

The URLs in the tweets have no influence on the sentiment analysis process either. We will
delete the links, again using the re module.

6.1.4 cp1252 encoding

Some tweets contain the code sequence shown in the code block below. It is the cp1252
encoding of the replacement character, a symbol inserted to replace a non-recognized or
inappropriate character. We will replace it with a '?' character.

6.1.5 Data Clean

Let us apply the data cleaning functions and merge them into one. However, more changes
need to be done on the cleaning functions.

• We can see that negation words lose their meaning when we remove the special characters,
including the apostrophe. Example: Can't becomes can t.
So, the word "can" may give a positive impression while it should have done the opposite.
Therefore, we should handle such ambiguous cases.

• Some website addresses do not always include an "http" prefix; some addresses published
in the tweets look like "www.google.com". We should take that into consideration as well.
Moreover, the method used so far only detects letters, numbers, periods and slashes; other
special characters such as "=", "_", etc. are not detected and should also be deleted.


• Another minor detail should be considered regarding the mentions of Twitter IDs: some
IDs may contain the character "_", so the pattern '@[A-Za-z0-9]+' will not be enough.

We will use the WordPunctTokenizer class from nltk.tokenize, part of the Natural Language
Toolkit (nltk) library, which is built to work with human language. More about the library is
detailed in the documentation [13].
After defining the cleaning function as shown in the following code blocks, we will run the
treatment on the first 100 samples and check 6 of them, so as not to overload the screen.
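Since the cleaning function itself only appears as a figure, the following is a hedged sketch of what it may look like, combining the steps described above (HTML decoding, mentions, URLs, negations, special characters); the exact patterns and the list of negations are illustrative, not the project's exact code.

```python
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
negations = {"can't": "can not", "won't": "will not", "n't": " not"}   # illustrative subset

def clean_tweet(text):
    text = BeautifulSoup(text, "html.parser").get_text()       # strip leftover HTML encoding
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)                 # drop mentions (IDs may contain "_")
    text = re.sub(r"https?://\S+|www\.\S+", "", text)          # drop links, with or without http
    text = text.lower()
    for pattern, replacement in negations.items():             # keep the meaning of negations
        text = text.replace(pattern, replacement)
    text = re.sub(r"[^a-z]", " ", text)                        # keep letters only
    return " ".join(tokenizer.tokenize(text))

print(clean_tweet("@friend_1 I can't wait!!! www.example.com &amp; #happy"))
```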


In the following code block, we can see that the function performs the data cleaning on these
six examples as anticipated.

Now, let us apply this data cleaning function to our whole training dataset. Since the dataset is
very large and takes a long processing time, we will make checkpoints every 50 000 samples.
We only show the verification for the first 750 000 tweets, so as not to overload the screen.

We can then check at the end of the processing that all the tweets have been successfully treated.


Then, we will convert the cleaned data with the corresponding labels into a .csv file.

Some tweets are empty after the data pre-processing because they originally contained only
URL links or mentions, or they were just empty text from the beginning.
We will simply delete these instances, as we have no interest in keeping empty tweets: they
would just slow down the learning.


6.2 Data visualisation and data Analysis

To get an overall idea of the data we are treating and of the most frequent words we will
encounter, it is best to visualize them in a simple way.
We will start with the positive labels, then the negative ones. The library used is wordcloud; its
documentation is given in [14].

Figure 6.1: Most frequent words(positive labels) using WordCloud


Figure 6.1 shows many of the frequent words. The word cloud may not display exactly the top
most frequently used words, but it shows some of the words frequently encountered in the
positive tweets.
To make sure we check the most frequently used words, we will place the words in a dictionary
as keys with their frequencies as values, and then plot the number of occurrences of the top 20
words. An example of the obtained results is shown in figure 6.2.

Figure 6.2: Frequency of the most frequent words for positive labels

We will proceed with the same steps for the negative-labeled tweets.


This is the figure 6.3 obtained by wordcloud for the negative-labeled tweets.

Figure 6.3: Most frequent words(negative labels) using WordCloud

These are the frequencies of the most frequent words in negative tweets in the figure 6.4 below.

Figure 6.4: Frequency of the most frequent words for negative labels


It appears that the most frequent words are not related to whether the text is classified as
positive or negative. They are words used in any kind of speech: they are called stop words.
We will deal with stop words later; for now, we will treat the data as it is. For analysis purposes,
we will calculate the frequency of each word over the whole dataset and store the results in a
.csv file. Then, we will check the first 10 rows of this dataset (the 10 most frequent words) in
figure 6.5 below.

Figure 6.5: Frequency of Most frequent words


Then, we will merge the frequency data of the 3 cases into one data frame and save it in our
work directory for further use.


Now, it would be interesting to check the overall curve of the most used words for a greater
number of words (500).
For that, we plot in figure 6.6 below the first 500 ranks on the x-axis and the frequencies of the
corresponding words on the y-axis.

Figure 6.6: Most 500 frequent words

The curve looks a lot like a decaying exponential function.

In the context of Natural Language Processing, Zipf's law is a discrete probability distribution
that estimates the probability of encountering a word in a given set of words (corpus) [15].
The estimate is given by f(r) = A × 1/r^s, where r is the rank (in terms of frequency) of the word
in the given corpus.
A is just a proportionality parameter (it should be the frequency of the first-ranked word) [15],
and s = 1 in general, or close to 1.


Let us plot the Zipf's law distribution on the same graph 6.7 together with the data distribution.

Figure 6.7: Verification of the Near-Zipf distribution theory

It is noticeable that most words do not follow the distribution exactly: the curve is a "near-Zipf"
one. The cause may be the nature of the dataset (tweets), or the stop words being used very
often.
But what if the stop words are removed from the data? Will it give more significant and clearer
results?
For the following part, we will use CountVectorizer; its documentation is found in reference
[16]. It will let us choose whether to count the stop words or not.


As a reminder, the set of distinct words (types) of the text (tweets) is noted V, and the vocabulary
size |V| refers to the number of unique words; tokens are the total number N of running words,
including repetitions [26].
countvectorizer.fit() goes through all the words of all the sentences and builds the vocabulary
list, according to the CountVectorizer parameters.
Since the maximum number of features is 10 000 and the stop words are set to 'english', it builds
a list of the m = 10 000 most frequent words, excluding the stop words. Then, with
countvectorizer.transform(), we get an (n × m) matrix, where n = 1 596 041 is the number of
tweets.
Each row represents a tweet and each column corresponds to one word (type) of the list.
So, the value at row i and column j is the number of occurrences of word j (rank j in the previous
list) in tweet i.
We can verify the dimensions of the list and of the obtained matrix.
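A compressed sketch of these two calls is shown below on a few placeholder sentences; in the project, clean_text would be the full column of cleaned tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder for the cleaned tweets; in the project this is the whole cleaned column
clean_text = ["i love this movie", "i hate rainy days", "what a great day"]

cvec = CountVectorizer(max_features=10000, stop_words="english")
cvec.fit(clean_text)                        # builds the vocabulary (at most 10 000 types)
term_matrix = cvec.transform(clean_text)    # sparse (n_tweets x m) matrix of counts

print(cvec.get_feature_names_out())         # the retained types (stop words removed)
print(term_matrix.toarray())                # occurrences of each type in each tweet
```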

We will now check the frequencies of the words in each set of examples (a negative-labeled set
and a positive-labeled one) after removing the stop words.
We will start with the negative tokens.


The same process is repeated for the positive tokens.

When we piece together the negative, positive and total frequencies of the tokens in one data
frame, we can see that the most frequent words are no longer the same, because the stop words
have been eliminated by the vectorizer. In the generated table shown below, the words are
sorted by their total frequencies.


Let us now plot, in figures 6.8 and 6.9 respectively, the 50 most frequent positive-labeled words
and then the 50 most frequent negative-labeled words.

Figure 6.8: Frequency of the most 50 frequent words(positive labels) - Stop words removed


Figure 6.9: Frequency of the most 50 frequent words(negative labels) - Stop words removed

It is true that we got rid of a lot of the stop words, but there are still "neutral" words, such as day
or did, that do not really refer to a positive or negative sentiment. In fact, these words appear in
both classes with nearly equal frequencies. So, how can we consider that such a word contributes
to the sentiment classification?
We could filter our words (the 10 000 tokens) and keep only those whose ratio
max(positive_frequency, negative_frequency) / total_frequency
is above a certain threshold (necessarily > 0.5), but then we would lose a lot of tokens and it
would harm the training.
We will keep all the 10 000 tokens for the moment.


6.3 Train/Test process

First, the dataset must be divided into train and test sets. However, training on the whole
dataset and then evaluating the model on a single test set may not be very effective, and
comparing model performances can be a complicated task.
Given a dataset, one approach is called k-fold cross-validation: we take our training dataset and
randomly divide it into k folds (groups). At each iteration, one of the folds is used as the test set,
while the rest of the data is used to train the model and fit its parameters. The process is iterated,
changing the test fold each time [32].
Then, the average performance of each model is used as the criterion to select the most
performant model [17].
The other validation approach is the Train-Validation-Test split. The whole dataset is split into
3 parts. The first (the largest one) is used for training, the validation set is used for model
selection, and the last set is used to test the performance of the selected model.
The different models can be the same from a general point of view, but their hyperparameters
differ; so they are technically not the same and behave differently [17].
In fact, a hyperparameter is different from a (model) parameter. Model parameters are internal
to the model and are estimated from the data; therefore, they are saved as part of the learned
model. We often obtain the values of the parameters by solving optimization/minimization
problems [18].
Hyperparameters, however, are a configuration external to the model that does not result from
data estimation. Being manually specified by the practitioner, they are tuned for a given
predictive modeling problem. We cannot know their optimal values in advance, so we tune the
hyperparameters and try several combinations of the model in order to discover the
configuration resulting in the best performance [18].
We will use the train/validation/test approach for our problem. Since our data contains more
than 1 500 000 samples, 1% of the data for validation and another 1% for test is enough to give
significant results.
After splitting our data, we will create a .csv file for each split part.
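A hedged sketch of this split with scikit-learn is shown below; the file name "clean_tweets.csv" and the column names "text" and "sentiment" are hypothetical placeholders for the cleaned data saved earlier.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("clean_tweets.csv")          # hypothetical name of the cleaned file
x, y = df["text"], df["sentiment"]

# 1% of the data for test, then 1% of the original size for validation (~98% train)
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.01, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.01 / 0.99,
                                                  random_state=42)
print(len(x_train), len(x_val), len(x_test))
```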


The 3 sets are saved into 3 .csv files, as shown in the code below, in order to keep the same split
for the rest of the project and give meaning to the comparison results. A change in the split
(running the split again) may change the results/performance of the models, but only slightly, as
long as the split stays random.
After obtaining the 3 split data frames, we verify the percentage of positive/negative labels and
check whether they are close to 50%, to make sure the 3 sets keep the properties of the original
dataset.

The frequencies for each set are close to 0.5, so the split is fair enough to keep.
As explained before, text processing requires a numerical representation of the data to feed to
the models, because ML algorithms cannot work with raw text data. This process is called
feature extraction; we will use the Bag-of-Words (BoW) model for it.
We will first define 2 functions that train different models and calculate their accuracy in order
to compare them. This is made possible by using Pipelines.
Making a pipeline means sequentially applying a list of transforms and a final estimator, merged
into one object, and hence automating the machine learning process. It is extremely useful for
grouping a sequence of data processing steps.
Pipelining is used to transform features by normalizing them, converting text into vectors,
filling in missing data, etc.
It is also meant to predict variables by fitting algorithms (estimators). So, in a pipeline, the list of
transformers is first applied sequentially, then comes the final estimator [9].
Documentation [19] is provided for further explanation.
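As a sketch (assuming the x_train/x_val splits from the previous example), a pipeline chaining a Count Vectorizer and a Logistic Regression classifier may look like this; the hyperparameter values are illustrative defaults, not the tuned ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Vectorizer + classifier chained into a single estimator
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(max_features=10000, ngram_range=(1, 1))),
    ("classifier", LogisticRegression(max_iter=1000)),
])

pipeline.fit(x_train, y_train)      # raw text goes in: it is vectorized, then the model is fitted
y_pred = pipeline.predict(x_val)    # the model is evaluated on the validation set
print(accuracy_score(y_val, y_pred))
```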


The function below calculates the model accuracy given a pre-defined pipeline, a training set
and a validation set. The model is evaluated on the validation set.

The next function creates different pipelines by tuning the hyperparameters: the maximum
number of words, keeping stop words or not, the Machine Learning algorithm applied to the
numerical data (Support Vector Machines or Logistic Regression) and the type of vectorizer
used: Count Vectorizer or TFIDF Vectorizer.


6.3.1 Count Vectorizer

Uni-grams

We will compare the performances of logistic regression with a Count Vectorizer in three
settings: removing the stop words from the tokens, removing a number of the most frequent
words, and keeping all the tokens; each with a maximum number of tokens ranging from 1000
to 10000. The figure below shows the beginning of the results of the first model used, with the
computation time. Each call to n_feature_tuner runs a pipeline, with the parameters explained in
the comments, for 10 possible values of the maximum number of words ranging from 10000 to
100000 words.
Given that we have 5 pipelines, the training was executed 50 times, each run producing an
accuracy score. We only show the first 4 processing results to avoid overloading the screen.


We have tried different versions of logistic regression applied to uni-gram tokens. The
differences lie in the vectorizer parameters (number of tokens and stop words used).
Let us plot the different accuracy curves and compare them. Each curve in figure 6.10 presents
the 10 accuracy results obtained by n_feature_tuner.

Figure 6.10: Accuracy comparison for Logistic Regression models, uni-grams - Count
Vectorizer

It may not seem intuitive, but providing stop words to the vectorizer, and increasing their number, only lowered the accuracy of the model. Surprisingly, the best performance was obtained by dropping none of the words and keeping the standard word tokenizer.


Now, we will train the data with a different algorithm: Support Vector Machines. However, we will not pass stop_words to the tokenizer, since removing them has not proven effective in training.
We will tune the hyperparameter C, which is the regularization parameter (the smaller C is, the wider the margin), and the penalty, 'l1' or 'l2', which controls whether the optimization objective penalizes a linear sum or a squared sum of the slack variables [38].
Then, we will compare the performances in terms of accuracy by plotting the accuracy scores obtained for the different classifiers. As we have chosen not to use stop words and to keep all the words, we tune the hyperparameters of the SVM classifier instead of those of the vectorizer.
The figure shows the 8 different combinations of parameters, each computed for 10 values of the maximum number of features. Therefore, we have 80 processing results and accuracy scores.
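The sketch below shows how such a grid could be built with scikit-learn's LinearSVC. Note that, in LinearSVC, the L1/L2 treatment of the slack variables discussed in [38] corresponds to the loss parameter ('hinge' vs 'squared_hinge'); the particular C values are assumptions, and the loop reuses n_feature_tuner and the data sets from the earlier sketches.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Sketch: one pipeline per (C, loss) combination, each evaluated for several
# values of max_features through n_feature_tuner (defined earlier).
svm_results = {}
for C in (0.1, 1, 5, 10):                       # assumed grid of C values
    for loss in ("hinge", "squared_hinge"):     # L1 vs L2 treatment of the slack variables
        name = f"SVM C={C}, loss={loss}"
        classifier = LinearSVC(C=C, loss=loss, max_iter=5000)
        svm_results[name] = n_feature_tuner(
            CountVectorizer, classifier,
            X_train, y_train, X_val, y_val,
        )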


SVM models with the "l2" penalty take much longer to compute. So, we will only keep those with "l1" and ignore the other half: SV5, SV6, SV7 and SV8. Now, we will plot the remaining curves (40 accuracy scores in total) in figure 6.11 to compare the algorithms.

Figure 6.11: Accuracy comparison for SVM models, uni-grams - Count Vectorizer


We can say that the Support Vector Machines classifier with C=5 has the best performance. However, unlike Logistic Regression, its performance does not keep improving as the maximum number of features grows.
The optimal performance is obtained for about 2000-3000 tokens, which is probably too low, considering that new samples in the validation or test set will contain many more words that have not been vectorized.
Figure 6.12 below compares the selected SVM and the selected Logistic Regression model. As expected, the Logistic Regression performance is higher.

Figure 6.12: Accuracy comparison for the chosen SVM and Logistic Regression models,
uni-grams - Count vectorizer


Bi-grams and Tri-grams

Now, we will apply Logistic Regression to text pre-processed into bi-grams, again using the Count Vectorizer. The classifier used is the standard Logistic Regression (the one which gave us the best performance for uni-grams).
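Switching to bi-grams only requires changing the vectorizer's ngram_range, as in this sketch (whether the project kept pure bi-grams or mixed uni- and bi-grams is an assumption here):

from sklearn.linear_model import LogisticRegression

# Bi-grams: tokens are pairs of consecutive words.
# ngram_range=(1, 2) would keep both uni-grams and bi-grams;
# here we use bi-grams alone (an assumption).
bigram_results = n_feature_tuner(
    CountVectorizer, LogisticRegression(max_iter=1000),
    X_train, y_train, X_val, y_val,
    ngram_range=(2, 2),
)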


Let us plot in figure 6.13 the accuracy curves for Logistic Regression in both cases: uni-grams and bi-grams.

Figure 6.13: Accuracy comparison (uni-grams and bi-grams) for Logistic Regression model -
Count Vectorizer

Now, the same process is repeated with tri-grams, and a final comparison of the three n-gram cases is illustrated.


Figure 6.14 shows the accuracy curves for uni-grams, bi-grams and tri-grams, all with the same classifier: the standard Logistic Regression.

Figure 6.14: Accuracy comparison (uni-grams, bi-grams and tri-grams) for Logistic Regression
model - Count Vectorizer

It is clear that the tri-gram bag of words gives the best performance in terms of accuracy.
Checking the data frames of accuracy scores, the best scores are obtained for:
• 70000 tokens for uni-grams: 79.52%
• 60000 or 90000 tokens for bi-grams: 81.43%
• 70000 tokens for tri-grams: 81.84%
To avoid recalculations, we will define a function that takes as parameters the vectorizer along with its parameters, the classifier and the training set, and returns the fitted model with the tuned parameters, ready to be applied to the validation or test text. We will then save these 3 models for further comparison.
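A possible implementation of this model_tuner is sketched below; the signature, the saving step with joblib and the file name are assumptions.

import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def model_tuner(vectorizer_cls, vectorizer_params, classifier, X_train, y_train):
    """Build a pipeline from the chosen vectorizer (with its tuned parameters)
    and classifier, fit it on the training set and return the fitted model."""
    model = Pipeline([
        ("vectorizer", vectorizer_cls(**vectorizer_params)),
        ("classifier", classifier),
    ])
    model.fit(X_train, y_train)
    return model

# Example: the retained uni-gram Count Vectorizer model (70000 features, see above).
lr_count_unigram = model_tuner(CountVectorizer, {"max_features": 70000},
                               LogisticRegression(max_iter=1000), X_train, y_train)
joblib.dump(lr_count_unigram, "lr_count_unigram.joblib")   # hypothetical file name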


6.3.2 TFIDF Vectorizer

The Count Vectorizer may not always be the best way to vectorize our dataset. Another option is the TF-IDF vectorizer.
Like CountVectorizer, scikit-learn provides a TfidfVectorizer module with a similar syntax [20]. We will repeat the same work as with the Count Vectorizer, using Logistic Regression, starting with uni-grams and then proceeding to bi-grams and tri-grams for different numbers of features.
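For example, under the same assumptions as the earlier sketches, the uni-gram TF-IDF run could be launched as:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same tuning loop as before, only the vectorizer class changes.
tfidf_unigram_results = n_feature_tuner(
    TfidfVectorizer, LogisticRegression(max_iter=1000),
    X_train, y_train, X_val, y_val,
    ngram_range=(1, 1),
)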

We will then plot the evolution of the accuracy scores of the three models.


Figure 6.15 shows the curves for uni-grams, bi-grams and tri-grams in the TF-IDF vectorizer case, giving the accuracy of Logistic Regression.

Figure 6.15: Accuracy comparison (uni-grams, bi-grams and tri-grams) for Logistic Regression
model - TFIDF Vectorizer

Similarly to the Count Vectorizer case, we can see that bi-grams and tri-grams reach noticeably higher accuracy scores. Tri-grams score better than bi-grams once the number of terms exceeds 4000.
The best accuracy score in each case is obtained with a certain number of terms:
• 100000 tokens for uni-grams: 79.77%
• 100000 tokens for bi-grams: 82.07%
• 80000 terms for tri-grams: 82.28%
Then, we will save our 3 models generated by the TF-IDF vectorizer.

It would be interesting to compare these models with the same ones trained while excluding the stop words from the vectorizer. The models without stop words give lower accuracy: removing them did not enhance the model. This is understandable, since the TF-IDF vectorizer already assigns a low weight to stop words: they appear in many more sentences, so their IDF is very low, and a portion of them may not even make it into the terms list.


Now, let us compare in figure 6.16 the 3 models generated by the TF-IDF vectorizer with those generated by the Count Vectorizer:

Figure 6.16: Accuracy comparison (uni-grams, bi-grams and Trigrams) for Logistic Regression
model - Count Vectorizer and TFIDF Vectorizer

It is clear that the TF-IDF vectorizer performs better than the Count Vectorizer in terms of accuracy for Logistic Regression.


6.4 Models test

After comparing our models on the validation dataset, we have been able to pick one classifier, along with a customized vectorizer, for each n-gram. Their performance can now be tested with a full report containing all the information regarding accuracy, precision, recall, etc.
To do so, we define a function that takes a fitted model (the whole pipeline obtained by model_tuner) along with the test set, launches it on that set and returns the predictions, the accuracy score, the confusion matrix and the full performance report. The different models are then compared.
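A sketch of such a reporting function, based on scikit-learn's metrics module, is given below; the dictionary layout of the result is an assumption.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def test_report(model, X_test, y_test):
    """Run a fitted pipeline on the test set and return the predictions,
    the accuracy, the confusion matrix and the full performance report
    (sketch of the function described above)."""
    predictions = model.predict(X_test)
    return {
        "predictions": predictions,
        "accuracy": accuracy_score(y_test, predictions),
        "confusion_matrix": confusion_matrix(y_test, predictions),
        "report": classification_report(y_test, predictions),
    }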


The function is then executed for the 6 models. We print the generated performance report for the first classifier.
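For instance (the model names and variables below are hypothetical), the reports could be generated as:

# Hypothetical dictionary of the six retained models.
models = {
    "LR-Count-unigrams": lr_count_unigram,
    # ... the five other fitted pipelines
}
reports = {name: test_report(m, X_test, y_test) for name, m in models.items()}
print(reports["LR-Count-unigrams"]["report"])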

The following tables correspond to the confusion matrices of each model.

• Logistic Regression - Unigrams

                    Predicted Negatives   Predicted Positives
Actual Negatives    6276                  1723
Actual Positives    1471                  6491

Table 6.1: Logistic Regression, Count Vectorizer, Uni-grams


• Logistic Regression - Bigrams

                    Predicted Negatives   Predicted Positives
Actual Negatives    6431                  1568
Actual Positives    1276                  6686

Table 6.2: Logistic Regression, Count Vectorizer, Bi-grams

• Logistic Regression - Trigrams

                    Predicted Negatives   Predicted Positives
Actual Negatives    6405                  1594
Actual Positives    1293                  6669

Table 6.3: Logistic Regression, Count Vectorizer, Tri-grams

• Logistic Regression TFIDF - Unigrams

                    Predicted Negatives   Predicted Positives
Actual Negatives    6305                  1694
Actual Positives    1489                  6473

Table 6.4: Logistic Regression, TFIDF, Uni-gram

• Logistic Regression TFIDF - Bigrams

                    Predicted Negatives   Predicted Positives
Actual Negatives    6520                  1479
Actual Positives    1316                  6646

Table 6.5: Logistic Regression, TFIDF, Bi-grams


• Logistic Regression TFIDF - Trigrams

                    Predicted Negatives   Predicted Positives
Actual Negatives    6515                  1484
Actual Positives    1317                  6645

Table 6.6: Logistic Regression, TFIDF, Tri-grams

We will summarize the performance scores in the table 6.7 below.


Count Vectorizer:
                      Uni-grams   Bi-grams   Tri-grams
Positive Precision    79.024%     81.003%    80.709%
Negative Precision    81.012%     83.444%    83.203%
Recall                81.525%     83.974%    83.760%
Positive F-Score      80.255%     82.462%    82.206%
Negative F-Score      79.715%     81.892%    81.608%
Specificity           79.46%      80.398%    80.073%
Accuracy              79.989%     82.182%    81.912%

TFIDF Vectorizer:
                      Uni-grams   Bi-grams   Tri-grams
Positive Precision    79.258%     81.797%    81.744%
Negative Precision    80.896%     83.206%    83.184%
Recall                81.299%     83.471%    83.459%
Positive F-Score      80.265%     82.626%    82.593%
Negative F-Score      79.846%     82.349%    82.307%
Specificity           78.822%     81.510%    81.448%
Accuracy              80.058%     82.489%    82.451%

Table 6.7: Performance Analysis

Performance Analysis

We can notice that negative precision is higher than positive precision, by a slight margin of 1-3%. This means that, proportionally, the predictions given by a model are more often correct when they concern the negative label: there are relatively fewer false negatives among the negative predictions than false positives among the positive predictions. In other words, the classifiers tend to get their negative predictions right more often, as illustrated by the confusion matrices of the 6 models.
Perhaps this difference is due to the kind of words that distinguish each class: words used in the negative-labeled tweets may be more distinctive, and thus contribute more to effective tokenizing.
However, the opposite holds for recall: positive recall is greater than specificity (negative recall) by a margin of 1-3%, i.e. the percentage of positives predicted correctly is greater than the percentage of negatives predicted correctly.
As we said before, to get a suitable compromise and balance between precision and recall, we consider the F1-score. The F-measure has no direct interpretation of its own; it is basically a combination of recall and precision that provides a compromise between them and helps us decide which model to choose.


We noticed that tri-grams performed slightly better than bi-grams on the validation set, whereas the opposite holds on the test set. This is not very surprising, given the small difference and the fact that the random generation/split of the data influences the performance results: a few changes in the test set may reverse the ranking.
Overall, however, bi-grams and tri-grams have proven to be more precise and accurate than uni-grams, and thus perform better.

Conclusion

When applying the learning procedure to the dataset, we continuously changed the hyperparameters of each implemented algorithm and noticed that the performance and the accuracy change accordingly. So, trying as many algorithms and hyperparameter settings as possible helps us reach the optimal model.
In fact, NLP is not only about simple count and TF-IDF vectorizers. There is a variety of vectorizers that take the lexical approach and the grammar into consideration, which can certainly enhance the model and make more sense of the data.

Conclusion

This project has been a real challenge for us in our 2nd year, since it was a new domain for us and since machine learning requires a deep understanding of quite complex methods from applied mathematics and statistics.
Nevertheless, we tried to be as precise as we could in order to deliver an explanation of every new concept we faced.
In this work, our interest was focused on sentiment analysis with machine learning techniques and natural language processing. Therefore, a review of these concepts was first introduced. Then, the retained machine learning techniques were highlighted before adapting them to our problem. The considered dataset consists of opinions from the social network Twitter. Data pre-processing was performed, followed by data visualization and analysis. Afterwards, a training step was carried out.
We wanted to try more methods from Deep Learning and ANNs, which could have been more precise. However, due to the time constraint and to the complexity of these methods, this part was not achieved, so this work may be the ground for further improvements. Finally, we can conclude that this project was beneficial on several levels, especially for the exploration of the machine learning field and the implementation of the theoretical knowledge that we acquired during courses at our school, ENIT.
It was a long ride and a large time investment, since we learned everything from scratch and made sure to learn all the fundamentals before getting started with practice. But it was worth the time, and it only made us more passionate about Machine Learning.

Bibliography

[1] Gentle introduction to machine learning, https://towardsdatascience.com/a-gentle-introduction-to-machine-learning-599210ec34ad, revised on 4th of June 2020.
[2] How to build your own neural network from scratch in Python, https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6, revised on 4th of June 2020.
[3] Introduction to deep learning, https://hackernoon.com/introduction-to-deep-learning-9064d6b87a51, revised on 4th of June 2020.
[4] Machine deep learning, https://www.datacamp.com/community/tutorials/machine-deep-learning, revised on 4th of June 2020.
[5] Illustrated guide to recurrent neural networks, https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9, revised on 4th of June 2020.
[6] Recurrent neural networks, https://towardsdatascience.com/recurrent-neural-networks-d4642c9bc7ce, revised on 4th of June 2020.
[7] Gradient Descent, https://github.com/SoojungHong/MachineLearning/wiki/Gradient-Descent, revised on 4th of June 2020.
[8] https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm, revised on 4th of June 2020.
[9] Pipelining in Python, https://medium.com/@shivangisareen/pipelining-in-python-7edd2382f67d, revised on 4th of June 2020.
[10] https://pandas.pydata.org/docs/user_guide/index.html, revised on 4th of June 2020.
[11] https://pypi.org/project/beautifulsoup4/, revised on 4th of June 2020.
[12] https://www.w3schools.com/python/python_regex.asp, revised on 4th of June 2020.
[13] https://www.nltk.org/, revised on 4th of June 2020.
[14] https://pypi.org/project/wordcloud/, revised on 4th of June 2020.
[15] Using Zipf's law to improve neural language models, https://medium.com/@_init_/using-zipfs-law-to-improve-neural-language-models-4c3d66e6d2f6, revised on 4th of June 2020.
[16] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html, revised on 4th of June 2020.
[17] https://scikit-learn.org/stable/modules/cross_validation.html, revised on 4th of June 2020.
[18] Difference between a parameter and a hyperparameter, https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter, revised on 4th of June 2020.
[19] https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline, revised on 4th of June 2020.
[20] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html, revised on 4th of June 2020.
[21] Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.
[22] Ayon Dey. "Machine Learning Algorithms: A Review". In: International Journal of Computer Science and Information Technologies Vol. 7 (2016).
[23] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly, 2017.
[24] Hossin and Sulaiman. "A Review on Evaluation Metrics for Data Classification Evaluations". In: International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol. 5 (2011).
[25] Jianqing Fan, Cong Ma, and Yiqiao Zhong. A Selective Overview of Deep Learning. 2019.
[26] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2019.
[27] Marwan Khan and Sanam Noor. "Performance Analysis of Regression-Machine Learning Algorithms for Predication of Runoff Time". In: Agrotechnology Vol. 8 (2019).
[28] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020.
[29] Andrew Ng. CS229 Lecture Notes, Stanford University. 2019.
[30] Maheshkumar H. Kolekar, Pradeep Kumar Singh, Arpan Kumar Kar, Yashwant Singh, and Sudeep Tanwar. Proceedings of ICRIC 2019. Springer, 2019.
[31] Nitin Nandkumar Sakhare and S. Sagar Imambi. "Performance Analysis of Regression Based Machine Learning Techniques for Prediction of Stock Market Movements". In: International Journal of Recent Technology and Engineering (IJRTE) Vol. 7 (2019).
[32] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[33] Han Zhu, Shiliang Sun, and Zehui Cao. A Survey of Optimization Methods from a Machine Learning Perspective. 2019.
[34] Avram Sidi. "Unified Treatment of Regula Falsi, Newton-Raphson, Secant, and Steffensen Methods for Nonlinear Equations". In: Journal of Online Mathematics and Its Applications (2016).
[35] Alex Smola and S.V.N. Vishwanathan. Introduction to Machine Learning. Cambridge University Press, 2008.
[36] Oliver Theobald. Machine Learning for Absolute Beginners. Jeremy Pederson, 2017.
[37] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[38] Shigeo Abe and Yoshiaki Koshiba. "Comparison of L1 and L2 Support Vector Machines". (2011).

Annex

ML: Machine Learning
DL: Deep Learning
AI: Artificial Intelligence
NLP: Natural Language Processing
SVM: Support Vector Machine
LR: Logistic Regression
