
DATA MINING, MACHINE LEARNING AND DEEP LEARNING

SUPERVISED LEARNING (I)



INDEX

1. INTRODUCTION
1.1 What are Data Mining, Machine Learning and Deep Learning?
1.2 Some Machine Learning applications
1.3 Machine Learning Categories: Supervised Learning, Unsupervised Learning, Reinforcement Learning and Semi-supervised Solutions
1.4 Preparing the work environment

2. SIMPLE, MULTIPLE AND LOGISTIC LINEAR REGRESSION (I)
2.1 What is supervised learning?
2.2 Regression vs. classification
2.3 Simple, multiple and logistic Linear Regression: Initial Intuition
2.4 Mathematical description
2.5 Interpretation of coefficients
2.6 Assumptions on linear regression models
2.7 Additional considerations

3. SIMPLE, MULTIPLE AND LOGISTIC LINEAR REGRESSION (II)
3.1 Logistic Regression
3.2 Multinomial logistic regression
3.3 Nonlinear regression
3.4 Overfitting
3.5 Regularization in regression models
3.6 Training data (train), validation (val) and test
3.7 Example 1
3.8 Example 2
3.9 Example 3

4. SUPPORT VECTOR MACHINES (SVM)
4.1 Description
4.2 Example 1
4.3 Example 2

5. DECISION TREES
5.1 Initial intuition
5.2 Description
5.3 Random forest
5.4 Ensemble learning
5.5 Example 1: Decision tree
5.6 Example 2: Random forest


1. INTRODUCTION
The first chapter of the course establishes an initial definition of the key topics to be addressed. It also explains which technological working environment will be used and how to prepare it in order to tackle the different tasks that will be carried out throughout the course.

1.1 What are Data Mining, Machine Learning and Deep Learning?

The terms Data Mining (DM), Machine Learning (ML) and Deep Learning (DL) are among the most mentioned and consulted in the scientific literature at all levels, although it is true that they are often used confusingly and inaccurately. This is why it is important, before getting into the course, to succinctly define what these terms mean, what their scope is, and how they relate to each other.

In order to properly understand them, it is necessary to mention that these fall within a broader
category of computer science known as Artificial Intelligence (AI).

AI refers to the idea that computer systems could be developed to simulate the behavior of a human being, so that they could solve tasks in a similarly "intelligent" way. The term is not new; its origins go back to a workshop held at Dartmouth College in 1956 [1], where John McCarthy proposed it to refer to "thinking machines". The initial goal was to develop machines that perfectly simulate how humans behave, how their mental processes work and how they interact with the environment, seeking to solve with these systems any kind of task that a person can solve, and even to do it better than they do. This is known as Artificial General Intelligence (AGI)1, which, although it initially sounded like a truly promising idea, ended up not meeting the expectations raised at the time, leading to a great period of disillusionment with AI known as the AI "winter".

1 Here we should really distinguish between what is known as AGI (an intelligent system capable of addressing any task or problem) and Strong Artificial Intelligence or Strong AI (where not only does the above apply, but the aim is also to develop systems that are, for all purposes, like human beings, in that they can have self-awareness and a mind).


The technological paradigm of AI changed over the years, distancing itself from the idea of developing a system that perfectly emulates a human being (something that is also not easy to measure or verify, since classic proposals such as the Turing Test [2] have received numerous criticisms regarding their validity as an absolute scale for measuring rational intelligence [3]).

This new technological paradigm, known as Narrow AI, limited the scope and pretensions of this technology, now proposing not to solve any kind of task with the same system, but to develop systems highly specialized in solving certain specific tasks which, again, they could solve more efficiently than a human being.

Unlike AGI, this specialized AI has repeatedly demonstrated its worth, and it is the technology that, to this day, is being researched the most and implemented in more and more companies across all kinds of sectors. That is why this type of AI is the one that will be discussed in the rest of this course.

AI can be subclassified both by the technologies used and by the applications for which it can be used, and this is where we begin to link to the key concepts of the course.

A classification of the different AI techniques is as follows:

▪ Machine Learning.

▪ Fuzzy Logic Systems.

▪ Expert Systems.

▪ Bayesian Networks.

▪ Evolutionary Algorithms.

▪ Case-Based Systems.

Thus, ML is encompassed within one of the existing AI techniques, among many others.

Both ML and DL are therefore part of AI, but AI is a broader concept that also encompasses other technologies such as Expert Systems, Fuzzy Logic, Genetic Algorithms... which have nothing to do with the aforementioned ML and DL techniques.


Figure 1 Visual classification of the various techniques mentioned2

Thus, ML is a subset of AI that specifically seeks to provide computer systems with the ability to learn new knowledge for which the machines have not previously been explicitly programmed. This would apply, for example, to developing a system able to deduce rules autonomously that allow it to differentiate when an image contains a tiger, a hen...

2Source:https://www.codesofinterest.com/2016/11/difference-artificial-intelligence-machine-learning-deep-

learning.html


Figure 2 Example of a system that automatically infers patterns to classify different images into specific categories 3.

The learning and the ability to generalize knowledge from information in order to deduce rules are carried out by different algorithms, which are simply mathematical functions that adjust specific parameters according to the information received and, with that, generalize the aforementioned rules.

Therefore, Machine Learning refers directly to a set of mathematical functions that always require input data, thanks to which they can adjust certain parameters within those mathematical functions in order to achieve a generalization that allows them to extract relevant insights or conclusions from new, future sets of information to which they have never been exposed before.

3 Source: http://danielnouri.org/notes/category/bioacoustics/


Figure 3 Comparative image of different approaches to extracting information from data. In contrast to the classic paradigm, in which fixed rules are programmed to deal with the different datasets, ML algorithms offer the possibility of inferring new, unprogrammed rules (gray color in the image), or even of inferring which parts of all the information provided are really relevant for inferring those new rules (representation learning).4

There are numerous algorithms, many of which will be seen in detail throughout the course, but there is a subset of them that is classified within what is known as Deep Learning.

4Source: https://www.slideshare.net/natalinobusa/deep-learning-overview-and-practical-use-in-marketing-
and-cybersecurity


Figure 4 Historical evolution of technologies5.

DL takes the functioning of the human brain as inspiration and attempts to define different architectures of neural connections that simulate its functioning in some way, thus resulting in a type of algorithm that is bio-inspired.

These DL techniques, as we will see in the image below, are not something really new; they have
existed since the middle of the 20th century.

5 Source: https://www.quora.com/What-are-other-fields-or-%E2%80%9Cbranches%E2%80%9D-of-AI-
outside-of-but-that-are-just-as-interesting-and-powerful-as-machine-learning


Figure 5 Origins of different ML algorithms, in particular of many DL algorithms6.

However, these algorithms are computationally very complex and require a lot of information in order to achieve a good generalization with their parameters. This is why they went largely unnoticed for decades, until there was growth and improvement in two areas: on the one hand, computing infrastructure, with the use of GPUs and TPUs, and on the other, the amount of available data, thanks to the era of Big Data.

This has served to demonstrate the great potential of all these algorithms, especially for scenarios
where a lot of information is available.

6 Source: https://jisajournal.springeropen.com/articles/10.1186/s13174-018-0087-2


Figure 6 Historical evolution of some ML applications and techniques along with the infrastructures used for them7.

Finally, it remains to define the meaning of Data Mining and how it relates to all of the above.

Just as ML and DL are techniques that fall within the AI technology stack, Data Mining is an application of such techniques, together with other statistical techniques that do not fall within the set of ML algorithms, to extract relevant information from a particular dataset. It is another branch within Computer Science.

7 Source: https://www.bdti.com/InsideDSP/2016/11/17/WaveComputing


Figure 7 AI/ML applications including DM8.

The DM branch traditionally includes not only data analysis, as mentioned above, but also knowledge of and work with the information systems that act as repositories for such data (database systems), as well as data extraction and preparation techniques. However, this course will focus on the application of ML algorithms to the analysis of information from different datasets, thus studying the intersection that exists between all the disciplines mentioned. In turn, the main techniques for processing and preparing data for use with such algorithms, which belong as such to the scope of DM, will be mentioned.

8 Source: http://www.cellstrat.com/2017/10/27/applications-of-ai-ml/


Figure 8 Intersections between the disciplines mentioned above9.

It should be noted that some of the algorithms to be analyzed are inspired by other branches of AI (specifically, Bayesian networks). These algorithms are covered within the course, as they are now generally encompassed within the Machine Learning suite of techniques.

9 Source: https://www.adendavies.com/ai-wont-absolve-you/


Figure 9 As we can see, data mining appears as an intersection of different disciplines.10

1.2 Some Machine Learning applications

Even if we are not aware of it, our daily lives are imbued with different ML applications, many of which we already take for granted as essential elements without which it would be difficult to perform some of the tasks carried out today.

The following, for illustrative purposes only, are some of these everyday applications in which the
central element that allows them to be carried out is precisely an ML algorithm:

Detecting people in photos

There are numerous social media platforms where it is already seen as normal that, when an image is uploaded, the system identifies the faces of the people present and even suggests that the user tag those faces with some of their friends.

10 Source: Data Mining Subject of the Master of Decision Systems Engineering of the URJC.


Figure 9 Example of face tagging on Facebook.

Autocomplete texts

Virtually all smartphones used today offer the possibility of autocompleting text, proposing to users how to finish words or phrases based on what they are writing.

Figure 10 Text autocompletion based on the characters entered.


Voice-to-text transformation

Again, many mobile devices allow us to automatically transform a user's voice into the corresponding typed text, something that is also carried out with ML algorithms.

Figure 11 Keyboard icon to transform a voice-to-text message.

Content recommendations

Numerous industries (retail, e-commerce...) have multiple contents that are provided to their users
in a personalized way according to different criteria.

Figure 13 Example of audiovisual content recommendations through Netflix, where the various recommendations
offered are made thanks to ML algorithms.


Robotics

Another of the most relevant application areas of ML is robotics, where thanks to these algorithms systems are able to interact with the environment and, moreover, to learn to interact with any type of new environment they face.

Figure 12 Humanoid robot (Atlas) from Boston Dynamics, where it can be seen how the system has learned to dodge obstacles thanks to different learning techniques.

Medical diagnosis

ML algorithms are also used within the healthcare field for numerous applications, including
medical diagnosis assistance and support. These systems are able to learn to recognize many
pathologies and diseases by analyzing medical images, and on numerous occasions they do so
more accurately than a doctor.


Figure 13 Example of medical imaging.

Video games

The use of ML algorithms is also present in the video game industry, where they are used both
to provide intelligence to the different artificial agents present in that game, as well as for other
newer applications such as virtual reality or augmented reality.

Figure 14 ML used on different video game platforms to create intelligent virtual agents


Virtual assistants

Today, both on mobile devices and through physical devices (Google Home, Alexa...), there are virtual assistants that allow users to interact directly with a large number of systems through a single access point by dialoguing with those assistants.

Figure 15 Example of dialogue with Siri

Information retrieval systems

Information retrieval systems, such as many search engines (Google, Bing...) allow users to find
the most relevant information from a whole giant compendium of content based on the queries
they make. Determining what content is relevant or not is done in part by these algorithms.

Figure 16 Google Search Engine


Mail spam filtering

The use of email systems that automatically filter which emails are spam and which are not, leaving only the relevant emails in the inbox, is already commonplace. Spam detection is often done using ML algorithms and, specifically, with one that will be addressed within this course.

All these systems are made possible by the ability of these algorithms to extract numerous insights from a huge amount of available data. It is true that knowledge extraction from different sets of information can in many cases be done manually, but today there is more and more data, which makes it essential to have techniques that can extract knowledge automatically, without the need for a human agent. With this, we can go from analyzing and working with small datasets (as traditionally done) to being able to potentially extract information from all the data that is constantly being generated from different sources.


Figure 17 Evolution of data generated globally from 2010 to the 2020 forecast11

1.3 Machine Learning Categories: Supervised Learning, Unsupervised Learning,


Reinforcement Learning and Semi-supervised Solutions

ML is mainly classified into 3 clearly differentiated categories, according to which the central part of this course will be divided. These categories, Supervised Learning, Unsupervised Learning and Reinforcement Learning, are described below at an intuitive level, to serve as an initial guide and understanding before delving into the technical aspects in the following topics.

All of these algorithms share the fact that they are able to learn from a certain set of available data, so that they are subsequently able to generalize what they have learned to new data they have never seen before. However, the way in which they learn is what distinguishes them and what gives rise to the aforementioned categories, detailed below.

11 Source: https://www.emc.com/collateral/analyst-reports/idc-digital-universe-united-states.pdf


1.3.1. Supervised Learning

As already mentioned, what distinguishes these sets of algorithms is how they learn and generalize from the data they receive.

The best way to understand all of the above is with a few simple examples. Consider a dataset containing different features that characterize an animal (size, weight, whether or not it has a beak, color, whether or not it is a vertebrate, whether or not it is a bird...). We might want an algorithm that could, based on the information provided, deduce what type of animal it is.

In the case of supervised learning, the starting point would be a set of cases for which the category they belong to is known from the outset (for example, because a specialist has previously identified them), and based on the relationships between the features and the output (the animal), the system's learning would be generalized in order to identify new animals from their features.

Figure 18 Simple example of classification of a supervised algorithm12

12 Source: https://blog.westerndigital.com/machine-learning-pipeline-object-storage/supervised-learning-
diagram/


This might seem similar to a fixed-rule system (If-Else) in which the animal's name is tied to features such as color or whether or not it has a beak. But what would happen when working with numeric variables such as weight or size? Would we need a rule that considers every possible numeric value? This would be unmanageable, and this is where supervised learning systems become useful: the adjustable parameters of the algorithms adapt according to the data used to train the system (i.e., the data for which both the features and the animal are known), so that animals with different values of those features can be identified.

Figure 19 Supervised learning functioning scheme13.

Again, intuitively, the system goes through the training data and tries to predict the type of animal from the available feature data. As the real category (the animal in this case) is available for the training data, the system automatically compares its response with the real one to see whether it has succeeded; if it has not, it readjusts its parameters so that its output comes closer to the real output.

13 Source: https://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-
unsupervised-and-reinforcement-learning/


Thus, in short, these systems could be seen as black boxes in which there is a mathematical function (with certain parameters or coefficients that can be adjusted) that relates the input parameters (features) to a specific output.

ŷ = F(X)

We therefore have a function F that receives a matrix X with the values of the input parameters, in which each row corresponds to the available data for one of the animals and the columns are the different features used for the analysis. On the other hand, ŷ is the column vector representing the types of animal (one per row) that the system believes correspond to the data in X for specific values of the function's parameters.

With that prediction, the system can try to minimize the following:

min(y − ŷ)

Where y is the vector of real animal names, so the system will look for the combination of parameters of the function F() such that this difference is as small as possible; that is, even if the system is not always going to guess right, it will always try to fail as little as possible.
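
As an illustration of this fit-then-predict idea, below is a minimal sketch in Python using scikit-learn (introduced later in this chapter). The feature values and animal labels are invented for the example, and a decision tree is used simply because it needs no extra configuration; any supervised classifier would follow the same pattern.

from sklearn.tree import DecisionTreeClassifier

# Each row describes one animal: [size_cm, weight_kg, has_beak (1/0)]
X = [[30, 2.0, 1],
     [60, 25.0, 0],
     [35, 2.5, 1],
     [70, 30.0, 0]]
y = ["duck", "dog", "duck", "dog"]   # known categories for the training data

model = DecisionTreeClassifier()
model.fit(X, y)                       # adjust the internal parameters from the labeled data
print(model.predict([[32, 2.2, 1]]))  # label proposed for an animal never seen before

The important point is the workflow: fit() receives both X and y, while predict() receives only new feature rows and returns the system's best guess.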

Such algorithms are probably the most commonly used within the field of data science. It should be said, especially in Big Data projects, that the categories (the vector y) do not always have to be generated manually by a user; they may also have been generated automatically by a system (for example, when identifying whether a client is going to leave the company based on features such as their age or gender, the class to be identified, leaving the company or not, is inferred from a set of past data known as the historical data, so it does not need to be labeled manually).

Within this set of algorithms, a further subcategorization can be highlighted depending on the output variable being predicted: classification and regression problems.


Classification

The output variable may be a category, as in the previous animal example, where we try to predict whether it is a "duck", a "dog"... For cases where the output variable is a category, a particular set of algorithms known as classification algorithms is used.

Regression

However, the variable to be predicted may instead be a real number, such as the price of a house predicted from features such as size, neighborhood... In this case we would be facing a regression problem, and for these problems, as we will see in the corresponding topic, specific algorithms will be used.

1.3.2. Unsupervised Learning

The situation described above will not always occur with every dataset we work with. Going back to the case of the animals, it may well happen that the category associated with the available features is not available because, for example, the animals could not be manually identified for lack of time, resources... We can imagine sensors, cameras, scales... a set of devices that provide all the feature data described above for any animal passing through the system, but to have the classes someone would need to identify the animals directly in order to train the algorithms mentioned above.

For all cases where the classes associated with the input parameters are not available, there is an additional set of algorithms, known as unsupervised, which are characterized by being able to learn and generalize knowledge from datasets without categories or labels.

These systems look for relationships between the data values of the different records (animals, in the example above) to try to infer relationships between them. For example, a system might attempt to group records that it considers similar into groups (or clusters). This is what a certain set of algorithms known as segmentation or clustering algorithms do, and they do so automatically, without the need for the additional supervised knowledge that the previously mentioned algorithms require.
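
A minimal clustering sketch follows, again with invented feature values and assuming scikit-learn is available. Note that the data contains no labels at all: k-means simply groups the rows into the requested number of clusters.

from sklearn.cluster import KMeans

# Feature rows without any label: [size_cm, weight_kg]
X = [[30, 2.0], [32, 2.3], [70, 30.0], [75, 28.0]]

kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index assigned to each row, e.g. [0 0 1 1]
print(labels)

The algorithm returns anonymous group indices; giving those groups a meaningful name ("small animals", "large animals"...) would require a human or some supervised element, as noted in the figure below.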


Figure 20 Simple example of an unsupervised algorithm, in this case clustering, where groups of similar animals are inferred from the information or patterns that the algorithms detect in the available dataset. The categories would not be named directly by these algorithms, as that would require a supervised element. However, the groups are provided automatically.

Figure 21 Unsupervised learning functioning scheme14.

14 Source: https://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-
unsupervised-and-reinforcement-learning/


1.3.3. Reinforcement Learning

Reinforcement Learning (RL) is one of the newest paradigms of recent years, thanks to the advancement of academic research and the success of its application to complex problems, such as Google's AlphaGo system and its victory in a Go game against the celebrated champion Tang Weixing in 2016 [4].

In RL, the distinctive element is the reward: there is an agent that has different actions at its disposal and, by trial and error and reward, as a result of testing how its actions interact with the environment, it ends up deducing which combination of actions is optimal for specific external conditions, so that the reward function can be optimized.
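
The trial-and-error idea can be illustrated with a very reduced sketch: an agent that must choose between two actions with unknown (here, made-up) reward probabilities and learns, purely from the rewards it receives, which action pays off more. This is only a toy epsilon-greedy example, not the kind of algorithm used in systems such as AlphaGo.

import random

# Made-up environment: action 1 gives a reward more often than action 0.
def reward(action):
    return 1.0 if random.random() < (0.3 if action == 0 else 0.7) else 0.0

values = [0.0, 0.0]   # current estimate of the average reward of each action
counts = [0, 0]
epsilon = 0.1         # probability of exploring a random action

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(2)        # explore
    else:
        action = values.index(max(values))  # exploit the best-known action
    r = reward(action)
    counts[action] += 1
    values[action] += (r - values[action]) / counts[action]  # incremental average

print(values)  # the estimate for action 1 should end up close to 0.7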

Returning to the Go example, the agent has to learn, after playing several consecutive games, which concrete actions, given a specific state of the board, will lead it to maximize its objective function (which is to win, and to win in as few moves as possible). All of this, and this is one of the important features of this type of algorithm, is discovered autonomously by the system, without the need for human supervision. This is why some advocate that this paradigm (strongly inspired by psychological currents such as those of Pavlov, Skinner or Thorndike and the theory of behaviorism) is the one that truly reflects what an AI system should be, and that it will open the way to AGI.


Figure 22 Reinforcement Learning Operation Scheme15.

Figure 23 Example of an RL application. The system uses a reward scheme based on a score so that it learns the best combination of its possible moves (jumping or crouching) in order to maximize it; the penalty (receiving less reward at the end of the game) appears when the agent hits one of the obstacles.

15 Source: https://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-
unsupervised-and-reinforcement-learning/


1.4 Preparing the work environment

Installing Python and Anaconda

The environment to be used during the course for the various exercises will be the Python distribution known as Anaconda (a free, open-source distribution). The main reason is that this distribution already includes most of the packages used for ML, facilitates the installation of new packages if necessary, and simplifies the management of virtual environments (see below).

Below are the steps to install it from scratch on both Windows and Linux/Mac.

There are currently two major versions of Python, Python 2.7 and Python 3.7, which have coexisted for many years. However, since it has been announced that version 2.7 will stop being supported in approximately one year16, version 3.7 will be used in this course17, so it is recommended to download that one.

Installation on Windows18

The steps to install Anaconda on Windows are as follows:

1. Download Anaconda from the website. As we choose Python version 3.7, Anaconda3
will be automatically downloaded. https://www.anaconda.com/download/#windows

16 Source: https://pythonclock.org/
17 The Python version 3.6.3 is also valid, even though it is a little older. During the course we may see in the examples some program outputs in that version; everything that will be covered is analogous in both.
18 Source: https://conda.io/docs/user-guide/install/windows.html


Figure 26

2. Locate and double-click on the executable and follow the steps.

Figure 27

Figure 28


3. During installation, choose the directory where the Anaconda files will be.

Figure 29

4. An important step is to decide whether to automatically associate Anaconda with the Windows environment variables. If we want to be able to use Anaconda directly from the Windows terminal, we have to select the "alternative" option that appears in the image.


Figure 24 STANDARD Option

Figure 25 ALTERNATIVE Option


5. At the end of the installation it is proposed to install Microsoft VSCode, but this is
optional. It won't be necessary for the course.

Figure 32

6. Finish the installation.

Figure 33


Optional: Add the Anaconda path to Windows later.

If we did not follow the alternative option in step 4, we can add Anaconda to the environment variables at any time in order to run it from the Windows terminal, use Git Bash...

1. Check whether it is already added. To do so, open the Windows terminal and run the following commands.

> python --version

> anaconda --version

Figure 26 Windows Terminal Open.

2. Depending on the result obtained, we can check whether Anaconda is added or not.

Figure 27 Windows output when Anaconda is not on the path.


Figure 28 Output when the path exists.

3. If it is not, first locate the conda executable and the Python executable.

> where python

> where conda

Figure 29 Location of the two executables.

4. Then add the above paths to the PATH environment variable (the settings panel can be found by typing "environment variables" in the Windows search box).

Figure 38


If everything has been done correctly, step 1 can be repeated and the correct output shown in step 2 will be obtained.

Installation on Ubuntu19 / Mac OS20

Similarly, the installation in this case would be:

1) Download the Anaconda3 version for Python 3.7 from the website

https://www.anaconda.com/download/#linux

https://www.anaconda.com/download/#macos

2) From the downloads folder, run the .sh file

bash Anaconda3-5.2.0-Linux-x86_64.sh

This results in the following output:

Figure 3821

19 Source: https://conda.io/docs/user-guide/install/linux.html

21 Source: https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-
quickstart


3) After completing the installation steps, the option to add Anaconda3 to the system PATH will appear. It is recommended to say yes in order to be able to use the conda command from the system terminal.

Figure 3922

At the end of the installation, the installer will offer to install Visual Studio Code, although this is completely optional and will not be required for this course.

4) To verify that everything is correct from the system terminal, the following commands
are executed:

$ python --version

$ anaconda --version

Which should give an output similar to the following:

Figure 4023

22 Source: https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-
quickstart
23 Source: https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-

quickstart


Optional: Creating Virtual Environments

Python and Anaconda offer the option of working with what are known as virtual environments. These environments are isolated instances in which the installed packages remain separate from other environments and from the main Python distribution, making it easier to control the libraries used and simplifying the reproducibility of the environment needed to run the developed code.

Although it is not required to complete this course, here are the steps to do so.

1) Creating the environment.

$ conda create --name my_env python=3

Where my_env is an arbitrary name to refer to the environment. This process will take
some time as the libraries that Anaconda brings by default are installed within that
environment.

2) Activating the environment.

The environment is activated in Windows as follows.

$ activate my_env

And in Linux/Mac

$ source activate my_env

Once activated, the environment name will appear in parentheses at the beginning of the terminal prompt.

And, from within the environment, we can run Python.


In Windows:

(my_env) > python

On Ubuntu24:

(my_env) $ python3

This technique allows us to create new environments that contain only the minimum set of packages required, in contrast to the main Anaconda distribution, which includes many packages that may not be needed for a particular application or piece of code.

Proposed Integrated Development Environments (IDEs)

Spyder25

Spyder is one of the most popular IDEs for data science. It offers a friendly interface, very similar to those of other tools used for this purpose such as RStudio or Matlab, and includes a variable explorer for inspecting the value of variables during partial execution of the code, which is a great help for understanding the execution flow and troubleshooting errors.

24 On Linux/Ubuntu in general it is best to specify the Python version, unless it has been modified manually,
as it may by default include a previous Python 2.7 distribution associated with the Python command.
25 https://www.spyder-ide.org/


An additional advantage of this IDE is that it is developed purely in Python and is relatively lightweight, so it can be installed within virtual environments. This IDE also comes bundled with the default installation of Anaconda.

Spyder can be opened from the search bar (Windows) or from the command terminal (Windows/Linux/Mac) with the command:

$ spyder

The IDE looks like this26:

Figure 41

26 Image using the editor's Dark Theme option. By default, it is light.


Jupyter Notebook27

Jupyter Notebook is an open-source web application that allows us to develop and execute code in different programming languages (including Python) and in which we can combine cells containing code, text, visualizations... creating interactive notebooks that help present the development of a program.

The default installation of Anaconda includes Jupyter Notebook, which can be run with the following command:

$ jupyter notebook

Figure 42

Once executed we can access it from any browser (Chrome, Firefox, Edge...) via the url:
http://localhost:8888/tree

27 http://jupyter.org/


Figure 43

From there we can create new notebooks (files with the extension .ipynb) that look like this:

Figure 44

Principal libraries to be used in the course:

NumPy28

NumPy is a library that facilitates working with matrices and vectors and provides numerous mathematical functions and algorithms for operating on the data very efficiently.
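
A couple of lines are enough to get a feel for it (a minimal sketch; the numbers are arbitrary):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2x2 matrix
v = np.array([10.0, 20.0])              # vector
print(a.mean())    # mean of all the elements: 2.5
print(a.dot(v))    # matrix-vector product: [ 50. 110.]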

Pandas29

The Pandas library makes it much easier to manipulate and display data, allowing us to define and operate on tables (dataframes) with a syntax very similar to SQL. It incorporates numerous functions for representing or transforming datasets quickly and easily. Its implementation is highly optimized, since parts of it are written in C.
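
As an illustration, a minimal sketch with an invented table showing the SQL-like style of filtering and grouping:

import pandas as pd

df = pd.DataFrame({"city": ["Madrid", "Paris", "Madrid"],
                   "spend": [100, 250, 80]})
print(df[df["spend"] > 90])                 # similar to a SQL WHERE clause
print(df.groupby("city")["spend"].mean())   # similar to a SQL GROUP BY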

Scikit-Learn30

28 http://www.numpy.org/
29 https://pandas.pydata.org/
30 http://scikit-learn.org/stable/


Scikit-Learn is one of the most important libraries for this course, as it includes implementations of a wide variety of ML algorithms for regression, classification, unsupervised analysis... so it will be the reference ML library for the course. The models it offers integrate easily with data in NumPy or Pandas format.
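
All of its estimators share the same fit/predict interface, as this minimal sketch (with invented data) shows:

from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0]]              # one feature per row
y = [2.0, 4.0, 6.0]                    # target values
model = LinearRegression().fit(X, y)   # every estimator exposes fit()...
print(model.predict([[4.0]]))          # ...and predict(): approximately [8.]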

Matplotlib31

Matplotlib is a library for visualizing 2D data in Python that aims to make graphical representation easy. There are other popular libraries for these tasks, although the course will mainly use this one, as it offers a combination of efficiency, good graphic style and ease of use.
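
A minimal sketch of its typical usage (arbitrary data):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y, marker="o")  # simple 2D line plot
plt.xlabel("x")
plt.ylabel("y")
plt.show()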

Keras32

Keras is a library designed to develop neural networks in Python and will be the library used for the Deep Learning topics. The library seeks to speed up the development of these architectures by running on top of a backend provided by other, more complex DL libraries such as Theano, TensorFlow or CNTK; it is thus a wrapper over those libraries that tries to homogenize and simplify their use.
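
Just to show the flavor of the API, here is a minimal sketch of a tiny network definition, assuming the standalone keras package and one of the backends mentioned is installed; the layer sizes are arbitrary:

from keras.models import Sequential
from keras.layers import Dense

# A tiny fully connected network for binary classification on 4 input features.
model = Sequential()
model.add(Dense(8, activation="relu", input_shape=(4,)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()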

Statsmodels33

Statsmodels is a Python statistical package that complements what Scikit-Learn offers at the ML level by providing tools closer to classical statistics but equally necessary for many problems, such as hypothesis testing, time series or descriptive analysis of data. Thanks to Statsmodels, Python has numerous tools that have traditionally been reserved for more purely statistical programming languages (such as R).
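
A minimal sketch of a classical least-squares fit with Statsmodels (invented data points):

import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = sm.add_constant(x)            # adds the intercept column
results = sm.OLS(y, X).fit()      # ordinary least squares regression
print(results.summary())          # coefficients, p-values, confidence intervals...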

SciPy34

Another very useful library to know is SciPy. It offers the tools needed to tackle mathematical computing problems such as integration, linear or nonlinear optimization problems

31 https://matplotlib.org/
32 https://keras.io/
33 https://www.statsmodels.org/stable/index.html
34 https://www.scipy.org/


(including metaheuristic techniques) or interpolation. To do this, it uses the mathematical objects provided by NumPy and operates on them with its different functions.
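
Two representative one-liners, as a minimal sketch:

from scipy import optimize, integrate

# Minimize a simple quadratic function starting from x = 3.
res = optimize.minimize(lambda x: (x - 2.0) ** 2, x0=3.0)
print(res.x)  # approximately [2.]

# Numerically integrate x^2 between 0 and 1 (exact value: 1/3).
value, error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)
print(value)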

Jupyter Lab35

As an alternative to Jupyter Notebooks, we can use Jupyter Lab, which offers a more complete
interface with more functionalities.

Google Colab36

Google offers the ability to work with Jupyter notebooks in the cloud, so we do not have to install or prepare any local environment; everything runs on Google's servers.

Why Python?

The debate about which is the best programming language for working with ML is an open one, with many advocates for each language. The truth is that there is no one-size-fits-all answer, and each language has remarkable advantages that make it more suitable for some situations than for others.

However, Python stands out for its versatility, wide ecosystem and the large community behind it. Many of the most active developer and programmer communities, such as Stack Overflow, contain numerous resolved questions and answers about doubts that may arise when using this language.37

As seen in the library overview, Python includes the packages needed to work with ML both at the research and prototyping level and at the production development level (even more so since it can be easily integrated with Big Data processing tools such as Spark through their APIs). In fact, Python is a very mature language that is not only intended for work within the ML field but is also used as a general-purpose language, which facilitates the development of end-to-end projects using a single language for everything, including the analytical process.

35 https://jupyterlab.readthedocs.io/en/stable/
36 https://colab.research.google.com/notebooks/intro.ipynb#recent=true
37 https://stackoverflow.com/


Python also offers conveniences such as exploring variables with IDEs like Spyder, and it has a fairly friendly syntax, relatively similar to that of other ML and analytics languages such as R or Matlab.

All of this has caused Python to grow in popularity both within the community of data scientists and other profiles working daily with ML algorithms, and within more general-purpose developer communities. In fact, the graphs below show that Python is the most popular programming language in use today.

Figure 30 Programming language usage within the data scientist community38

38 Source: http://blog.revolutionanalytics.com/2018/06/pypl-programming-language-trends.html


Figure 31 Python popularity increase in 201839

Figure 32 Comparison of Python evolution with other similar languages (R and Matlab)40

All of this means that most companies looking for candidates to work with Big Data and ML ask for knowledge and mastery of this language as a key skill. For this reason, it is considered that

39 Source: http://blog.revolutionanalytics.com/2018/06/pypl-programming-language-trends.html
40 Source: http://blog.revolutionanalytics.com/2018/06/pypl-programming-language-trends.html


working with it during the course will help to acquire a skill that will almost certainly be a good investment for the future.

Some existing Open-Source libraries to work with ML in other languages.

Although this course will be developed entirely in Python, there are other languages and platforms
that allow us to work with ML algorithm implementations. Some of them are:

▪ R: R is a statistically oriented language that has numerous ML libraries. It is widely used for all these applications both in industry and in academia. The programming language is available at https://www.r-project.org/ and the most popular IDE, RStudio, can be found at https://www.rstudio.com/

▪ Spark MLlib: Spark MLlib is the library of the Spark distributed computing platform that allows us to use different ML algorithms within it. Its documentation can be found at https://spark.apache.org/mllib/

▪ Weka: There are also libraries for working with many ML algorithms in Java, as is the case with Weka, developed at the University of Waikato. https://www.cs.waikato.ac.nz/ml/weka/


2. SIMPLE, MULTIPLE AND LOGISTIC LINEAR REGRESSION (I)

2.1 What is supervised learning?

As already explained in the Introduction, supervised learning refers to a set of ML algorithms that seek to generalize an equation relating a series of input parameters to the associated values of one or more output parameters, so that when new input data appears, the system can associate a possible value of that output variable with it.

The input parameters of these equations are generally referred to as features or predictors. The following example shows dummy data from different companies and the profit associated with them. An ML model could help a company predict its profit based on how its expenses are distributed and the city in which it is headquartered.

Figure 48

Types of tasks

Thus, supervised learning models will be used primarily for predictive tasks that aim to estimate
the value of a variable using a number of available features.


In addition to predictive tasks, another type of task that can be performed with data is descriptive tasks, which seek to identify patterns or relationships between a particular set of features or variables. This is mainly done with unsupervised models although, as will be seen below, it can also be done to a certain extent with supervised models, since many of them allow us to analyze how much the different variables contribute to predicting the output variable (for example, in the case described above, the city where the company is headquartered may have a big influence on the predicted profit), so a task that is also descriptive would be carried out. Among the descriptive tasks in which supervised models are applied are exploratory analysis tasks, which consist of examining correlations between variables, associations or dependencies.

Data types

In addition to the above, the type of data should be considered. In general, data can be classified into numeric data, binary data or categorical data:

▪ Numeric: These are both integer and real numeric variables. For example, a person's salary, age, or weight in kg...

▪ Binary: Binary data can take only two values: 1/0, True/False...

▪ Categorical: Categorical data is a generalization of the previous case where the possible
values are no longer just two categories but can be three or more. For example, a person's
name, the type of line contracted...

Figure 1 This example shows the types of data mentioned, applied to an example of a telco data table. In the binary data column there are two possible values depending on whether or not a customer has canceled the contract with the company (0/False, the customer has not canceled; 1/True, the customer has canceled).


An additional element to consider in reference to data types is that all ML algorithms work with data in numeric format. So, what happens to the categorical data mentioned above when it expresses a text value? In these cases, it will be necessary to convert the data to numeric format using a process known as category encoding (or label encoding). This process transforms the different existing categories into integer numeric values, in which each integer is associated with one of the categories. This provides a vector of numeric values that can already be used within all ML algorithms.

Figure 2 Example of previous data where each category is associated with a numeric value. For example, "Madrid"
becomes represented by the numeric value '2'.
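
A minimal sketch of this encoding with scikit-learn follows; the city names are invented, and the exact integer assigned to each category depends on the tool used (scikit-learn assigns them in alphabetical order).

from sklearn.preprocessing import LabelEncoder

cities = ["Madrid", "Paris", "London", "Madrid"]
encoder = LabelEncoder()
codes = encoder.fit_transform(cities)  # one integer per category, e.g. [1 2 0 1]
print(codes)
print(encoder.classes_)                # mapping back to the text labels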


Again, for all these algorithms we will have to use numeric arrays of input-output data. Thus, they can work with an arbitrary input data source as long as it can be transformed in some way into a numeric matrix representation. There are numerous branches within AI that research how to transform data sources such as videos, photos or texts into efficient numerical representations that allow these algorithms to be used on the extracted data. In this way, for example, applications such as the face detection on Facebook mentioned above can be built: the input images are expressed numerically so that, thanks to the use of algorithms similar to these, it has been possible to detect when there is a face in an image (and where it is) and when there is not.

2.2 Regression vs. classification

As mentioned in the introduction, the difference between one task type and the other resides in the type of the output variable. A regression problem works with numeric output variables, while a classification problem works with binary or categorical output variables.

A regression problem seeks to use a function (which can be linear or nonlinear) to relate the input features to a numeric output value (which is not restricted to a fixed set of values; it can take values that do not appear in the data with which the system is trained). The purpose of the model, given the training data, will be to adjust the input-output function so that the error between the predicted values and the real values is minimized (different evaluation metrics can be used for this purpose, as will be seen below).

Figure 51 Regression problem. Using the input features (such as data about a home) a function is used to predict a
numerical value (such as price)


Such input-output relationship functions can be displayed as equations within a coordinate system, so that they relate the value of certain coordinates (the features) to the value that, according to the function curve, corresponds to another coordinate (the predicted variable). Evaluating the function at the predictor coordinates gives the value of the prediction coordinate.

Figure 3 Simple example with two variables, where the predictor x is used to predict the value of the predicted variable y. Based on the available training data, an ML algorithm would produce a certain function and, with it, for any new value of x, the corresponding value of y. Depending on the type of ML algorithm, linear or nonlinear relationship functions will be produced. In blue are the points used for training, which the function tries to capture as well as possible.41

In the case of the classification problem, an attempt will be made to predict the category (from among those defined a priori, which have to be the categories used to train the system) to which a specific set of features belongs. The system will work with data associated with the categories and attempt to adapt the model function so that the number of examples that are mislabeled is minimized.

41 Source: https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php


Figure 4 Example of a classification problem. There are many medically relevant features that would characterize a
patient and we want to predict with that information whether or not we will have diabetes.

Classification models, from a graphical representation point of view, correspond to equations that relate the values of the coordinates (predictors) to the coordinate that is equivalent to the predicted variable. In this case the output variable does not correspond to the value of the function evaluated at those coordinates; rather, the function separates the space of values so that, depending on the combination of predictor values, one category or another is obtained, as shown in the following image.

Figure 5 This example shows 2 predictors (x and y) and the ML model, depending on whether it is linear or nonlinear,
would give one of the red lines that appear. Depending on the combination of x and y predictor values of any new
value observed, it would be classified into one of the two categories that appear (blue or gray). 42

42 Source: https://technologyadvice.com/blog/information-technology/data-mining-techniques-non-
analysts/


2.3 Simple, multiple and logistic Linear Regression: Initial Intuition

Linear regression models represent such input-to-output relationship functions in the simplest possible way.

This simplified form of relationship assumes that there is a linear relationship between the input
and the output, so the model corresponds to a line in Cartesian space that can be adjusted simply
by assigning appropriate values to the coefficients of that line.

The result is an equation that, when evaluated at new values of x, produces the predicted value of y.

2.4 Mathematical description

Simple linear regression

Simple linear regression is arguably the simplest case of ML model. It consists in predicting an
output variable by using a single predictor/regressor as the input variable, so that we would have
the following equation:

𝑦̃𝑖 = 𝛽0 + 𝛽1 × 𝑥𝑖

Where 𝑦̃𝑖 corresponds to a specific value of the output variable (𝑦) to be predicted, 𝑥𝑖 would
correspond to the predictor to be used to predict the output, 𝛽0 and 𝛽1 ⁡correspond to constants
that are precisely the ones that would have to be adjusted in the model, based on the training
data available, so that the output could be predicted in the best possible way.

Thus, this linear equation can be represented graphically as follows:


Figure 6 For the blue training points, the parameters β0 and β1 of the linear equation are adjusted so that the line fits
them as well as possible.

The question that would arise at this point is how to adjust the parameters of the line equation,
and to do this, as already advanced, we will use an error metric that we want to minimize.

The function error will be how much the red line is deviated from the blue dots, as indicated in
the following image:

Figure 7 It can be seen that the error is obtained from the differences between the y-values of the real points available
in the training data (blue) and the points that the model line gives as predictions of y (red) for the same x values (since
they lie on the same vertical).43

43 Source: https://towardsdatascience.com/simple-linear-regression-2421076a5892


In this way, one way to quantify the error is to take the difference between the vector of real values
of y and the vector of predicted values of y, as shown in the following equation:

𝑒 = ⁡ 𝑦̃ − 𝑦

This error can be aggregated as the residual sum of squares (RSS) as follows:

RSS = \sum_{i=1}^{n} (\tilde{y}_i - y_i)^2 = \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2

b^{*} = \arg\min_{\beta_0, \beta_1} RSS

So, we have a quadratic error function that we want to minimize. This can be done simply by
computing the partial derivative with respect to each of the parameters and setting the resulting
equations equal to 0, so that a system with as many equations as partial derivatives (two in
this case) is obtained which, when solved, gives the values of the coefficients that minimize the
error function. This is what is known as point estimation of the coefficients.

\frac{\partial RSS}{\partial \beta_0} = 0, \qquad \frac{\partial RSS}{\partial \beta_1} = 0

Solving this system of equations yields the values of the two parameters \beta_0 and \beta_1; the standard
closed-form results are \beta_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2 and \beta_0 = \bar{y} - \beta_1 \bar{x}.


Figure 8 There are infinitely many lines that could potentially relate the input variable to the output variable, but we
want to find the one that minimizes the previous error given the available data.

This way of adjusting the parameters is an analytical solution: starting from those equations, the
exact values of the parameters that give the minimum are obtained. It is not the only way to
compute them; for more complex problems, where for example the number of parameters is very
high and the system of equations is therefore very large, it is more efficient to find approximate
solutions using numerical methods or simulation methods, some of which will be mentioned later
in the course.
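
As a quick illustration of this closed-form solution, the following is a minimal sketch (the x and y arrays are hypothetical, used only for illustration) that estimates β0 and β1 with NumPy and cross-checks the result with np.polyfit:

import numpy as np

# Hypothetical training data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)

# Cross-check: np.polyfit returns [slope, intercept] for degree 1
print(np.polyfit(x, y, deg=1))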

Multiple linear regression

Multiple linear regression is a generalization of the previous case in which any number of input
variables can be used to obtain a prediction of the y value.

𝑦̃𝑖 = 𝛽0 + 𝛽1 × 𝑥1𝑖 ⁡ + 𝛽2 × 𝑥2𝑖 + 𝛽3 × 𝑥3𝑖 + ⋯ + 𝛽𝑛 × 𝑥𝑛𝑖

In case of multiple linear regression, the obtained equation would correspond to a hyperplane,
which could be observed graphically up to the case of 2 predictors as shown below.


Figure 9 Example of a regression hyperplane for 2 predictors and an output variable 44

Moreover, the mathematical development for obtaining the coefficients is a direct generalization
of the previous case, and the steps required to obtain them are completely analogous (except that
there is a larger system of equations to solve).

In the case of, for example, two predictors, the equation would be:

𝑦̃𝑖 = 𝛽0 + 𝛽1 × 𝑥1𝑖⁡ + ⁡ 𝛽2 ⁡ × 𝑥2𝑖

The problem to be minimized is again the RSS, whose partial derivatives with respect to β0, β1 and β2 are set equal to zero.

44 Source: https://www.mathworks.com/help/stats/regress.html


This gives a system of 3 equations with 3 unknowns, which is solved by evaluating it over all the
points of the available input data.

Generically, obtaining the optimal coefficients for the multiple linear regression problem using the
analytical solution would be:

\tilde{y} = X\beta

with \beta = (\beta_0, \beta_1, \ldots, \beta_n)^T and X the design matrix whose first column is all ones (corresponding
to \beta_0) and whose remaining columns contain the predictor values, so that each row of X is one observation
from the training data.

With this, the error would be45:

J(\beta) = \frac{1}{2n} RSS = \frac{1}{2n} \sum_{i=1}^{n} (\tilde{y}_i - y_i)^2 = \frac{1}{2n} (X\beta - y)^T (X\beta - y)

By operating the equation, we get to46:

45 The factor 1/2n is included so that the two expressions are equivalent; being a constant, it does
not affect the optimization process as such.
46 The factor 1/2n is dropped from this point on, as it does not affect the optimization process.


J(\beta) = ((X\beta)^T - y^T)(X\beta - y) = (X\beta)^T X\beta - (X\beta)^T y - y^T X\beta + y^T y = \beta^T X^T X \beta - 2 (X\beta)^T y + y^T y

Differentiating this expression with respect to \beta we get:

\frac{\partial J(\beta)}{\partial \beta} = 2 X^T X \beta - 2 X^T y = 0 \;\Rightarrow\; X^T X \beta = X^T y

If the matrix X^T X is invertible:

\beta = (X^T X)^{-1} X^T y

This is the closed-form expression for the coefficients of the linear regression model obtained with
the analytical calculation.
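
As a minimal sketch of this closed-form solution (with small hypothetical arrays, not the course datasets), the coefficients can be computed directly with NumPy and compared against scikit-learn's LinearRegression:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 5 observations, 2 predictors (illustration only)
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.5, 3.0, 7.5, 7.0, 10.5])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)  # [beta_0, beta_1, beta_2]

# Cross-check with scikit-learn
reg = LinearRegression().fit(X_raw, y)
print(reg.intercept_, reg.coef_)

In practice, np.linalg.lstsq or np.linalg.pinv is numerically preferable to forming the inverse explicitly, although the result is the same for well-conditioned problems.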

2.5 Interpretation of coefficients

In the equations of the previous lines or planes there appear, on the one hand, the coefficients that
multiply the predictors, which represent how much the predicted variable varies for each unit
change in that predictor. For example, if in the simple regression equation x represents the size
in m² of homes and y their price in euros, the coefficient b1 quantifies how the increase/decrease
of each m² affects the price of the house. For example, if the equation were:

𝑦̃𝑖 = 2000 × 𝑥𝑖

Then for every square meter added/removed, the price would increase/decrease by 2000 euros.

The interpretation in the multivariate case is slightly more involved but remains simple: assuming
that all variables except x_n stay constant, b_n represents how much the output varies for each
unit that x_n increases/decreases (it is important to note that, for this to hold, the other variables


must indeed remain unchanged, so that the isolated influence of a single variable can be
considered). It is in this sense that the coefficients can be seen as slopes of the line that relates
the output to one of the variables.

The independent term b0, on the other hand, is harder to interpret, as it basically indicates an
offset when its value is different from 0. It is called the intercept and serves, mainly, to better fit
the model and to absorb biases or influences that are not captured by the parameters introduced
in the model but that could still be relevant to it.

The next question that arises is how to know which coefficients really influence the prediction of
the output, which ones are less relevant, which ones influence it the most, and so on. This can be
checked with a quantity called the p-value, which is the result of a statistical study on the data that
answers the question of whether or not a given coefficient is influential on the value of the output
(this is done with a technique called hypothesis testing).47

This p-value, which can be obtained automatically with different software tools, takes a specific
value for each coefficient and represents how statistically significant each of these coefficients is.
A comparison value α is typically defined, usually 0.05. In this way, when a p-value is greater than
the defined value of α, that parameter is considered not significant, whereas if it is lower, that
parameter cannot be discarded from the regression function, as it is influential.

Figure 10 This example shows how the variable 'East' has a p-value > 0.05, so it should be discarded from the
regression model as it is not an influential variable for the output value.48

47 For more information on hypothesis contrasts it is recommended to consult


https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
48 Source: http://scaryscientist.blogspot.com/2015/06/how-to-interpret-regression-analysis.html


This is because the hypothesis test defines two alternatives, a "null hypothesis" (H0) and an
"alternative hypothesis" (H1). In this case H0 corresponds to saying "the variable does not
influence the output"; thus, when the p-value < 0.05 we can reject that assertion and conclude that
the variable does influence the output. Conversely, if the p-value > 0.05 it cannot be rejected, and
therefore that variable may not actually influence the output.

2.6 Assumptions on linear regression models

An important point to consider now is that linear regression models have a number of assumptions
that must be met in order to mathematically justify the conclusions of the model. These
assumptions are on the one hand assumptions in relation to model variables and on the other
hand are assumptions about model errors, also known as residuals.

▪ Linear relationship: One of the main assumptions is that the output variable and the input
variables have a linear relationship, so that the relationship can be captured with the
equations seen above.

Figure 11 A clear example in which we can visually see that the relationship between the two axes is nonlinear,
so that it could not be represented with the linear models seen. Linear relationship assumption would not be
met49

49 Source: http://www.statisticssolutions.com/assumptions-of-linear-regression/


One way to be able to check this, in addition to visually viewing the data (as in the graph
above) and reflecting on whether or not there is likely to be linear relationship, is to analyze
how the model residuals are distributed. If these are not randomly distributed but follow a
specific pattern, it will indicate that the assumption that the data is linearly related is invalid.

Figure 12 Example in which the residuals follow a clear pattern, indicating that the original data do not have a linear
relationship but instead contain nonlinearities.50

In general, linearity can be checked using different hypothesis tests, such as the Harvey-
Collier test, which are implemented in various software libraries.

50 Source: https://medium.com/@dhwajraj/learning-python-regression-analysis-part-9-tests-and-validity-
for-regression-models-78dcd5cde3a1


Figure 13 Unlike the previous case, in this residual plot we can see that the errors apparently do not follow any kind of
pattern.51

Solutions:

For these cases where the data does not have a linear relationship, the following can be
done:

1. Transform the input data with some statistical transformation method by


means of which the relationship between them becomes linear (this will be
seen in the nonlinear regression section).

2. Use a different algorithm.

▪ There is no autocorrelation in the residuals: A particular way in which the linear
relationship of the data fails occurs when strong correlations appear between the errors.
That is, not only is there a pattern, but the errors are also related to previous error values,
showing, for example, cyclic patterns.

51 Source: https://zhiyzuo.github.io/Linear-Regression-Diagnostic-in-Python/


Figure 14 In the image of the graph in red we can see how the patterns are repeated with a certain frequency. 52

In these cases, a fundamental assumption is violated: the data is not i.i.d. (independent
and identically distributed) so some values of a given variable are correlated with other
previous values of the variable.

o Identically distributed, it refers to the absence of increasing/decreasing trends in


the data and that they can be considered to come from the same distribution.

o Independent refers to each point in the data being a separate event from the
others and therefore there is no connection between them.

For example, if the variable were the price of housing for different months of several years,
we could see a cyclical evolution in relation to the rises and explosions of the housing
bubble and therefore the data would not be independent.

For these cases it is recommended to use other algorithms, such as time series, which
will be seen in the last part of the course.

52 Source:
https://dziganto.github.io/data%20science/linear%20regression/machine%20learning/python/Linear-
Regression-101-Assumptions-and-Evaluation/


Again, the way to check this is to look for autocorrelation in the residuals, by inspecting the
list of all the residuals obtained and checking whether the residuals at a given position
depend on the residuals at previous positions.

There are different hypothesis tests, such as Durbin-Watson, whose H0 states that the data
do not have autocorrelation, so that if the p-value is less than the defined threshold this
hypothesis is rejected and autocorrelation is detected in the data.

▪ Same variance of the residuals/errors (homoscedasticity): Another factor to be checked
in relation to the residuals is that their distribution must have the same variance
(homoscedasticity). In addition to checking it visually, as in the graph below, there are,
again, different tests, such as Breusch-Pagan or Goldfeld-Quandt, implemented in
statistical libraries in various programming languages that allow us to check it
automatically. In these tests, a hypothesis test is performed whose H0 states that the data
are homoscedastic, so that if a very low p-value is obtained that hypothesis is rejected and
the condition is said to be violated (a short sketch of this and other checks appears at the
end of this section).

Figure 15 Both residual distributions look random, but the one on the right is heteroscedastic, as its variance can be
seen to increase progressively. The one on the left, apart from some isolated values, remains homoscedastic.53

53 Source: https://stats.stackexchange.com/questions/52089/what-does-having-constant-variance-in-a-
linear-regression-model-mean


Solutions:

1. One possible solution will be to transform the data with some statistical transformation
method (i.e., logarithmic transformation) and see if that stabilizes the variance.

▪ Normal distribution of residuals: In general, the problem of minimizing the squared error
is solved assuming that the distribution of the residuals/errors of the model follows a normal
distribution, and therefore that their mean is zero (or almost zero) and their variance is
constant. It is therefore necessary to verify that the errors can actually be assumed to come
from a normal distribution. So, as the following equations indicate, the error term is a
random element drawn from a normal distribution.

𝑦 − ⁡ 𝑦̃ = 𝑒⁡

𝑦𝑖 = 𝑏0 + 𝑏1 × 𝑥𝑖 + 𝑒

𝑒⁡~⁡𝑁(0, 𝜎 2 )⁡

This can be verified by representing the error values with a histogram and seeing if their
distribution looks normal, or by using other charts such as Q-Q plots, where we see how far
the data deviate from a normal reference. Again, there are different hypothesis tests, such
as Anderson-Darling, to generically evaluate the normality of a dataset.


Figure 16 Example in which a histogram helps to see whether or not the distribution of errors is normal.54

Figure 17 Another way to look at it, as noted, is with a Q-Q Plot.55

54 Source: https://data.library.virginia.edu/normality-assumption/
55 Source: https://zhiyzuo.github.io/Linear-Regression-Diagnostic-in-Python/


Figure 18 What we are looking to verify is that each of the errors can be assumed to come from a normal distribution.56

▪ Predictor independence: One of the most critical aspects to consider in regression
models is that they assume that the variables used as inputs are independent of each
other. That is, if the price of a house is being predicted using input variables such as its
size and location, the data of the variables used must not be correlated (that is, in this
case, the size of the house should have nothing to do with where it is). If there is a correlation
between two variables, one of them will have to be removed from the set of variables used,
because the information it provides to the model predictions is already contained in
the other variable. If the variables are correlated (a phenomenon known as
multicollinearity), the least squares solution mentioned at the beginning, with which
the coefficient values are obtained, cannot be guaranteed, so the coefficients will not
be correctly estimated. Thus, the biggest danger in this case is that the influence of each
particular variable within the overall prediction would be miscalculated because of that
poor coefficient estimation.

56 Source: http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis


It should be said that in general there will always be some correlation between the
variables (in fact, intuitively in the previous example we can see that there is), but what
has just been said will apply for cases where the correlation between them is very strong.

The way to check this is by computing a correlation matrix in which the rows represent
some variables and the columns others, and either checking whether any row-column pairs
are close to 1 (perfect correlation) or calculating the eigenvalues of the correlation matrix
and seeing if any of them are close to 0 (which also indicates multicollinearity).

Figure 19 Example of a correlation matrix. Values close to 1/-1 indicate strong correlation between those variables.
The values of 1 on the diagonal correspond to each variable's correlation with itself.57

It is important to note that the presence of multicollinearity in a model does not, in general, make
its predictions worse (the expected final value is not affected), but it does affect the interpretation
of the coefficients that appear in the problem and the individual contribution attributed to each of
them (not the overall contribution of all of them: a poorly estimated coefficient is compensated by
the coefficients of the correlated predictors). A minimal sketch of some of the checks described in
this section is shown below.
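
The following is a minimal sketch of some of the checks mentioned in this section, assuming a fitted statsmodels OLS model called model and a pandas DataFrame of predictors called df (both names are placeholders, not tied to a particular dataset):

import numpy as np
import statsmodels.stats.api as sms
from scipy import stats

residuals = model.resid

# Homoscedasticity: Breusch-Pagan test (H0 = constant variance of the residuals)
lm_stat, lm_pvalue, f_stat, f_pvalue = sms.het_breuschpagan(residuals, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)

# Normality of the residuals: Anderson-Darling test
ad = stats.anderson(residuals, dist='norm')
print("Anderson-Darling statistic:", ad.statistic, "critical values:", ad.critical_values)

# Multicollinearity: correlation matrix and its eigenvalues
corr = df.corr()
print("Smallest eigenvalue of the correlation matrix:", np.linalg.eigvalsh(corr).min())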

2.7 Additional considerations

▪ Treatment of outliers:

Outliers are values that are clearly different from the other available values. They will
distort the calculation of the mean squared error and therefore the

57 Source: https://stackoverflow.com/questions/30681832/how-to-create-a-symmetry-correlation-matrix-
easier-in-r


calculation of the equation parameters. For this reason, they should be taken into account,
either by detecting them with an exploratory analysis, with statistical methods such as
box-plot diagrams, or with more complex anomaly-detection techniques.

Each problem is different, and outliers do not always arise for the same reason. They may
be malformed data that should be deleted (for example, an erroneous point measurement
from a sensor: in that case it would be removed and the other measurements used), or they
may be data that do make sense, in which case it has to be decided whether to cap them
at a maximum or minimum reference value, leave them as they are, etc. (a small detection
sketch is given at the end of this section).

Figure 20 Example where we can see the existence of two clear outliers in front of the rest.

This is not, strictly speaking, an assumption of linear regression; the model can in fact be
fitted with anomalies present, but they will negatively affect the result obtained.

Treatment:


1. Understanding why these values exist

2. Elimination of outliers (where applicable)

3. Allocation of reference values.

▪ High-leverage points:

This type of data is a particular case of outliers that are characterized by lacking other
nearby points or observations, so when trying to fit a model with that data the line will try
to pass near them and greatly affect the result obtained.

Figure 21 Example of data with high leverage. In red an equation is adjusted with them and in black without them. It
can be observed that they differ significantly.58

58 Source:
https://dziganto.github.io/data%20science/linear%20regression/machine%20learning/python/Linear-
Regression-101-Assumptions-and-Evaluation/


These points may or may not affect the main assumptions of linear regression. Specifically, they
can cause residuals to no longer be normally distributed or randomly distributed. For this reason,
it is something that should be explored with caution.

Figure 22 For the above data it can be observed that, when the high-leverage points are kept, the residuals show
patterns, whereas when they are removed this trend disappears.

Treatment:

1. Understand why these values exist.

2. Delete them or assign reference values to them (where applicable).

3. Use a more robust error function (i.e., Huber59).

4. Use more robust algorithms (i.e., RANSAC), as in the sketch below.

59 Source: https://en.wikipedia.org/wiki/Huber_loss
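
As a minimal sketch of these ideas (the arrays below are hypothetical, for illustration only), the box-plot/IQR rule can flag suspicious points, and scikit-learn's HuberRegressor and RANSACRegressor can be compared against ordinary least squares; this is just one possible treatment, not the only one:

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

# Hypothetical data: a linear trend plus two high-leverage observations at x = 20, 21
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 20, 21], dtype=float).reshape(-1, 1)
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.1, 8.0, 2.0, 2.5])

# Box-plot (IQR) rule applied to the predictor to flag the isolated points
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
flagged = np.where((x.ravel() < q1 - 1.5 * iqr) | (x.ravel() > q3 + 1.5 * iqr))[0]
print("Flagged by the IQR rule:", flagged)

# Ordinary least squares vs. robust alternatives
print("OLS slope:   ", LinearRegression().fit(x, y).coef_)
print("Huber slope: ", HuberRegressor().fit(x, y).coef_)
print("RANSAC slope:", RANSACRegressor(random_state=0).fit(x, y).estimator_.coef_)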


3. SIMPLE, MULTIPLE AND LOGISTIC LINEAR REGRESSION (II)

3.1 Logistic Regression

The models shown above serve regression problems because evaluating the values of a
particular set of input variables on the regression line results in a specific numeric value
associated with the output.

Could techniques similar to these models be used for classification problems? The answer is yes,
and this is achieved with a specific regression type called logistic regression.

The concept behind this model is to try to predict not a numeric value as the above methods do,
but to predict the probability that the dataset of the input variables is of a certain class between
two possible classes (binary classification).

To do this, the linear regression equation itself is used as a reference and its expression is
transformed so that the output is a value that satisfies Kolmogorov's probability axioms; chiefly,
that it be a value between 0 and 1.

Imagining the case of the simple linear regression equation (although it is also valid for the
multiple), starting from it:

𝑦̃𝑖 = 𝑏0 + 𝑏1 × 𝑥𝑖

We can use the logistic function to express that result as a probability, because that function
always produces a value bounded between 0 and 1:


Figure 23 Equation of the logistic function, p(X) = e^(b0 + b1·X) / (1 + e^(b0 + b1·X)).

So, for the simple and multiple cases, we would have the following equations for the probability:

Figure 24 Comparing a logistical equation versus simple linear regression60

60 Source: https://www.saedsayad.com/logistic_regression.htm


The above equation can be defined in turn as follows:

\log\left( \frac{p(X)}{1 - p(X)} \right) = b_0 + b_1 \times X

Where the first term is called the logit, which makes it easier to quantify how the system evolves
(in terms of that logit) as each value of X varies.

Parameter fitting cannot be approached in the same way as before, with the direct calculation of
partial derivatives, because for nonlinear equations such as these the differentiation process
becomes significantly more complicated. It is more common, therefore, to use other methods such
as Maximum Likelihood Estimation (MLE).

The MLE method seeks to be able to estimate the model parameters given the specific points of
known data by seeing which parameters maximize a function known as the likelihood function61.

Intuitively, this function seeks to maximize the probability of getting the classes of the output
variables from the training dataset given specific coefficients of the equation and given the known
data of the input variables.

Considering that the parameters are grouped together in a single variable, the likelihood function is:

This function represents the probability of, with the available input data and generic parameters,
generating the different output classes62.

61 Source: https://www.stat.rutgers.edu/home/pingli/papers/Logit.pdf
62 The likelihood function is therefore a function that represents the probability of, given specific variables
and parameters, generating a specific output. The probability function will change depending on the given
problem to which the likelihood function is being applied, in this case being the probability function obtained
from the logistics function used in this regression. Moreover, a function that relates inputs-outputs-


All of this assumes that the data are i.i.d., so that the probability of obtaining all the classes with
the available data corresponds to the product of the individual probabilities of obtaining each of them.

This is equivalent to:

With:

Estimated parameters that would maximize the above problem would be expressed generically
as:

parameters has been defined by a different path and should be optimized to, in this case, maximize the
probability in question, but is already conceptually analogous to how it was resolved for least squares in
linear regression. In fact, the coefficients in the case of linear regression could also have been obtained by
MLE; In this case of logistic regression, however, it is easier by MLE than RSS.


Because what we are working with is an optimization problem in which we want to find the values
of the parameters that maximize the likelihood of generating the observed classes with those
parameters and the input data, instead of using the likelihood function as such we can use its
logarithm, because the logarithm is an increasing function and does not change the nature of the
maximization problem. In this case, the above equations can be expressed as follows:

Thus, the estimated parameters would be expressed as:

This likelihood equation can be expressed similarly as follows:

With:


And the above likelihood equation, if resolved, would be expressed as:

As an example, for the simple case of a single variable, the resolution would be:

This would be the starting equation to optimize, from which we could get your first and second
derivatives as follows:


These equations cannot be solved analytically and are therefore solved numerically using methods
such as Newton's method or gradient descent.

Generally, the parameters \tilde{\beta} obtained by optimizing the least squares problem relate to the input
and output values as follows (in matrix form):

\tilde{\beta} = (X^T X)^{-1} X^T Y

The last additional consideration is that, since this is a classification problem, for a given set of
inputs the final output will be either 0 or 1, corresponding to the limiting values of the logistic
equation.

Each time the equation is evaluated at new points we will get a value from the regression curve
that will be between 0 and 1. This value is a probability to be associated with one of the two
specific classes. One way to do this is to consider a threshold value such as 0.5 so that any
probability above 0.5 is associated as belonging to category 1, while if they are lower, they are
associated with category 0.
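
As a minimal sketch of this idea (the small arrays below are hypothetical, for illustration only), scikit-learn's LogisticRegression fits the coefficients by maximizing the (log-)likelihood and applies the 0.5 threshold by default when predict is called:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-classification data: one feature, classes 0 and 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Probability given by the logistic function for a new observation
print(clf.predict_proba([[4.5]]))  # [P(class 0), P(class 1)]

# Class obtained by applying the 0.5 threshold to P(class 1)
print(clf.predict([[4.5]]))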

3.2 Multinomial logistic regression

This type of regression is an extension of the previous regression with which multiclass problems
can be considered. To do this, each of the different categories of the M output variables are
encoded in M - 1 vectors of length M in which all values are 0 except in any case one of them to
indicate the associated category.


A model is then defined to estimate the different binary values that have been defined with this
transformation.

Figure 25 Example of encoding in multiple binary values

3.3 Nonlinear regression

On numerous occasions, as discussed, the data may be distributed nonlinearly and the models
above may not be adequate. For these cases two possible solutions can be applied: transform the
data so that the relationship becomes linear, or apply a polynomial regression function that fits the
data with a nonlinear function (quadratic, cubic, and so on).

Data Transformation:

The main transformations that can be applied to the data are:

▪ Take the logarithm of the output values.

(𝑋, 𝑌) → (𝑋, log(𝑌))

In this way the function, for a case of a variable, would become:

\log(Y) = a + b \times X \;\rightarrow\; Y = e^{a + b \times X} \;\rightarrow\; Y = \alpha \times \beta^{X}


▪ Take the logarithm of the input and output values:

(𝑋, 𝑌) → (log(𝑋) , log(𝑌))

With this:

\log(Y) = a + b \times \log(X) \;\rightarrow\; Y = e^{a} \times X^{b} = \alpha \times X^{b}

▪ Take the inverse value of the output variables:

(X, Y) \rightarrow (X, 1/Y)

With this:

\frac{1}{Y} = a + b \times X \;\rightarrow\; Y = \frac{1}{a + b \times X}


Figure 26 Examples to graphically see the effect of transformations.63

The models that are fitted and used for prediction are the transformed (i.e., linear) models. When
new data arrive, the X values are transformed where relevant, an output is obtained, and the
transformation has to be undone on that output whenever Y was transformed in the model (see
the sketch below).
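
As a minimal sketch of the first transformation above (hypothetical data, illustration only), an exponential relationship Y = α × β^X can be fitted by regressing log(Y) on X and then undoing the transformation on the predictions:

import numpy as np

# Hypothetical data following an approximately exponential trend
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])

# Fit the linear model log(Y) = a + b*X on the transformed output
b, a = np.polyfit(X, np.log(Y), deg=1)

# Undo the transformation to predict on the original scale: Y = e^(a + b*X)
print(np.exp(a + b * np.array([6.0])))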

Polynomial regression:

Polynomial regression follows the structure of the equations already seen, except that terms of
order 2 or higher now appear in it:

𝑦̃𝑖 = 𝛽0 + 𝛽1 × 𝑥𝑖⁡ + ⁡ 𝛽2 × 𝑥𝑖⁡ 2 + ⋯

63 Source: https://people.revoledu.com/kardi/tutorial/Regression/nonlinear/NonLinearTransformation.htm
In this reference there are numerous transformations in addition to those mentioned


With these equations we have a generalization of the linear models with which we can potentially
always obtain better fits, in exchange for models that are more complex to adjust numerically
(higher-order terms to differentiate, in addition to the complexity inherent in the larger number of
parameters).

Figure 27 Example of a nonlinear dataset that is adjusted using the quadratic equation shown in the image. 64
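
A minimal sketch of polynomial regression with scikit-learn (hypothetical arrays, illustration only): PolynomialFeatures expands x into [1, x, x²] and an ordinary linear regression is then fitted on the expanded features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data with a clear quadratic trend
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1])

# Degree-2 polynomial regression: y ~ b0 + b1*x + b2*x^2
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[6.0]]))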

3.4 Overfitting

Overfitting65

One problem that begins to appear mainly with these more complex models (and which is a danger
to consider in all supervised models, not just these) is so-called overfitting. It may happen that the
model fits the available data too well but, in return, loses

64Source: https://onlinecourses.science.psu.edu/stat501/node/325/
65 Source: https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-
lasso-regression/


generalization power and that therefore, when predicted on new data, predictions are getting
worse and worse.

Figure 28 Illustrative example of what is being described. In the model on the left, linear regression has been used for
clearly nonlinear data, so the model underfits the data. In the model on the right, a polynomial model of very high order
has been fitted too closely to the available data. The optimal solution is to find a balance between the two extremes.66

Figure 29 analogous to the previous one with illustrative equations.67

66 Source: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-
how-to-deal-with-it-6803a989c76
67 Source: https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-
lasso-regression/


Figure 30 The main problem is that an underfitted model will give predictions with a lot of bias and poor evaluation
metrics. An overfitted model will surely give good metrics, but the variance of the values it estimates will be very high.
Ideally, the inferred values should have little variance and high accuracy (and therefore low error).

The issue of overfitting will be analyzed in much more detail at the end of this block, and
although there are numerous ways to avoid it, specifically for regression functions two
modifications of the models seen can be used, known as ridge regression and lasso regression.

3.5 Regularization in regression models

Ridge regression

As mentioned in the previous section, the solution to improve a model that has high bias is
simply to add more variables to make the fit more precise. However, increasing the number of
variables indefinitely will eventually reach a point where the model ends up overfitting. To avoid
this, the general advice is not to use excessively complex models, but there are also other
techniques to avoid such overfitting, so-called regularization techniques, including ridge or L2
regularization.


Regularization techniques serve, in general, to reduce the magnitude of the coefficients once a
specific system with a series of parameters has been defined, so that it does not happen, as it can
on certain occasions, that almost all the influence falls on a particular group of them.

Figure 31 Influence of coefficients; it can be seen that some of them have significantly more influence than others in
this example.68

68 Source: https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-
lasso-regression/


In the specific case of ridge regression, a regularization term is added to the cost function (for
example, the RSS in the case of linear regression), as shown below:

J(\beta) = RSS + \lambda \sum_{j=1}^{n} \beta_j^2

The objective of this regularization is to reduce the influence of the coefficients so that the
influence of all of them is more homogeneous and potential overfitting due to an excessive
influence of certain coefficients, which could adversely affect the subsequent generality of the
model, can be avoided. This reduction in the influence of the coefficients is greater the more the
cost function is regularized/penalized, that is, the higher the value of the parameter λ. An optimal
value of λ has to be found by minimizing that corrected cost function, so that we obtain as much
generality as possible without losing model performance.


Figure 32 It can be seen how, for a given value of λ, the magnitude of the coefficients is reduced. In this type of
regularization none of the coefficients, however much they shrink, becomes exactly zero, so the number of parameters
of the model always remains constant.

Lasso regression

Another way to carry out a regularization is with so-called lasso regularization or L1 regularization.

Unlike the previous case that attempts to homogenize the influence of the different coefficients,
the lasso adjustment reduces to zero the influence of coefficients that it does not consider
significant in order to leave only those that are detected as relevant to have a model that
generalizes correctly. In this way what differentiates this method from the ridge is that it carries
out something known as feature selection with which the most relevant parameters for the model
are chosen in order to simplify it as much as possible.


The expression of the cost function with this adjustment, which uses the L1 norm, is as follows:

J(\beta) = RSS + \lambda \sum_{j=1}^{n} |\beta_j|

Figure 33 It can be seen how, out of all the variables, only 4 end up being selected as relevant.


ElasticNet Regression

The ElasticNet regularization method is a hybrid method that attempts to combine the benefits of
the previous two methods, using L1 and L2 regularization.

ElasticNet attempts to generalize better than lasso by analyzing sets of variables in subgroups. If
one of the variables is detected as a good predictor of the output variable, ElasticNet includes the
entire subgroup (unlike the lasso method, which would include only that variable). This is
interesting because sometimes removing certain variables could lead to a less effective model.

Figure 34 ElasticNet mitigates the influence of the coefficients while also eliminating variables that are not relevant to
the predictive model (unlike ridge regression).
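
As a minimal sketch of the three regularized variants in scikit-learn (hypothetical data; the alpha argument plays the role of the regularization parameter λ discussed above):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Hypothetical data: 50 observations, 5 features, only the first two are informative
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(50)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrinks coefficients, none become exactly 0
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: drives irrelevant coefficients to exactly 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # combination of L1 and L2

print("Ridge:     ", ridge.coef_)
print("Lasso:     ", lasso.coef_)
print("ElasticNet:", enet.coef_)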


Types of regularization

The two regularizations that have been seen, L1 and L2, lead to different solutions due to their
own nature.

In general, a regularization term is expressed as follows in terms of the model parameters:

\lambda \sum_{j=1}^{n} |\beta_j|^p

To see this in a simple way, if we take the case of two parameters, the above expression reduces
to \lambda (|\beta_1|^p + |\beta_2|^p).

Depending on the value of p the graphical representation for the case of two variables would be:

Figure 84

L1 regularization uses p = 1 and L2 uses p = 2 (they are, in fact, the main regularizations that
are used).


If we represent the problem of fitting two parameters for the ridge case and the lasso case, we
would have the following for an example case:

Figure 35 In red the solution space of the variables appears and in blue the solution space for the terms of
regularization L1 (left) and L2 (right).

For the minimization, the regularization term alone would be minimal if the parameters were zero
(the coordinate origin), but that solution lies outside the solution space of the regression equation.
The point where the penalized least squares solution is minimal is precisely where both regions
intersect. Comparing both graphs, we can see that in L1, because its shape has corners, it is much
easier for the intersection with the red solution space to fall on one of the axes, which implies that
the corresponding parameter is 0 and can therefore be discarded. For a parameter to become 0
with the L2 shape is much harder, since it would require the intersection of both regions to fall
exactly on an axis.

3.6 Training data (train), validation (val) and test.

An important task in building ML models is to be able to evaluate the models and see how they
can better generalize. This evaluation of a model should be performed on new data that the model
has not seen before.


In order not to have to wait for new data, what is usually done is, instead of training the model with
all the available data, to reserve a percentage of it to perform an evaluation at the end and measure
the performance of the ML model.

Thus, datasets will usually be decomposed into the following subsets (a splitting sketch is shown after the list):

▪ Train: This is the dataset with which the model is trained. It is usually the largest of the
three (so that the model can be adjusted correctly) and will include approximately 75-80%
of all available data.
▪ Val: It is not always necessary, but there are often internal parameters of the model that
must be pre-selected (known as hyperparameters) and the way to choose them is to check
which ones give the best results over this second dataset. It is usually between 5-15% of
the total size.
▪ Test: This is the final evaluation set, used once the model has been definitively
chosen and specified (that is, when no further changes are to be made to it). It serves to
give a final evaluation of the model and to estimate how well it will generalize based
on the error obtained. Its size is usually 5-15% of the total set.
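
A minimal sketch of such a split using scikit-learn's train_test_split applied twice (X and y are placeholders for the full dataset; the 75/15/10 proportions are just an example):

from sklearn.model_selection import train_test_split

# First reserve the test set (10% of the data)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

# Then split the remainder into train (~75% of the total) and validation (~15% of the total)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.1667, random_state=0)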

3.7 Example 1

This first example provides a set of data on a company's different employee salaries based on
their years of experience (regardless of profile). We could therefore look to build a simple
regression model in which with the YearsExperience input variable we can try to predict the
associated salary we will have (Salary).

# Import libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

# Load dataset

df = pd.read_csv('Salary_Data.csv')


# Visualize data

df.plot(x='YearsExperience', y='Salary', title="Salary Evolution according to Years of Experience")  # Linear relationship can be seen

Figure 89

This is how the available data evolves.

A simple regression model will have the following equation:

𝑦 = ⁡ 𝑏1 ⁡ × 𝑥

However, before building the model, the data is separated into two subsets, one training to fit the
model and one test to be able to make evaluations later (for this simple initial problem the
validation set will not be used).


# Input/output split

X = df.iloc[:, :-1].values

y = df.iloc[:, 1].values

# Train/Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

And with this the model is built using the statsmodels library.

import statsmodels.api as sm

model = sm.OLS(y_train, X_train).fit()

Once the model is built, it is checked for the hypotheses mentioned above for linear regression.

First, we can view a summary of the model as follows:

model.summary()


Figure 90

In this summary we can see both generic model information and specific information about the
model's coefficients (in this case, since there is only one coefficient because it is a simple linear
regression, only the row for x1 appears).

Starting with the analysis of the model parameters, we can see the coefficient value of the
regression model (coef) as well as the p-value indicating whether or not this coefficient is
significant. When P>|t| is less than or equal to the significance level α (usually 0.05), H0 (which
states that the variable has no relation to the output variable) is rejected, and therefore that
variable is relevant in the model. However, when it is greater than α, it cannot be assured that the
variable is relevant. In this case, since 0.000 appears (indicating that it is practically 0), the
parameter is considered relevant.

At the model level in general, several metrics are presented, such as AIC, BIC... indicating the
effectiveness of the model. These metrics can be used to compare models and see which ones
perform best. Specifically, AIC and BIC metrics are especially interesting as they penalize
complex models that use many variables, so they try to promote greater simplicity.

The various points mentioned in the hypotheses of a linear regression model are then checked.


▪ Linearity:

Harvey-Collier

Checking for a linear relationship between the input and output variables can be done using
the Harvey-Collier test:

#### Linearity

# Check linearity

import statsmodels.stats.api as sms

sms.linear_harvey_collier(model)

# p-value = 0.272 > 0.05 -> Does not reject H0 = there is linearity

Since the obtained p-value is 0.272 > 0.05, H0, which states that there is linearity, is not
rejected.

▪ Residuals normality:

This is checked as follows:

#### Normality residuals

#Get residuals

from statsmodels.compat import lzip

residuals= model.resid

# Histograms

plt.hist(residuals, range=(-45000, 45000))


Figure 33 Histogram of residuals

# Q-Q Plot

import scipy as sp

fig, ax = plt.subplots(figsize=(6,2.5))

_, (__, ___, r) = sp.stats.probplot(residuals, plot=ax, fit=True)

print(r**2)

Figure 92 Q-Q Plot residuals


The Q-Q plot compares the quantiles of the available data with the reference quantiles of a
particular distribution, in this case a normal distribution. The metric used is the correlation
coefficient, whose squared value is r². The closer it is to 1, the more the two distributions
resemble each other. In this case the value obtained is 0.957, very close to 1, so we can
say that the residuals follow a normal distribution.

In addition, it could be checked with the D'Agostino test.

# D'Agostino Test

from scipy.stats import normaltest

normaltest(residuals)

# p-value = 0.808 >> 0.05 -> Does not reject H0=normal data

We get a p-value of 0.808, well above 0.05 so we don't reject H0 which says the data is
normal.

▪ Homoscedasticity:

The next point is to check the homoscedasticity of the residuals. This can be checked
with the Goldfeld-Quandt test:

# Goldfeld-Quandt test

name = ['F statistic', 'p-value']

test = sms.het_goldfeldquandt(model.resid, model.model.exog)

lzip(name, test)

# model.model.exog -> Input parameters

# model.model.endog -> Output parameters

# p-value = 0.78 >> 0.05 -> Cannot reject H0 = there is homoscedasticity


With this test we get a p-value of 0.78 >> 0.05, so we cannot reject H0, which states that there
is homoscedasticity.

Homoscedasticity can also be verified visually by plotting how the residuals are distributed
against the predictor values.

### Check Homocedasticity

fig, ax = plt.subplots(figsize=(6,2.5))

_ = ax.scatter(X_train, residuals)

Figure 93 Errors vs Predictor.

It is clear that there is no heteroscedasticity, but a potentially alarming situation also


appears; even if the linearity test between the data and the output has been passed, the
residuals do not have a random distribution, so we should look for some way to improve
the model.

▪ Autocorrelation residuals:

We can check whether or not autocorrelation exists within the data used by applying the
Durbin-Watson test to the residuals obtained:

### Autocorrelation residuals

from statsmodels.stats.stattools import durbin_watson


print(durbin_watson(residuals))

# Value close to 2 -> No correlation

# The value is bounded between 0 (max. positive correlation) and 4 (max. negative correlation)

Since the obtained value is 2.683 (the test gives values between 0 and 4), close to 2, it can
be concluded that there is no autocorrelation in the errors.

▪ Leverage:

We can also check whether there are any leverage points that greatly influence the generalization of the
model. This is done as follows:

from statsmodels.stats.outliers_influence import OLSInfluence

test_class = OLSInfluence(model)

test_class.dfbetas[:5,:]

DFBetas gives the difference in each estimated parameter with and without each influence
point. There is one DFBeta per data point and parameter, so with n observations and k variables
there are n*k DFBetas.

We can use 2, or 2/sqrt(n), as a threshold to decide which DFBetas are significant and therefore
have a lot of influence.

from statsmodels.graphics.regressionplots import plot_leverage_resid2, influence_plot

fig, ax = plt.subplots(figsize=(8,6))

fig = plot_leverage_resid2(model, ax = ax)

influence_plot(model)


Figure 34 Influence diagram.

Figure 35 Influential leverage points.

We have to be careful with observations that have both high leverage and large residuals, since
they are the ones that influence the model the most.

There is nothing in the upper right quadrant of the influence diagram, but in the squared-residuals
plot it can be seen that observation no. 7 (x ≈ 10.5) significantly influences the generality
of the model. Because this is an example problem, we will leave it as it is.


This would create the predictions of the model and visualize the results obtained:

# Model predictions

y_pred = model.predict(X_test)

Evaluating the model with this test data can be done as follows:

# Assessment metrics

from sklearn.metrics import mean_squared_error, r2_score

r2 = r2_score(y_test, y_pred)

mse = mean_squared_error(y_test, y_pred)

print("r2: ", r2, "mse: ", mse)

r2: 0.8111 mse: 158298192.7350

R² reflects the explanatory power of the model: the closer it is to 1, the better the model manages
to explain the data. It takes values between 0 and 1; 0.8111 is a good result, meaning that 81.1%
of the variability is explained.

# Visualizing the Training results

plt.scatter(X_train, y_train, color='red')

plt.plot(X_train, model.predict(X_train), color='blue')

plt.title('Salary vs Experience (Training Set)')

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

plt.show()


Figure 36 Regression line in relation to training data.

# Visualizing the Test results

plt.scatter(X_test, y_test, color='red')

plt.plot(X_train, model.predict(X_train), color='blue')

plt.title('Salary vs Experience (Test Set)')

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

plt.show()

Figure 37 Regression line in relation to test data.


We could try to improve the model by adding a constant (intercept) together with the predictor, so
that the equation becomes:

y = b0 + b1 × x
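
A minimal sketch of how the constant is added with statsmodels (reusing the X_train and y_train variables of this example): sm.add_constant prepends the column of ones that represents b0:

import statsmodels.api as sm

# Add the intercept column and refit the model
X_train_const = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_const).fit()
model.summary()

# When predicting, X_test must be extended in the same way: model.predict(sm.add_constant(X_test))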

By repeating the above steps, we get the following:

▪ Summary:

Figure 98

This shows that the constant is significant for the model, as its p-value does not exceed 0.05. In
addition, the AIC and BIC metrics obtained are lower than in the case without the constant, thus
improving on the previous model.


▪ Linearity:

With the Harvey-Collier test we get a p-value of 0.8325, so the linear relationship with the output
is even clearer once the constant has been introduced.

▪ Normal error:

Figure 38 Histogram residuals.

Figure 100 Q-Q Plot residuals.

The r² value obtained from the Q-Q plot is 0.9596, and the p-value of the D'Agostino test is 0.260,
so the residuals still follow a normal distribution.


▪ Homoscedasticity of residuals:

Figure 101 Residuals vs Predictor.

With the Goldfeld-Quandt test we get a p-value of 0.7461, so the residuals are homoscedastic. In
addition, in the plot of the residuals against the predictor values it can be seen that they are clearly
randomly distributed and there is no relationship between them.

▪ Autocorrelation of residuals:

The Durbin-Watson test gives 2.683, so there is still no autocorrelation in the residuals
obtained.

▪ Leverage:

In this case the influence diagram and the leverage points are:

Figure 39 Influence diagram.


Figure 40 Leverage points.

This shows that the leverage point that existed in the previous case has disappeared, so the model
generalizes better.

The evaluation of the model gives:

r2: 0.9749 mse: 21026037.3295

Here we can see how much the model's explanatory power has improved (it is now 97.49%).

The visualization on the test data would look like this:

Figure 104


3.8 Example 2

For this case, an example of multiple regression will be presented using the Boston Housing dataset69.

To load this dataset, we would do the following:

import numpy as np

import pandas as pd

import statsmodels.api as sm

from sklearn import datasets # datasets available at sklearn

data = datasets.load_boston() # load dataset BOSTON

# Dataset info

print(data.DESCR)

# Independent variables (input)

df = pd.DataFrame(data.data, columns=data.feature_names)

The variables that appear in this dataset are:

▪ y - MEDV: median value of owner-occupied homes in $1000s.

▪ X1 - CRIM: per capita crime rate by city.

▪ X2 - ZN: the proportion of residential land in a given area.

▪ X3 - INDUS: the proportion of acres dedicated to retail business in the city.

▪ X4 - CHAS: binary variable (= 1 if the tract bounds the river; 0 otherwise).

▪ X5 - NOX: concentration of nitric oxide (parts per million).

69 Source: https://www.kaggle.com/c/boston-housing


▪ X6 - RM: average number of rooms per dwelling.

▪ X7 - AGE: proportion of owner-occupied buildings built before 1940.

▪ X8 - DIS: weighted distance to five employment centers in Boston.

▪ X9 - RAD: index of accessibility to radial highways.

▪ X10 - TAX: the full property-tax rate per $10,000.

▪ X11 - PTRATIO: the student-teacher ratio by city.

▪ X12 - B: value defined as 1000(Bk - 0.63)², where Bk is the proportion of African Americans
in the city.

▪ X13 - LSTAT: percentage of lower-status population.

If all variables are used to build the model:

# Output variable

target = pd.DataFrame(data.target, columns=["MEDV"])

y = target ["MEDV"]

### Model with all variables

#X = df[["RM", "LSTAT"]]

X = df # all variables

# Train/Test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


As already indicated, the input variables must be independent of each other, and that is why it is
important to see the correlation between them:

# Correlation for all data

import seaborn as sns

correlation_matrix = df.corr().round(2)

sns.heatmap(data=correlation_matrix, annot=True)

Figure 105

We can see that variables such as NOX and DIS have a high correlation (0.77) between
them, so both of them would not need to appear in the model.

Viewing the model summary:

model = sm.OLS(y, X).fit()

predictions = model.predict(X)

model.summary()


Figure 106

It can be observed that the p-values of the INDUS, NOX and AGE variables are clearly greater than
0.05, so H0 is not rejected for them and they therefore probably do not influence the output variable.
For this reason, they are discarded and the model is built with all the other variables:

# INDUS, NOX, AGE are discarded

X = df.copy()

X.drop(['INDUS', 'NOX', 'AGE'], axis=1, inplace=True)

# Train/Test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,


random_state = 0)


model = sm.OLS(y, X).fit()

predictions = model.predict(X)

model.summary()

Figure 107

Apparently, all the variables are now significant. If the predictions were to be obtained at this
point (pending the verification of the linear regression hypotheses), the following would be
done:

# Predictions


y_pred = model.predict(X_test)

# Evaluation metrics

from sklearn.metrics import mean_squared_error, r2_score

r2 = r2_score(y_test, y_pred)

mse = mean_squared_error(y_test, y_pred)  # mean squared error (MSE)

print("r2: ", r2, "mse: ", mse)

This results in a model that, as such, does not generalize too well. We should investigate further
the effect of the variables on the model to assess which ones are really relevant.

The same model can be built using the scikit-learn library as follows:

### Model with scikit-learn

# Feature scaling

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)

X_test = sc_X.transform(X_test)

sc_y = StandardScaler()

y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))

# Model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X_train, y_train)

# Prediction of values


y_pred = regressor.predict(X_test)

y_pred = sc_y.inverse_transform(y_pred)

y_test = y_test.values.reshape(-1, 1)

# Evaluation metrics

from sklearn.metrics import mean_squared_error, r2_score

r2 = r2_score(y_test, y_pred)

mse = mean_squared_error(y_test, y_pred)  # mean squared error (MSE)

print("r2: ", r2, "mse: ", mse)

Feature scaling

Many algorithms use distance measures (e.g., Euclidean) to construct the relationships
between the variables used. This is why having values on very different scales is problematic:
a variable such as price, with values like 100000, 120000..., would influence the model more
than another variable such as age, with values like 15, 27, 32...

For this reason, it is recommended to use functions that scale the variables to the same range
of values in order to avoid the problem described above.

For algorithms such as linear regression this is not strictly necessary, since the regression
coefficients themselves will correct the possible extra influence due to the scale of the values, but it is
still recommended for very large datasets, to save computational cost and speed up the
training of the models (since smaller values are handled).
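As a minimal sketch of what such scaling does (the toy values below are hypothetical and only illustrate the idea), standardization transforms each column to zero mean and unit variance, while min-max scaling maps each column to a fixed range:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical toy data: a price-like column and an age-like column on very different scales
X = np.array([[100000, 15],
              [120000, 27],
              [150000, 32]], dtype=float)

# Standardization: z = (x - mean) / std, column by column
print(StandardScaler().fit_transform(X))

# Min-max scaling: maps every column to the [0, 1] range
print(MinMaxScaler().fit_transform(X))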

3.9 Example 3

In this case, an example using logistic regression is presented on a dataset of customer purchases,
where we have different features for each customer and we want to predict whether or not they
will buy a particular product based on that information about them.70

70 Source: https://www.kaggle.com/rakeshrau/social-network-ads


Customer data is:

▪ 'User ID': One ID for each customer.

▪ 'Gender': The gender of the client (Male/Female).

▪ 'Age': The age of the client (integer numerical variable).

▪ 'EstimatedSalary': Approximate salary of the client.

And the variable to predict is:

▪ 'Purchased': Its value is 0 if they haven't bought or 1 if they have.

A logistic regression model is proposed that allows customers to be classified into those two
categories (0 = has not purchased, 1 = has purchased) based on their profile data. As an example, this
exercise will be performed using only Age and Salary, so that the data can also be visualized.

# Libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

# Dataset

dataset = pd.read_csv('Social_Network_Ads.csv')

X = dataset.iloc[:, [2,3]].values # Predicting only Age and EstimatedSalary

y = dataset.iloc[:, 4].values

# Train/Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,


random_state = 0)

# Feature scaling


from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)

X_test = sc_X.transform(X_test)

# Model training

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)

classifier.fit(X_train, y_train)

The following results are obtained:

# Predictions

y_pred = classifier.predict(X_test)

# Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

The performance of the model can be evaluated with the confusion matrix. Without going
into details about it now (it will be covered later in the course), it is enough to know
that the points on the diagonal of the matrix are those the model has classified correctly,
while the others are the points that have not been classified well. It can be seen that the model
correctly classifies almost all cases.
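As a small complementary sketch (not part of the original code), the overall accuracy can be read directly from that matrix, since the correctly classified points lie on its diagonal:

import numpy as np
from sklearn.metrics import accuracy_score

# Proportion of correctly classified points: diagonal of the confusion matrix over the total
accuracy = np.trace(cm) / cm.sum()

# Equivalent computation using scikit-learn directly
print(accuracy, accuracy_score(y_test, y_pred))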


Figure 41 Confusion matrix


4. SUPPORT VECTOR MACHINES (SVM)

4.1 Description

Support Vector Machine (SVM) classifiers are classification models that attempt to
define a decision boundary for a binary problem based not on all the points (as was done in the linear
regression problem) but specifically on particularly relevant points called support vectors.

Linear SVM classifiers

Hard-margin

The simplest situation would be to use an SVM to obtain the decision boundary of a classification
problem that was linearly separable. It is interesting to start from this case as it will allow us to
understand the fundamental concepts of this algorithm, which will be necessary to understand
later the most complex applications.

For this case we could have, for example:

Figure 42 Linearly separable dataset71

71 Source: https://en.m.wikipedia.org/wiki/Support_vector_machine


This problem can be seen to be linearly separable, although there are potentially infinitely many
decision boundary lines that could be used to perform the separation:

Figure 43 Example of several possible decision boundaries. H1 is a bad boundary because it does not classify the
points well, but which one should be chosen between H2 and H3?

The proposal is to choose as decision boundary the one that is as far away as possible from the data of
the two classes. It should not be very close to either of the two sets in particular, in order to
minimize the number of misclassified points (if it is very close to either group, poorly classified
data is more likely).


If the points to be classified belong to two classes (yi = 1, yi = -1):

Figure 44 Display of the optimal separating hyperplane as a function of the support vectors. This hyperplane
maximizes the distance m between the planes associated with the support vectors, obtaining the value of m from the
coefficients of the plane thanks to the formula for the distance from a point to a plane, computed with respect to the
coordinate origin. The distance m is the sum of the distance from the origin to the plane $\vec{w}\cdot\vec{x} - b = 1$ plus the
distance to the plane $\vec{w}\cdot\vec{x} - b = -1$, thus obtaining the value that appears in the image.

Thus, a plane is defined between the points so that the margin $\frac{2}{||\vec{w}||}$ to the closest points of both
classes (in the drawing above, the points that fall on the dashed lines) is maximized. These points
are the so-called support vectors and will actually be the only points the algorithm needs in order
to train and obtain the decision boundary.

Thus, for a dataset $\{x_1, x_2, \dots, x_n\}$ and class labels $y_i \in \{1, -1\}$ for those x-values, we would have:

▪ $\vec{w}\cdot\vec{x}_i - b \geq 1$ if $y_i = 1$

▪ $\vec{w}\cdot\vec{x}_i - b \leq -1$ if $y_i = -1$

Or, equivalently:

$y_i(\vec{w}\cdot\vec{x}_i - b) \geq 1$, with $y_i \in \{1, -1\}$


That equation represents the correct classification of the points with respect to the support vectors
(that is, it is the condition a point must satisfy for its classification to be correct).

Therefore, there is a quantity to optimize (the distance to the support vectors) and a constraint
(the equation satisfied by the correctly classified points), so that we:
▪ Maximize $\frac{2}{||\vec{w}||}$ ~ Minimize72 $\frac{1}{2}||\vec{w}||^2$

▪ Subject to $y_i(\vec{w}\cdot\vec{x}_i - b) \geq 1$, with $y_i \in \{1, -1\}$

This optimization problem can be solved by constructing the Lagrangian function of the problem
and differentiating it with respect to the two parameters to be adjusted: the constant b and the
hyperplane coefficients w. For more details on how this problem is solved and which values are
obtained for b and w, we recommend reading [5].

In short, the Lagrangian function defines the following optimization problem (known as the
primal problem). Taking partial derivatives with respect to w and b, we reach the following
equations (with l the number of training points):
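The equations referenced here correspond to figures that do not survive in this copy; as a hedged reconstruction, the standard hard-margin formulation that the text describes is:

$L(\vec{w}, b, \alpha) = \frac{1}{2}||\vec{w}||^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(\vec{w}\cdot\vec{x}_i - b) - 1 \right], \quad \alpha_i \geq 0$

$\frac{\partial L}{\partial \vec{w}} = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^{l} \alpha_i y_i \vec{x}_i \qquad\qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$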

72 This conversion can be proposed since one of the problems is totally equivalent to the other because the
functional to maximize/minimize is a concave/convex optimization problem that has a single solution. This
transformation is done to simplify subsequent calculations.


These equations can be substituted into the original minimization expression to obtain the following
function (called the dual problem):

Figure 45 Substitution in the original equations73

Figure 113

In this way, the dependencies on w and b disappear. This is done because this maximization
problem is easier to solve than the original one, since the only things multiplying the
values of the α parameters are the scalar products of the different data vectors. These scalar
products will in most cases be 0 or irrelevant values, except for the support vectors, where the
73 Source: http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf


value will be maximal relative to the other points, so in practice a reduction in dimensionality
is being achieved. From this equation the maximum is obtained by differentiating with respect to
the different values of α and setting the derivatives equal to 0, so that the optimal hyperplane is
then known, since:

New points arriving at the system would be classified using the hyperplane equation with
the coefficients already calculated.

Soft-margin

These linear SVM classifiers can be extended to solve some nonlinear classification cases by
defining a classification boundary that has a certain tolerance for misclassifying some of the
points that are not linearly separable.

Figure 46 Example where certain misclassified points are accepted in order to apply SVM linear separation to a
nonlinear border problem [6]

In this case, the optimization problem posed in the first case adds an error term ξ that is also to
be minimized in order to have the best possible decision boundary. Therefore, the problem can now
be expressed as follows:


▪ $\vec{w}\cdot\vec{x}_i + b \geq 1 - \xi_i$ if $y_i = 1$

▪ $\vec{w}\cdot\vec{x}_i + b \leq -1 + \xi_i$ if $y_i = -1$

▪ with slack variables $\xi_i \geq 0$

Thus, we seek to:

▪ Minimize $\frac{1}{2}||\vec{w}||^2 + C \sum_{i=1}^{n}\xi_i$, with C a constant that weights the error term against the
margin-maximization term.

▪ Subject to $y_i(\vec{w}\cdot\vec{x}_i - b) \geq 1 - \xi_i$, with $y_i \in \{1, -1\}$, $\xi_i \geq 0$.

Solving this problem again involves using the Lagrangian function and taking partial derivatives
with respect to w and b, but now also with respect to ξ. Again, we refer to [5] for a deeper look at
the mathematical resolution of this problem.

Nonlinear SVM classifiers

The models seen so far apply only to linear classification and have limited applicability to
nonlinear cases where some margin of error is tolerated. However, SVM models become much
more powerful for nonlinear classification when combined with the so-called kernel trick.

Kernel Trick

As already noted in the chapter on linear regression models, on numerous occasions it is
necessary to use nonlinear models to define an appropriate separation boundary of the space
and thus be able to address complex classification problems.


Figure 47 Example of a nonlinear classification border for situations where points cannot be separated with a linear
border74.

These models are based on the intuitive idea that, although it is not possible to separate complex
spaces like the previous one with linear boundaries, it is possible to separate them with linear
boundaries if the points are mapped into a vector space of higher dimension.

This idea can be understood by the following example:

Figure 117

74 Source: http://37steps.com/faq/faq-prtools-disrep/


If we wanted to separate the previous point space (1D) using a constant value (which would be
the "separating function" in 1D), we can see that there is no single point in space that separates
the red dots from the green dots.

Figure 118

However, if a higher dimension were used, a function could be found such that particular
values correspond to the green class and others to the red class. This could be done with a
parabolic function, as shown in the figure below.

Figure 119


For 2D (2 variables) it may happen that, within the same plane, no line can be found that
separates the set; nevertheless, in a higher dimension (3D) the space can be separated by the
corresponding linear equation (in this case a plane). This is what the following image
illustrates.

Figure 48 When moving from the input space to a higher dimension space, the data is separable by a plane. 75

Figure 49 Another example of a nonlinear border obtained as a linear border in a higher dimension space [6]

75 Source: https://prateekvjoshi.com/2012/08/24/support-vector-machines/


Figure 50 When moving from input space to feature space, the points become able to be separated linearly.76

The idea is therefore to transform the input space into a feature space:

▪ Starting from a dataset $\{(x_i, y_i)\}_{i=1}^{n}$

▪ Input space: the space of the different $x_i$

▪ Feature space: the space of the $\phi(x_i)$

In this way, since we would have a hyperplane separating the space when we move from the original
points x to the points $\phi(x)$, we could proceed to fit that hyperplane on those new points
in the same way as it was fitted in the hard-margin or soft-margin cases. However, this is costly to
execute, as all the training data would have to be transformed to a higher dimension, on which the
support vectors would then be obtained and the optimal hyperplane fitted. To avoid having to compute
$\phi(x)$ (and thus avoid actually transforming the space to a higher dimension), a kernel function K
is used, applied wherever the data appears as scalar products; this is the so-called kernel trick.

76 Source: https://en.wikipedia.org/wiki/Automated_machine_learning#/media/File:Kernel_Machine.svg


In this way, in the equation that had to be optimized to calculate the optimal hyperplane:

it is not necessary as such to move from the values of x to those of $\phi(x)$, since the only things of
interest are the scalar products, which ultimately measure the similarity between two
vectors.

This measure of similarity is what the kernel function computes directly, and the Lagrangian is
thus expressed as:

There are different kernel functions to carry out this transformation, the most commonly used
being:

▪ Linear kernel: $K(x, z) = x^T z$

▪ Polynomial kernel: $K(x, z) = (c + x^T z)^d$

▪ Radial (RBF77) kernel: $K(x, z) = e^{-\gamma ||x - z||^2}$

▪ Sigmoid kernel: $K(x, z) = \tanh(\gamma x^T z + r)$

For example, with a quadratic polynomial kernel (degree 2) we would move from a dimension-2
system to a dimension-3 system.

77 RBF: Radial Basis Function


$K(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = \left[x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right] \begin{bmatrix} z_1^2 \\ z_2^2 \\ \sqrt{2}\,z_1 z_2 \end{bmatrix} = \phi(x)^T \phi(z)$

This builds a space in which the extra dimension is an artificial variable that really depends on
the values of the other variables, but which makes that decomposition of the space easier.
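A quick numerical check of this identity (a sketch with arbitrary example vectors) shows that the kernel evaluated in the original 2D space matches the scalar product of the explicitly mapped 3D vectors:

import numpy as np

def phi(v):
    # Explicit degree-2 polynomial feature map for 2D inputs
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

k_direct = (x @ z) ** 2     # kernel evaluated in the input space
k_mapped = phi(x) @ phi(z)  # scalar product in the feature space

print(k_direct, k_mapped)   # both print the same value (16.0)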

Therefore, taking all of the above into account, the idea is to transform, first of all, the input
space into a feature space that is as linearly separable as possible, and then to apply on it some
of the techniques seen before, such as soft margin or hard margin, to find the optimal separation
boundary.

Suppose we have the following point space, with the decision boundary indicated there:

Figure 51 Nonlinearly separable dataset78

The first approach to resolving this could be to use a high-degree polynomial function to define a
complex nonlinear decision boundary.

78 Source: https://rasbt.github.io/mlxtend/user_guide/feature_extraction/RBFKernelPCA/


However, solving such an equation, as such, is an extremely complex and computationally
expensive process. Therefore, instead of solving that problem, features are defined as follows:

▪ $f_1 = x_1,\; f_2 = x_2,\; f_3 = x_1 x_2,\; f_4 = x_1^2,\; f_5 = x_2^2,\ \dots$

With which we would have:

With this method, instead of performing an operation as expensive as fitting a high-degree
equation (whose partial derivatives would be complex equations to solve), the proposal is to study
the similarity with respect to specific points called landmarks. On the one hand, a number m of
variables is used, depending on the landmarks that are defined; on the other hand, instead of
actually projecting into a higher-dimensional space, only measures of similarity between points and
landmarks (depending on the defined kernel) are analyzed, thus avoiding the actual projection
into a higher-dimensional space.

Example:

If, for example, we had 3 points as landmarks, l(1), l(2), l(3), the similarity of a particular point x with
respect to them can be defined as:

▪ f1 = similarity (x, l(1))

▪ f2 = similarity (x, l(2))

▪ f3 = similarity (x, l(3))

The measure of similarity is calculated with one of the previously defined kernel functions, such
as the RBF:

▪ Similarity$(x, l) = K(x, l) = e^{-\gamma||x - l||^2} = e^{-\frac{\sum_{j=1}^{n}(x_j - l_j)^2}{2\sigma^2}}$


In cases where the point x is close to the landmark:

▪ $f_1 \approx e^0 \approx 1$

On the other hand, if the point is far from the landmark:

▪ $f_1 \to 0$

In this way, each landmark defines features that apply to the different values of x and that will
classify them according to the measure of similarity, which for the RBF case depends on the
parameter $\sigma^2$ defined; $\sigma^2$ determines how wide or narrow the kernel is.
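As an illustrative sketch (the landmark coordinates and the γ values below are arbitrary), the similarity features of a point with respect to three landmarks can be computed as follows; note how a larger γ (a narrower kernel, i.e., a smaller σ²) makes the similarity fall off faster with distance:

import numpy as np

def rbf_similarity(x, landmark, gamma):
    # Gaussian / RBF kernel: exp(-gamma * ||x - l||^2)
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

landmarks = [np.array([1.0, 1.0]), np.array([4.0, 4.0]), np.array([0.0, 5.0])]
x = np.array([1.2, 0.9])  # a point close to the first landmark

for gamma in (0.1, 1.0, 10.0):
    feats = [rbf_similarity(x, l, gamma) for l in landmarks]
    print(gamma, np.round(feats, 4))  # f1 stays close to 1, f2 and f3 shrink towards 0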

Thus, by way of example and for these same three landmarks, we could have the following
decision boundary for a given set of parameters:79

Figure 52 x1, x2, x3 are the points to be classified. l(i) are the different landmarks.

79 Source: http://mlwiki.org/index.php/Support_Vector_Machines


Given the 3 points shown in the image above, they would be classified into one of the categories
(0/1) using the following function with the example parameters indicated above:

▪ For point x1:

This would predict that this point belongs to class y = 1.

▪ For point x3:

This would predict that this point belongs to class y = 0.

Generically, a series of m landmarks will be defined, with which we will have:

$\sum_{j=1}^{m} \theta_j \, K(l_j, x_i) \geq 0 \;\rightarrow\; x_i \text{ is assigned to class } y_i = 1$

$\sum_{j=1}^{m} \theta_j \, K(l_j, x_i) \leq 0 \;\rightarrow\; x_i \text{ is assigned to class } y_i = 0$


And so, depending on the coefficients that relate the different landmarks and the similarity
obtained (depending on the kernel used) between a specific point and the existing landmarks,
that point will be classified into one of the two available categories.

Training these models consists of determining the values of the different coefficients θ that weight
the different similarity terms.

Figure 53 Example of different decision-making boundaries depending on the kernels used80

SVM for regression: SVR

The SVM model can be adapted to perform regression tasks. To do this, the model is fitted like a
linear function, as follows:

80 Source: https://www.quora.com/Classification-machine-learning-How-do-you-determine-which-type-of-
kernel-to-use-in-SVM


Figure 54 Representation of an SVR setting81

Thus, an attempt is made to fit a linear equation, but allowing for a certain tolerance
ε, so that data inside that tubular region is not considered as support vectors; in this way, the
number of data points used as support vectors is reduced.

Otherwise, the adjustment of the line is done similarly to how the hyperplanes were adjusted in
the case of classification.

We can also add the concept of soft margin that already appeared in the classification problems,
as indicated in the following image:

Figure 129

81 Source: http://www.saedsayad.com/support_vector_machine_reg.htm


Alternatively, and generally speaking, it can help to see it from the usual perspective of
regression problems: in the case of SVR it is a matter of fitting a regression plane that
minimizes a cost function, as seen in the following image.

Figure 36 Cost function to minimize for SVR case including regularization term (optional).82 f(xi,w) represents the
kernel function used.

SVM classifiers for multiclass problems

All of the above classification cases have been presented for binary classification, but SVM
algorithms can also be used for the multiclass case.
There are two main ways to do this:

One-against-one:

▪ Several classifiers are defined, each of which receives a sample of data from two
classes of the training set and thus learns to distinguish between those two classes.
In the end, a voting scheme is applied among all classifiers for each new data point received, so
that it is classified according to the most voted class among all the defined binary classifiers.

▪ For example, we have 3 binary classifiers, the first distinguishes between the Child and
Adult classes, the second between Adult and Elder and the third between Child and
Elderly. When a new data is received, the first classifier could say that it is a Child, the
second that is an Adult, and the third that is a Child. In this way, Child is the class with the
most votes and the new data would be classified within it.

82 Source: https://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf


Figure 37 One-Against-One Scheme83

One-against-the-rest:

▪ A single classifier is trained for each of the classes.

For example, if the classes are Child, Adult, Elder, one classifier would distinguish between
two classes, Child and the rest (Adult + Elder), grouping all data not belonging to its class
into a single "rest" class.

▪ The class with the highest probability among all classifiers is chosen (a code sketch of both
strategies appears after the figure below).

83 Source: https://towardsdatascience.com/multi-class-classification-one-vs-all-one-vs-one-94daed32a87b


Figure 132 One-Against-The-Rest84
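Both strategies can be reproduced with scikit-learn's multiclass wrappers around a binary SVM; the sketch below uses the iris dataset simply as a convenient 3-class example (note that SVC already applies a one-vs-one scheme internally for multiclass data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One-against-one: one binary SVM per pair of classes, combined by voting
ovo = OneVsOneClassifier(SVC(kernel='rbf', random_state=0)).fit(X_tr, y_tr)

# One-against-the-rest: one binary SVM per class against all the others
ovr = OneVsRestClassifier(SVC(kernel='rbf', random_state=0)).fit(X_tr, y_tr)

print("one-vs-one accuracy:", ovo.score(X_te, y_te))
print("one-vs-rest accuracy:", ovr.score(X_te, y_te))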

Final notes

Model parameters

In summary, the input parameters of an SVM model for classification are:

▪ The value of C, which weights the error term in the equation to be minimized. This parameter
C is the inverse of the regularization term, so that $C = \frac{1}{\lambda}$.

o A high value of C reduces bias, increases variance, and makes the model more
prone to overfitting.

o A low C value increases bias, reduces variance, and makes the model more prone
to underfitting.

▪ If, for example, an RBF kernel is used, an additional parameter appears that must be predefined
in the model, $\sigma^2$:

o A high $\sigma^2$ value causes the features $f_i$ to vary more smoothly, and therefore there is
greater bias and less variance.

84 Source: https://towardsdatascience.com/multi-class-classification-one-vs-all-one-vs-one-94daed32a87b


o A low $\sigma^2$ value causes the features $f_i$ to vary more abruptly, and therefore there is less bias
and greater variance.

Advantages

▪ There are no local minima, as it is a convex problem.

▪ They scale well with high-dimensional data, as the model always depends on the support vectors
defined.

▪ We can trade off the complexity/effectiveness of the classifier or regressor by controlling the
C and epsilon parameters.

▪ Similarly, we can control overfitting.

▪ The results obtained are robust.

Figure 55 Example of models with overfitting, underfitting, and acceptable fit.85

Disadvantages

▪ It is not easy to find the optimal combination of C and epsilon (a cross-validated grid search, sketched below, is a common way to approach this).

▪ Sometimes it is not trivial to determine the best transformation of the data.
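Regarding the first disadvantage, the sketch below shows such a cross-validated grid search; it assumes the scaled X_train and y_train from the example in the next section, and the parameter grid is purely illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Example grid; the ranges are illustrative, not a recommendation
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.01, 0.1, 1],
    'kernel': ['rbf'],
}

search = GridSearchCV(SVC(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # X_train / y_train as prepared in section 4.2

print(search.best_params_, search.best_score_)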

85 Source: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-
how-to-deal-with-it-6803a989c76


4.2 Example 1

For this example, we will use the same dataset used for classifying customer purchases based on
their user profile parameters.86

First, a binary classification model (Purchased-0/1) is constructed for the variables X0-Age and
X1-EstimatedSalary using a linear kernel.

# Libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

import math

# Dataset

dataset = pd.read_csv('Social_Network_Ads.csv')

X = dataset.iloc[:, [2, 3]].values

y = dataset.iloc[:, 4].values

# Train/Test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,


random_state = 0)

# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

86 Source: https://www.kaggle.com/rakeshrau/social-network-ads


X_test = sc.transform(X_test)

######### Linear Kernel

# Model training - Kernel Linear

from sklearn.svm import SVC

classifier = SVC(kernel = 'linear', random_state = 0)

classifier.fit(X_train, y_train)

# Predicting the Test set results

y_pred = classifier.predict(X_test)

# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

##### Visualization

# Conversion to series and define target values

y_training = pd.Series(y_train)

X_training = X_train

target_names=['0','1']

# Define the limits of the visualization

lbX1=math.floor(min(X_training[:,0]))-1

ubX1=math.ceil(max(X_training[:,0]))+1

lbX2=math.floor(min(X_training[:,1]))-1

ubX2=math.ceil(max(X_training[:,1]))+1

[lbX1,ubX1,lbX2,ubX2]


# Label indices

idxPlus=y_training[y_training==0].index

idxMin=y_training[y_training==1].index

# Representation of the values of each class with a different color b / r

plt.scatter(X_training[idxPlus,0],X_training[idxPlus,1],c='b',s=50)

plt.scatter(X_training[idxMin,0],X_training[idxMin,1],c='r',s=50)

plt.legend(target_names,loc=2)

X, Y = np.mgrid[lbX1:ubX1:100j, lbX2:ubX2:100j]  # meshgrid; 100x100 matrix with divisions between the defined limits

Z = classifier.decision_function(np.c_[X.ravel(), Y.ravel()])  # decision function evaluated at the meshgrid points

Z = Z.reshape(X.shape) # Define the border as another 100x100 meshgrid

plt.contourf(X,Y,Z > 0,alpha=0.4) # Outline drawing

plt.contour(X,Y,Z,colors=['k'], linestyles=['-'],levels=[0])

plt.xlabel("Age")

plt.ylabel("EstimatedSalary")

plt.title('Linear Kernel (train)')


The visualizations obtained (on training data) and confusion matrix are as follows:

Figure 56 Graph with linear kernel decision border.

Figure 57 Confusion matrix for linear kernel case.

It can be seen that some of the points are not classified correctly. The next step is to try to improve
the model using a kernel that allows us to obtain a nonlinear decision boundary.

# Model training - Kernel RBF

classifier = SVC(kernel = 'rbf', random_state = 0)

classifier.fit(X_train, y_train)


For this case, the visualization on the same training data is:

Figure 58 Decision border for an RBF kernel.

And the confusion matrix about the test data is:

Figure 59 Confusion matrix for an RBF kernel.

Thus, it can be seen that the RBF boundary improves the classification power of the model.

4.3 Example 2

In this case, a regression model is proposed using SVR. For this example, in order to help
visualize the regression fits, we will use the problem described on the scikit-learn page
[7].


This problem generates random data, already scaled, and constructs the regression line for 3
different kernels: one linear, one degree-2 polynomial, and one RBF. The result obtained is the one
shown on that page:

Figure 60 Different regression equations depending on the kernel used87

87 Source: http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
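A hedged sketch along the lines of that scikit-learn example (random 1D data and three SVR kernels; the hyperparameter values are illustrative) would be:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR

# Random 1D regression data with a sinusoidal shape and some noisy targets
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(8))  # add noise to every 5th target

models = {
    'RBF': SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1),
    'Linear': SVR(kernel='linear', C=100),
    'Polynomial (deg 2)': SVR(kernel='poly', C=100, degree=2, coef0=1),
}

plt.scatter(X, y, color='k', s=10, label='data')
for name, model in models.items():
    plt.plot(X, model.fit(X, y).predict(X), label=name)
plt.legend()
plt.show()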


5. DECISION TREES

5.1 Initial intuition

Decision trees are another family of ML models that can be applied to classification and regression problems.
In the classification problem, the main idea of these models is to define a system of rules inferred
from the training data and to classify, according to those rules, the new data that reaches the system.

Figure 38 Example of simple classification with a decision tree that seeks to classify whether a user will leave the
company (Yes) or not (No). The rules shown are inferred from the training data, so if, for example, we want to
know whether a 22-year-old user from Spain will leave the company, they will be classified in the Yes category,
whereas a 53-year-old will not leave. On the other hand, a client from Portugal, regardless of their age, would
be classified as going to leave.

This classification idea can be moved to a graphical visualization of the space of points where
decision boundaries are defined according to the rules that have been inferred.


Figure 39 Decision boundaries on the data map based on the above rules. Points in the red areas are classified as
No and points in the green areas as Yes.

The regression problem starts, as usual, from the same idea as the classification problem, except that
instead of assigning each point a class based on the region of the space in which it falls, it is assigned
the mean or median of the values of the data contained in that region.

Figure 40 Simple example of a regression problem to calculate the price of a house based on its size and age. The
inferred value is calculated as the mean of the training values of the quadrant where the new point falls. If a new
point had a size between 75 and 110 m2 and an age between 5 and 15 years, it would be assigned a price of
937500 euros.


Thus, these algorithms decide, based on their training data, where to perform the divisions of the
data, so that the number of correctly classified training points is optimized (the detailed explanation of
how this is done will be addressed in the next point).

We will have to be careful, however, not to make too many divisions, since that is one of the factors
that can cause more overfitting.

Figure 41 Another example with more rules and decision boundaries, where training points are classified within the 3
classes listed there (red, blue, green) for the two variables indicated on the axes.88 Defining decision boundaries
this small in order to classify all points, as in this case where all training points have been correctly classified, can be
one of the causes of subsequent overfitting. It can clearly be seen that the decision boundaries are nonlinear.

88 Source: http://stephanie-w.github.io/brainscribble/classification-algorithms-on-iris-dataset.html


A final detail is to comment on the nomenclature used in these systems, using the following
example as a reference:

Figure 42 Example of a decision tree for classifying whether or not to play tennis based on different features. The features
represent the available weather information.

The main elements of the tree are:

▪ Inner nodes: Attributes/features used belonging to the training data. For example, in the
above case, sky is an attribute (how the sky is) or humidity (what is the humidity of the
environment).

▪ Arcs: Possible values of the source node. For example, the sky can be sunny, cloudy or
rainy.

▪ Leaves: Classification values in the different classes of the problem. In this case it would
be to play (yes) or not play (no).

▪ Rules: All of the above would determine classification rules for different data. For the
previous case the rule system would be as follows.


(Sky = Sunny ∧ Humidity = High → Play_Tennis = No)

∨ (Sky = Sunny ∧ Humidity = Normal → Play_Tennis = Yes)

∨ (Sky = Cloudy → Play_Tennis = No)

∨ (Sky = Rainy ∧ Wind = Strong → Play_Tennis = No)

∨ (Sky = Rainy ∧ Wind = Weak → Play_Tennis = Yes)

5.2 Description

This section details more precisely how the divisions of the space are created and, thus, how the
decision boundaries are defined.

One of the algorithms commonly used to build decision trees is the ID3 algorithm (Iterative
Dichotomiser 3). The pseudocode of this algorithm is shown below for a case of two categories
of output labels (+/-):

ID3 (Train Data, Target/Output Variable, Features of the Input Variables):

1. A root node is created.

2. If all the data is of class +, a single node labeled with the + class is returned.

3. If all the data is of class -, a single node labeled with the - class is returned.

4. If the list of features of the input data is empty, a node labeled with the most frequent class
among the output values of that training set is returned.

5. Otherwise:

5.1. The attribute A that best classifies the training data for that node is chosen.

5.2. A tree is created with a node = A (labeled with A).

5.3. For each possible value vi of A:

5.3.1. A new arc is added under the node, labeled with vi.

5.3.2. Data_Train(vi) is the subset of data that has the value vi in attribute A.

5.3.3. If Data_Train(vi) == Empty:

▪ A node labeled with the most common class among the training data is placed under
this arc.


5.3.4. If Data_Train(vi) != Empty:

▪ Under this arc an ID3 sub-tree is created: ID3(Data_Train(vi), Target, Features − {A}).

5.3.5. End.

5.3.6. Return to the root node.

Thus, in general, the idea is to choose at each step the feature A to use to partition the space,
until a terminal node is reached where all the data belongs to the same category.

Figure 43 ID3 application example. We start by choosing an attribute A for the root node (Country of Origin) and a new
ID3 sub-tree is created from it using another attribute (Age in this case). In this new ID3, since the two possible values
(>25 years and <25 years) result in Data_Train subsets containing categories of only one class each, those categories
are assigned to those leaves, and we return to the root node to continue with the other nodes. The algorithm has decided
to use those features for the partitions and not to use the Sex feature.


In this way, a fundamental point in the algorithm is step 5.1, that is, choosing the attribute that best
classifies the training data. This can be done with several criteria, but one of the best known is
the concept of information entropy.

Entropy and Information

The main idea is that the attributes with the highest information gain will be chosen as nodes,
since they are the ones that best classify the available training points.

Figure 144 In this image, X1 would be chosen before X2 as the variable to partition on, specifically splitting at that
particular value (a), as it is the division that best separates the points of each category into each zone. If X1 were a
categorical variable, a would be one of the existing categories; if it were numeric, a would be whatever value the
algorithm chooses that best separates the two categories of data.


Figure 45 A second partition over the previous space would find the value X2 = b as the ideal value to best separate
the points of each category within the zones already separated.

How to quantify these separations is measured by the entropy formula:

$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$

With n the number of existing classes and $p(x_i)$ the probability of belonging to each of them.

Figure 46 For this example the calculation of entropy would be:


$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -0.5 \times \log_2(0.5) - 0.5 \times \log_2(0.5) = 1$

Figure 47 In this case the entropy would be:

$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -1 \times \log_2(1) - 0 \times \log_2(0) = 0$

Entropy takes values between 1 (maximum entropy, the case of maximum uncertainty in a two-class
problem) and 0 (minimum entropy, the case of maximum information). The best separation of the
space is the one that yields, after splitting the data, the lowest entropy.

Based on entropy, the terms expected entropy and expected information gain are defined:

▪ The resulting expected entropy after using a feature A in the tree is:

$H_{expected} = \sum_{v \,\in\, Values(A)} \frac{|D_v|}{|D|} \times H(D_v)$

▪ The expected information gain after using a feature A in the tree is:


$Gain(D, A) = H(D) - \sum_{v \,\in\, Values(A)} \frac{|D_v|}{|D|} \times H(D_v)$

▪ With $D_v$ the subset of samples of D that take the value v in feature A. That is, the
expected entropy is calculated from the entropy of the points in each subset $D_v$
and the categories to which they belong, and this result is then weighted by the size
of the subset relative to the original dataset. The gain, in turn, is calculated as the
entropy of the original set minus the new (expected) entropy of the subsets $D_v$.
In this way, the partition based on the feature A that produces the greatest gain
(that is, the lowest expected entropy) is the one the system chooses for the partition
mentioned in point 5.3 of the ID3 algorithm. A code sketch of these two formulas follows.
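These formulas translate almost literally into code; the sketch below (using a hypothetical binary-labelled toy dataset, not the exact table of the next example) computes the entropy of a set of labels and the expected information gain of splitting on one feature:

import numpy as np

def entropy(labels):
    # H(X) = -sum p(x_i) * log2 p(x_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    # Gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)
    expected = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        expected += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - expected

# Hypothetical toy example: 14 Yes/No labels and one categorical feature
y = np.array(['No'] * 5 + ['Yes'] * 9)
sky = np.array(['Sunny'] * 5 + ['Cloudy'] * 4 + ['Rainy'] * 5)

print(entropy(y))                # entropy of the full label set
print(information_gain(y, sky))  # gain of splitting on the 'sky' feature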

Example

The following example proposes a decision system for classifying whether or not to play
tennis (binary classification) based on 4 input variables (Sky, Temperature, Humidity, Wind), which
are categorical variables with 2 or 3 labels each, as the case may be. It starts from
a historical dataset with the weather observed on 14 past days, data that will be used to train the
model. The data was obtained from [8].

Figure 48 Training dataset in which the 4 variables have an associated category of Play Tennis Yes (1) or No (0).


Thus, the first step taken by the ID3 algorithm is to choose, as the variable for its first node, the one
that gives the greatest information gain with respect to the entropy of the model's starting data (the
14 points with their different classes; in this case 5 data points of No and 9 of Yes). For each of the
possible variables we can see the gain that would result when comparing that initial entropy with
the entropy obtained after separating the data according to the values v that the variable A can
take.

Figure 150 The image shows the resulting separations for the three variables, along with the information gains obtained.
It can be seen that the best case is that of the Sky variable, which provides the greatest information gain. The
first split of the model is therefore based on that variable and its values.

The next step is, if the data left in the split can still be separated (that is, the dataset is not
empty and does not belong to a single class), to create an ID3 sub-tree hanging
from that node, which makes a new split considering whichever of the remaining variables
best divides the space of that data.


Figure 49 The space of the data belonging to the Sky = Sunny category is better separated with the
Humidity variable, as it gives a greater information gain. Because Sky has already been used at a higher level of the tree,
that variable is not reused in the lower splits.

Once the splits have been completed because a complete separation of the existing
categories has been reached with the values of the variable used, the algorithm returns to the root node
to explore the partitions of the space using the remaining unexplored variables.

Figure 50 Since Sky = Cloudy ends up with all its data classified within the same class (terminal node), the partition
for Sky = Rainy is explored using the other three available variables: using, again, Humidity for that case, or using the
other two variables, Temperature and Wind. We can see that Wind is the one that gives the highest gain and that, in
addition, it leads directly to terminal nodes by perfectly classifying the classes of that subset.


After these iterations of the ID3 algorithm, we would have a fully formed decision tree.

Illustration 51 Final Decision Tree

Advantages

▪ Decision trees allow us to define nonlinear boundaries, which are very useful for
classification or regression problems where the optimal decision boundaries are complex.

▪ We do not need to scale the input variables.

Disadvantages

▪ By not re-checking decisions made in a previous step, this algorithm is likely to
end up trapped in local optima; that is, in a solution that, although locally optimal, may be far
from the solution that is best for the data available globally.

▪ Another danger in these algorithms is that they are susceptible to overfitting.
Overfitting occurs when we have a solution h that classifies the training data better
than another solution h' but classifies new data worse. Overfitting can be caused by:

o Some of the input data being mislabeled.

o Having trained the system with small datasets.


o Having tried to classify everything exhaustively based on the existing attributes (that is, not
leaving any data unclassified within its class). Forcing the algorithm in this way can
cause boundaries to be defined based on attributes and threshold values that
might seem relevant within the training data but are then not relevant in general.

Model improvements: tree pruning

Some of the problems described above can be solved using some improvement techniques in
this ID3 algorithm (which is still the simplest case of decision tree), one of them being the
technique of pruning trees.

The tree pruning technique is intended to provide a retrospective examination of the decisions made
by ID3, to assess when they actually provide a decision boundary that generalizes and
when they do not. To do this, the following is done:

Tree pruning algorithm:

1. The input dataset is divided into train and test.

2. CurrentTree ← tree chosen by ID3 using the train data.

3. Average ← proportion of test data that CurrentTree classifies correctly.

4. Continue ← True.

5. While Continue:

1) For each inner node N in the tree:

i. The tree is temporarily pruned at node N, replacing it with a terminal node

(leaf) labeled with the majority classification for that node.

ii. With this modified tree, measure the proportion of test data that is correctly
classified → Average1.

2) With K the node whose pruning produces the best results:

i. If Average1 (score of the tree pruned at K) > Average (score of the current tree) →
CurrentTree ← tree pruned at K, and Average is updated accordingly.

ii. Otherwise → Continue ← False.

6. Return CurrentTree.


In this way, the aim is to simplify the models (Occam's razor principle) so that they
generalize better. As a general criterion, simpler models will always be preferred over more
complex ones.
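scikit-learn does not implement this exact test-set pruning loop, but it offers a related post-pruning mechanism (cost-complexity pruning via the ccp_alpha parameter) in the same spirit of preferring simpler trees; a hedged sketch using a built-in dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (alphas) computed from the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = tree.score(X_te, y_te)
    if score >= best_score:  # prefer the most pruned tree among equally good ones
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)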

5.3 Random forest

So far, everything seen about decision trees involves building a single tree using an algorithm such as ID3.
Can the proposed solution be improved further? The answer is yes, and for this there are
improvements such as the one provided by the random forest algorithm.

The intuitive idea behind it is to use several decision trees together to try to alleviate the effect of
local optima, overfitting, and so on. If there are multiple models trained on the original data, the
combined predictions of all of them will predictably be more accurate than those made by a single
model, which will always be more likely to be wrong for the reasons already stated.

These models are a type of model called ensemble learning because they consist of combinations
of several different independent models (a point will be dedicated later to treating such models in
more depth).

Thus, put simply, each individually trained tree classifies a new data point into one of the
possible categories, and all those individual predictions are combined to give a percentage of
membership in each of the different classes. For instance, if we are using 3 trees for classes 0/1 and 2
of them classify the point in category 0 and one in category 1, the system prediction would be 0 → 66%,
1 → 33%, and we could choose the class with the highest score as the label of that data point.

In the case of regression, the approach is similar: each tree gives a prediction and all the predictions
are combined by computing their mean (or better, their median, to avoid outliers) to give the final
prediction.


Illustration 52 Scheme of a Random Forest89

In general, the pseudocode that these systems follow is:

1. The number of trees to use is defined.

2. For each tree:

▪ A random subset of k training data points is taken.90

▪ A decision tree is built with those k data points.

3. For each new point (prediction), the individual prediction of each tree is taken and the
class percentages (classification) or the mean/median of the values (regression) is obtained.

89Source: https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/
90 The idea of taking random subsets of the data is to explore the space of valid solutions as much as
possible in order to, for example, avoid local optima and obtain the best possible generalization.


5.4 Ensemble learning

Random forest belongs to the family of techniques known as bagging, where several independent models
are trained in parallel. Another alternative is to use techniques called gradient boosting, where multiple
models are trained in series: a first model is trained on the data, and subsequent models try to
minimize the error of the previous ones. Within these algorithms there are
examples such as AdaBoost, XGBoost, GBM, LightGBM or CatBoost.

These models are generally very effective, although they sometimes contain many
hyperparameters that are not trivial to tune.
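As a hedged illustration of the boosting family (using scikit-learn's built-in implementation rather than the external libraries just mentioned), a gradient boosting classifier can be trained on synthetic data as follows:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The n_estimators trees are fitted sequentially, each one on the errors of the previous ones
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)

print("test accuracy:", clf.score(X_te, y_te))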

Illustration 155 Bagging vs Boosting91

91 Source: https://www.pluralsight.com/guides/ensemble-methods:-bagging-versus-boosting


5.5 Example 1: Decision tree

This example will use the same dataset as in the previous examples to predict whether or not
a customer will purchase the product based on a series of parameters. For this first example, and
also to allow comparison with the other models, the Age and EstimatedSalary variables will be used.

# Libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

import math

# Dataset

dataset = pd.read_csv('Social_Network_Ads.csv')

X = dataset.iloc[:, [2, 3]].values

y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30,


random_state = 0)

These algorithms, as already mentioned, do not need the input variables to be scaled, since they do not
calculate their decision boundaries based on distances (e.g., Euclidean, Manhattan...).
However, in order to speed up computation, it is often good practice to reduce the numerical
magnitude of the input variables, which is why it is done next.


# Feature Scaling -> Not necessary as it is not based on Euclidean distances

# but can be used to speed up calculations

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Once this is done, the model is trained:

# Model training

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion = 'entropy',   # split criterion used
                                    max_leaf_nodes = None,   # maximum number of leaf nodes, if desired
                                    min_samples_split = 2,   # minimum number of samples required to do a split
                                    max_features = None,     # limit on the number of features used, if desired
                                    random_state = 0)

classifier.fit(X_train, y_train)

Within the model, it is interesting to look at the hyperparameters used; among them we can
define the criterion for making the splits (in this case the entropy criterion has been used, because
it is the one explained in the theory, but there are other widely used ones, such as Gini), the
maximum number of terminal leaves to create (None sets no limit, the default), the minimum
number of samples required to perform a split (2 by default), and the number of features used in
the model.

Once the model has been trained, we can proceed to make predictions:


# Prediction

y_pred = classifier.predict(X_test)

# Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

Illustration 53 Confusion Matrix

And with this we can visualize the decision boundaries92:

### Visualization

def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):

ax = ax or plt.gca()

# Plot the points

ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,

clim=(y.min(), y.max()), zorder=3)

ax.axis('tight')

ax.axis('off')

xlim = ax.get_xlim()

92 Display function obtained from: https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html


ylim = ax.get_ylim()

xx, yy = np.meshgrid(np.linspace(*xlim, num=200),

np.linspace(*ylim, num=200))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Create a color plot with the results

n_classes = len(np.unique(y))

contours = ax.contourf(xx, yy, Z, alpha=0.3,

levels=np.arange(n_classes + 1) - 0.5,

cmap=cmap, clim=(y.min(), y.max()),

zorder=1)

ax.set(xlim=xlim, ylim=ylim)

plt.title('Decision Tree Classifier')

plt.show()

# Train

visualize_classifier(classifier, X_train, y_train)

# Test

visualize_classifier(classifier, X_test, y_test)


Illustration 54 Decision Borders for Train

Illustration 55 Decision Frontiers for Testing


5.6 Example 2: Random forest

The same can be analyzed for a random forest model. The code, analogous to the previous one,
would be in this case:

# Libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

import math

# Dataset

dataset = pd.read_csv('Social_Network_Ads.csv')

X = dataset.iloc[:, [2, 3]].values

y = dataset.iloc[:, 4].values

# Train/Test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30,


random_state = 0)

# Feature Scaling -> Not necessary as it is not based on Euclidean distances

# but can be used to speed up calculations

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)


# Model training

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators = 500,      # number of trees in the forest
                                    criterion = 'entropy',   # split criterion used
                                    max_leaf_nodes = None,   # maximum number of leaf nodes per tree, if desired
                                    min_samples_split = 2,   # minimum number of samples required to do a split
                                    max_features = None,     # limit on the number of features used, if desired
                                    random_state = 0)

classifier.fit(X_train, y_train)

# Prediction

y_pred = classifier.predict(X_test)

# Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

# Feature importance

importance = classifier.feature_importances_

print(importance)


Figure 56 It can be seen that the results improve slightly compared to the decision tree (2 more data points are correctly classified)

Illustration 57 Decision Borders on Train Data

Figure 161 Decision frontier on test data

One of the main differences, as seen in the code, is the definition of the number of decision
trees used in the random forest (in this case 500 have been chosen).
