Mahdi Louati
3 GLID
September 19th, 2022
Contents
0. Welcome to Machine Learning
0.1. Why Machine Learning is the Future
0.2. What is Machine Learning
0.3. Installing Python and Anaconda
1. Data Preprocessing
1.1. Importing the Libraries
1.2. Importing the Dataset
1.3. Missing Data
1.4. Categorical Data
1.5. Training Set and Test Set
1.6. Feature Scaling
2. Regression Models
2.1. Simple Linear Regression (SLR)
2.2. Multiple Linear Regression (MLR)
2.3. Polynomial Regression
2.4. Support Vector Regression (SVR)
2.5. Decision Tree Regression
2.6. Random Forest Regression
2.7. Evaluating Regression Models
01 Data Preprocessing
1.1. Importing the Libraries
1.2. Importing the Dataset
1.3. Missing Data
1.4. Categorical Data
1.5. Training Set and Test Set
1.6. Feature Scaling
How do you prepare your dataset so that your future Machine Learning model learns in the best conditions?
Categorical Data: encoding the variables (Nominal Variable, Ordinal Variable)
Training Set and Test Set
Feature Scaling: put the variables under the same scale
Spyder temp.py file: open a new file and delete the temp.py file
Save it in the same folder as the dataset. Remove the part containing the date, the author…
Put a hash sign (#) in front of the title of this part (it is a comment: it can not be executed)
Pandas is one of the most popular Python Libraries for Data Science. It is the “SQL of Python.”
Why?
Pandas helps you to manage two-dimensional (or more) data tables in Python.
To import the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset=pd.read_csv('Data.csv')
Insert the name of the data file between the quotes
For an Excel file:
dataset = pd.read_excel("test1.xls")
Execute this line of code
See the ‘variable explorer’
Double-click on dataset
We get the dataset
Country   Age   Salary   Purchased
France    44    72000    No
Spain     27    48000    Yes
Germany   30    54000    No
Spain     38    61000    No
Germany   50    83000    No
France    37    67000    Yes
Predict whether the customer purchases the product or not?
The Business Scenario
The company has the information to know whether the customer has bought the product or not
The company tries to establish correlations between the Country, the Age, the Salary and the decision to buy the product or not
The Independent Variables are used to predict the client’s decision: they are the predictive variables
Choose all the lines and all the columns except the last one
X=dataset.iloc[:,0:3]
(0:2 would stop before the column Salary: the upper bound of the slice is excluded)
Create the vector of the Dependent Variable
y=dataset.iloc[:,3]
or, equivalently,
y=dataset.iloc[:,-1]
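The selection of X and y can be tried without the original file. The sketch below is only an illustration: it assumes a small in-memory CSV (the name csv_text is hypothetical) holding the six complete rows shown in the table above, and it adds `.values` so that X and y become NumPy arrays that accept indexing such as X[:,0].

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv: the six complete rows shown on the slide.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,50,83000,No
France,37,67000,Yes
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# All the lines, all the columns except the last one (Independent Variables).
X = dataset.iloc[:, :-1].values
# All the lines, last column only (Dependent Variable).
y = dataset.iloc[:, -1].values

print(X.shape)
print(y.shape)
```

Writing `:-1` instead of `0:3` keeps the selection valid if columns are added before the last one.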
If the dataset isn’t Gaussian or has several outliers, use the median instead of the mean
Example: if the Age is between 30 and 50, with three persons who are 80 years old, use the median
Create an object ‘imputer’ of the Imputer class (module sklearn.preprocessing in scikit-learn versions before 0.22) that allows us to replace the missing data by the mean
The class Imputer admits some parameters. With axis=0, the missing data are replaced by the mean of the column
imputer=Imputer(missing_values='NaN',strategy='mean',axis=0)
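The Imputer class was removed in scikit-learn 0.22. A minimal sketch of the same step with its replacement, SimpleImputer, is given below. The small matrix X_num is illustrative and is not the course dataset; SimpleImputer always works column by column, so there is no axis argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative Age/Salary matrix with one missing value in each column.
X_num = np.array([[44.0, 72000.0],
                  [27.0, 48000.0],
                  [np.nan, 52000.0],
                  [40.0, np.nan]])

# Replace each missing value by the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imputer.fit_transform(X_num)

print(X_filled)
```

Passing strategy='median' instead covers the case of a non-Gaussian dataset or a dataset with outliers.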
DUMMY Encoding
Country   France   Germany   Spain
France    1        0         0
Spain     0        0         1
Germany   0        1         0
Spain     0        0         1
Germany   0        1         0
France    1        0         0
Spain     0        0         1
France    1        0         0
Germany   0        1         0
France    1        0         0
France 0
Germany 1
Spain 2
France < Germany < Spain
Country   Encode
France    0
Spain     2
Germany   1
Spain     2
Germany   1
France    0
This is not good because in our equations there will be this order relation which will be taken into account, and this will cause a bias because in reality this relation is false
There are no parameters
labelencoder=LabelEncoder()
This object is created for the two variables Country and Purchased and we start with the Country
labelencoder_X=LabelEncoder()
We fit the object to the column that we will transform to recover the information (as we proceeded for the object imputer)
We take all the lines and the desired column of the matrix X, which is indexed 0 (i.e., X[:,0])
X[:,0] = labelencoder_X.fit_transform(X[:,0])
We use the method fit_transform (i.e., at the same time, we fit the object labelencoder_X to the column Country and then we transform it into the numerical values 0, 1 and 2)
from sklearn.preprocessing import LabelEncoder
labelEncoder_X=LabelEncoder()
X.iloc[:,0]=labelEncoder_X.fit_transform(X.iloc[:,0])
The class LabelEncoder transforms the text France, Germany and Spain into the numerical values 0, 1 and 2
We must consider this step because, in older versions of scikit-learn, the class OneHotEncoder can not be used directly on text
Country   Age       Salary
0         44.0      72000.0
2         27.0      48000.0
1         30.0      54000.0
2         38.0      61000.0
1         40.0      63777.7778
0         35.0      58000.0
2         38.7778   52000.0
0         48.0      79000.0
1         50.0      83000.0
0         37.0      67000.0
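The mapping produced by LabelEncoder can be checked on the Country column alone. This is a small sketch on the first six countries of the dataset; the array name countries is introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First six values of the column Country.
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France'])

labelencoder_X = LabelEncoder()
codes = labelencoder_X.fit_transform(countries)

# classes_ lists the categories in sorted order; the code of a category is its position.
print(labelencoder_X.classes_)
print(codes)
```

The codes follow the alphabetical order of the categories, which is where the artificial relation France < Germany < Spain comes from.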
onehotencoder=OneHotEncoder()
The argument categorical_features will actually contain the index of the column that we will encode as a DUMMY variable. In fact we only have the first column to encode, so we put 0 between square brackets
onehotencoder=OneHotEncoder(categorical_features=[0])
Now, we will make the connection between the matrix X and the object onehotencoder and for this purpose, we take the first column of X (i.e., X[:,0])
We use the method ‘fit_transform’ which fits our object onehotencoder to X by taking only the first column (since we specified the index 0 of X) and then, we will transform it
We create the three columns, one for each country
X=onehotencoder.fit_transform(X).toarray()
We add toarray to specify that we want to get the result as an array
# Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Older versions of scikit-learn (categorical_features was removed in version 0.22):
# labelEncoder_X=LabelEncoder()
# X[:,0]=labelEncoder_X.fit_transform(X[:,0])
# onehotencoder=OneHotEncoder(categorical_features=[0])
# X=onehotencoder.fit_transform(X).toarray()
# Recent versions of scikit-learn (use one approach or the other, not both):
ct = ColumnTransformer([("State", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
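The ColumnTransformer approach can be tried on a few rows. The sketch below uses three rows of the dataset; the transformer name 'country', the argument sparse_threshold=0 (which forces a dense result) and the final conversion to float are choices made for this illustration, not part of the original code.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows of the matrix X: Country, Age, Salary.
X = np.array([['France', 44.0, 72000.0],
              ['Spain', 27.0, 48000.0],
              ['Germany', 30.0, 54000.0]], dtype=object)

# One-hot encode column 0 and pass the other columns through unchanged.
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)

print(X_encoded)
```

The three DUMMY columns come first, in the order France, Germany, Spain, followed by Age and Salary.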
Country   France   Germany   Spain   Age     Salary
France    1.00     0.00      0.00    44.00   72000.00
Spain     0.00     0.00      1.00    27.00   48000.00
Germany   0.00     1.00      0.00    30.00   54000.00
Spain     0.00     0.00      1.00    38.00   61000.00
Germany   0.00     1.00      0.00    40.00   63777.77
France    1.00     0.00      0.00    35.00   58000.00
Spain     0.00     0.00      1.00    38.77   52000.00
France    1.00     0.00      0.00    48.00   79000.00
Germany   0.00     1.00      0.00    50.00   83000.00
France    1.00     0.00      0.00    37.00   67000.00
Now, the matrix X is ready to be integrated into equations to define some models of machine learning
We transform the modalities of the DV y into numerical values using the class LabelEncoder
labelencoder_y=LabelEncoder()
It is a new object created for the Dependent Variable y
y=labelencoder_y.fit_transform(y)
Purchased   Purchased (encoded)
No          0
Yes         1
No          0
No          0
Yes         1
Yes         1
No          0
Yes         1
No          0
Yes         1
1.5. Splitting the Dataset into Training Set and Test Set
    Country   Age   Salary   Purchased
0   France    44    72000    No
1   Spain     27    48000    Yes
2   Germany   30    54000    No
Why?
When we build a model of Machine Learning, we look at CP(training set) and CP(test set)
If CP(training set) and CP(test set) are not similar (CP(training set) > CP(test set)), there was an over-learning (overfitting) on the training set (i.e., on correlations specific to the training set)
The function train_test_split splits the data into four entities:
X_train: The matrix of the Independent Variables of the training set
X_test: The matrix of the Independent Variables of the test set
y_train: The vector of the Dependent Variable of the training set
y_test: The vector of the Dependent Variable of the test set
To obtain these four entities, we write
X_train,X_test,y_train,y_test=train_test_split()
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
from model_selection import train_test_split
It doesn’t work! We forgot the Library scikit-learn:
from sklearn.model_selection import train_test_split
We add a last argument, random_state, to fix the choice of the training set and the test set (the same split is obtained at each execution)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=0)
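The effect of test_size and random_state can be checked on placeholder data. In the sketch below, X is a hypothetical matrix of 10 observations (it is not the encoded course matrix) and y is the encoded column Purchased; the split is run twice to show that the same random_state gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder matrix: 10 observations, 2 features.
X = np.arange(20).reshape(10, 2)
# Encoded column Purchased (No = 0, Yes = 1).
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Same arguments, same random_state: the split is identical.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_again))
```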
Initial Data
Country (France, Germany, Spain)   Age      Salary      Purchased
1 0 0                              44       72000       No
0 0 1                              27       48000       Yes
0 1 0                              30       54000       No
0 0 1                              38       61000       No
0 1 0                              40       63777.778   Yes
1 0 0                              35       58000       Yes
0 0 1                              38.778   52000       No
1 0 0                              48       79000       Yes
0 1 0                              50       83000       No
1 0 0                              37       67000       Yes

X_train and y_train
Country (France, Germany, Spain)   Age       Salary      Purchased
0 1 0                              40        63777.778   1
1 0 0                              37        67000       1
0 0 1                              27        48000       1
0 0 1                              38.7778   52000       0
1 0 0                              48        79000       1
0 0 1                              38        61000       0
1 0 0                              44        72000       0
1 0 0                              35        58000       1

X_test and y_test
Country (France, Germany, Spain)   Age   Salary   Purchased
0 1 0                              30    54000    0
0 1 0                              50    83000    0
The training set is composed of the matrix of the Independent Variables X_train and the vector of the Dependent Variable y_train. It comprises 8 observations, which represent 80% of the dataset
Our model of Machine Learning will learn the correlations on the training set in order to make predictions on the behavior of the client (whether he will buy the product or not).
We will predict whether the two clients composing the test set have bought the product or not. If, for example, the model predicts that these two clients did not buy the product, then we will have an accuracy of 100%
Here we propose a small dataset with 10 observations in order to understand how it works. In reality, we will work with much larger datasets. We will have more than 8 observations in the training set and more than 2 observations in the test set.
1.6. Feature Scaling
Final step in the Data Preprocessing
It consists of putting the variables on the same scale
Why?
So that one variable does not crush the others in the Machine Learning model
In our dataset, we see that the IV Age does not have the same scale at all as the IV Salary: the Age takes values between 27 and 50 and the Salary takes values between 48000 and 83000.
It is not the same scale.
The Salary can dominate and even crush the variable Age. The latter will then not be taken into account in the model, although it may have an impact on the DV Purchased.
Country (France, Germany, Spain)   Age      Salary
1 0 0                              44       72000
0 0 1                              27       48000
0 1 0                              30       54000
0 0 1                              38       61000
0 1 0                              40       63777.778
1 0 0                              35       58000
0 0 1                              38.778   52000
1 0 0                              48       79000
0 1 0                              50       83000
1 0 0                              37       67000
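Age       Salary
44        72000
27        48000
30        54000
38        61000
40        63777.7778
35        58000
38.7778   52000
48        79000
50        83000
37        67000
Squared differences between the largest and the smallest values:
(83000 - 48000)² = 1225 × 10⁶
(50 - 27)² = 529
The Salary term is several million times larger than the Age term.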
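The two squared differences can be recomputed directly; note that (50 - 27)² evaluates to 529. The sketch below compares the clients with the extreme values (Age 50, Salary 83000 and Age 27, Salary 48000) and shows how much of a squared Euclidean distance would come from the Salary alone. The use of a Euclidean distance is an assumption made for this illustration, and the names client_a and client_b are hypothetical.

```python
# Two observations of the dataset: (Age, Salary).
client_a = {"age": 50.0, "salary": 83000.0}
client_b = {"age": 27.0, "salary": 48000.0}

# Contribution of each variable to a squared Euclidean distance.
age_term = (client_a["age"] - client_b["age"]) ** 2
salary_term = (client_a["salary"] - client_b["salary"]) ** 2

# Share of the total that is due to the Salary.
salary_share = salary_term / (age_term + salary_term)

print(age_term, salary_term)
print(salary_share)
```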
For the feature scaling we will use the library scikit-learn, the module preprocessing and the class StandardScaler
Standardisation vs Normalisation
Standardisation: rescales the data to have a mean equal to 0 and a standard deviation equal to 1 (i.e., a reduced centered variable). However, we conserve the outliers
Normalisation: rescales the values into the range [0,1]. This might be useful in some cases where all values need the same positive scale. However, the outliers from the data are lost
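The two rescalings can be written in a few lines of NumPy. This is a sketch that assumes the usual formulas (subtract the mean and divide by the standard deviation for standardisation; subtract the minimum and divide by the range for normalisation), applied to six Age values of the dataset.

```python
import numpy as np

# Six Age values taken from the dataset.
age = np.array([44.0, 27.0, 30.0, 38.0, 50.0, 37.0])

# Standardisation: mean 0 and standard deviation 1.
standardised = (age - age.mean()) / age.std()

# Normalisation: values rescaled into the range [0, 1].
normalised = (age - age.min()) / (age.max() - age.min())

print(standardised)
print(normalised)
```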
sc=StandardScaler()
Here we have no arguments to enter, we keep the default values
We must now link our object to the data on which we want to apply feature scaling, that is the matrix of Independent Variables X
In the previous step, we created the training set and the test set
The feature scaling will be applied to the matrix of independent variables X_train (i.e., it is on X_train that our object ‘sc’ will be fitted)
We compute the mean and the standard deviation of each independent variable of X_train in order to use the standardisation method
Our object ‘sc’ will be linked to X_train
What about X_test?
The training set and the test set have a similar distribution
Mean(X_train) ≈ Mean(X_test) and Standard Deviation (X_train) ≈ Standard Deviation (X_test)
         Mean(X_train)   Mean(X_test)   Standard Deviation (X_train)   Standard Deviation (X_test)
Age      38.472          40             6.2085308246                   10
Salary   62597.22        68500          10193.183997162025             14500
The means are computed with
np.mean(X_train[:,3:5],axis=0)
or
np.mean(X_train[:,3]) and np.mean(X_train[:,4])
The standard deviations are computed with
np.sqrt(np.var(X_train[:,3]))
or
np.std(X_train[:,3]) and np.std(X_train[:,4])
For all the columns of X_train:
                      Country (France, Germany, Spain)                   Age            Salary
Means                 0.5   0.125   0.375                                38.472         62597.22
Standard Deviations   0.5   0.33071891388307384   0.4841229182759271     6.2085308246   10193.183997162025
Even the DUMMY variables are scaled
However, you can choose not to scale these variables. Anyway, they will be on a scale very close to that obtained by StandardScaler
The new scale of the Age and the new scale of the Salary are similar
On the new scale, almost all the values of each variable are in the range [-1,1]
There is no risk of one variable dominating the others in the models of ML
For X_test, neither the fit_transform method nor the fit method is used, because the training set and the test set have a similar distribution and our object ‘sc’ has already been fitted to the training set. So we will transform the test set directly, using the mean and the standard deviation of the training set
X_test=sc.transform(X_test)
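The two calls can be put side by side. The sketch below uses only the Age and Salary columns of the X_train and X_test tables given earlier in this section; the names X_train_scaled and X_test_scaled are introduced here so that the unscaled matrices stay available for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns of the training set (8 observations).
X_train = np.array([[40.0, 63777.778],
                    [37.0, 67000.0],
                    [27.0, 48000.0],
                    [38.7778, 52000.0],
                    [48.0, 79000.0],
                    [38.0, 61000.0],
                    [44.0, 72000.0],
                    [35.0, 58000.0]])
# Age and Salary columns of the test set (2 observations).
X_test = np.array([[30.0, 54000.0],
                   [50.0, 83000.0]])

sc = StandardScaler()
# fit_transform: learn the mean and standard deviation of X_train, then scale it.
X_train_scaled = sc.fit_transform(X_train)
# transform only: reuse the statistics learned on the training set.
X_test_scaled = sc.transform(X_test)

print(sc.mean_)
print(X_test_scaled)
```

Fitting on the training set only keeps the test set unseen until evaluation, which is the reason for calling transform rather than fit_transform on X_test.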
However, in the general case, if the Dependent Variable y takes very large values, we must also apply the feature scaling to y in order to put it on the same scale as the Independent Variables