Data Science Chapter 1

Data Science
Mahdi Louati
3 GLID
September 19th, 2022

Contents
0. Welcome to Machine Learning
   0.1. Why Machine Learning is the Future
   0.2. What is Machine Learning
   0.3. Installing Python and Anaconda
1. Data Preprocessing
   1.1. Importing the Libraries
   1.2. Importing the Dataset
   1.3. Missing Data
   1.4. Categorical Data
   1.5. Training Set and Test Set
   1.6. Feature Scaling
2. Regression Models
   2.1. Simple Linear Regression (SLR)
   2.2. Multiple Linear Regression (MLR)
   2.3. Polynomial Regression
   2.4. Support Vector Regression (SVR)
   2.5. Decision Tree Regression
   2.6. Random Forest Regression
   2.7. Evaluating Regression Models
3. Classification Models
   3.1. Logistic Regression
   3.2. K-Nearest Neighbors
   3.3. Support Vector Machine (SVM)
   3.4. Kernel SVM
   3.5. Naïve Bayes
   3.6. Decision Tree Classification
4. Clustering
   4.1. K-Means Clustering
   4.2. Hierarchical Clustering
5. Dimensionality Reduction
   5.1. Principal Component Analysis (PCA)
   5.2. Linear Discriminant Analysis (LDA)
   5.3. Kernel PCA
6. Reinforcement Learning
   6.1. Upper Confidence Bound (UCB)
   6.2. Thompson Sampling
7. Natural Language Processing (NLP)
8. Deep Learning
   8.1. Artificial Neural Networks
   8.2. Convolutional Neural Networks
Section 1
01 Data Preprocessing
1.1. Importing the Libraries
1.2. Importing the Dataset
1.3. Missing Data
1.4. Categorical Data
1.5. Training Set and Test Set
1.6. Feature Scaling
How do you prepare your dataset so that your future Machine Learning model learns in the best conditions?
1. Import the Libraries
2. Import the Dataset
3. Missing Data
4. Categorical Data: encoding the variables (nominal variable, ordinal variable)
5. Training Set and Test Set
6. Feature Scaling: put the variables on the same scale
In Spyder, a temp.py file is open by default: open a new file and delete the temp.py file.
Save the new file in the same folder as the dataset.
Remove the part containing the date, the author…
Put a hash sign (#) in front of the title of this part: it is a comment, so it is not executed.
# Importing the Libraries

1. Data Preprocessing
1.1. Importing the Libraries

NumPy is the fundamental package for scientific computing with Python. It extends the Python programming language with multidimensional arrays (including matrices) and the mathematical functions that operate on these arrays.
Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. Its module pyplot makes plotting easy by providing features to control line styles, font properties, axis formatting, etc.
Pandas is one of the most popular Python libraries for Data Science. It is the "SQL of Python". Why? Pandas helps you manage two-dimensional (or more) data tables in Python.
To import a library:
- Use the command import.
- Put the name of the library.
- Put a shortcut name for the library after the keyword as.
# Import the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
1.2. Importing the Dataset

To import the dataset, use the Pandas library through its shortcut name pd:
- Give the dataset a variable name.
- Use the import method 'read_csv' and insert the name of the data file.
- If you have an Excel file, use 'read_excel' instead.
dataset = pd.read_csv('Data.csv')
dataset = pd.read_excel("test1.xls")
Execute this line of code.
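As a minimal sketch of the import step, the snippet below reads the ten observations of this chapter from an in-memory CSV string, so it runs without Data.csv on disk. The use of io.StringIO and the name csv_text are assumptions made for illustration; with a real file, pd.read_csv('Data.csv') is enough.

```python
import io

import pandas as pd

# The ten observations of the chapter, written inline. Assumption: the real
# Data.csv has the same layout, with empty cells where data is missing.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts a file path or any file-like object
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)   # number of rows and columns
print(dataset.head())  # first five observations
```

Empty cells are read as NaN, which is how the two missing values appear in the variable explorer.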
Open the 'variable explorer' and double-click on dataset. We get the dataset:

Country  Age   Salary  Purchased
France   44    72000   No
Spain    27    48000   Yes
Germany  30    54000   No
Spain    38    61000   No
Germany  40    ------  Yes
France   35    58000   Yes
Spain    ----  52000   No
France   48    79000   Yes
Germany  50    83000   No
France   37    67000   Yes

Three Independent Variables (Country, Age and Salary).
One Dependent Variable (Purchased).
The goal: predict whether or not the customer purchases the product.

The Business Scenario
- The company has the information to know whether or not the customer has bought the product.
- The company tries to establish correlations between the Country, the Age, the Salary and the decision to buy the product or not.
- Independent Variables are used to predict the client's decision: they are the predictive variables.
- Dependent Variable: the decision of the customer, i.e., the variable to predict.
Create the matrix of the Independent Variables
It contains the values of the first three variables for all the rows.
We use a useful Pandas technique, 'iloc', and specify the indices to recover:
- first the observation rows: the indices of the rows of the dataset;
- then the observation columns: the indices of the columns of the dataset.
X=dataset.iloc[:,:-1]
Here the first ':' chooses all the rows and ':-1' chooses all the columns except the last one. An equivalent way to recover the first three columns is
X=dataset.iloc[:,0:3]
The upper bound of a slice is excluded, so X=dataset.iloc[:,0:2] would recover only the first two columns.
Create the vector of the Dependent Variable
It contains the response of the customer, 'Yes' or 'No', for each observation.
y=dataset.iloc[:,3]
or
y=dataset.iloc[:,-1]
To take only the values of the rows and columns as a NumPy array (for example if Y is a numerical variable), add .values:
y=dataset.iloc[:,-1].values
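The slicing rules of iloc can be checked on a small example. This is a sketch on the first three observations only, typed by hand; the names X_same and X_short are introduced here for illustration.

```python
import pandas as pd

# First three observations of the chapter's dataset
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1]        # all rows, every column except the last
X_same = dataset.iloc[:, 0:3]   # same selection: columns 0, 1 and 2
X_short = dataset.iloc[:, 0:2]  # only Country and Age (index 2 is excluded)
y = dataset.iloc[:, -1].values  # last column, as a NumPy array

print(list(X.columns))
print(list(X_short.columns))
print(y)
```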
1.3. Missing Values
The dataset contains two missing values (MD, for missing data):

   Country  Age    Salary  Purchased
0  France   44     72000   No
1  Spain    27     48000   Yes
2  Germany  30     54000   No
3  Spain    38     61000   No
4  Germany  40     ------  Yes
5  France   35     58000   Yes
6  Spain    -----  52000   No
7  France   48     79000   Yes
8  Germany  50     83000   No
9  France   37     67000   Yes

Which value replaces a missing one?
- The dataset is Gaussian distributed without outliers: use the mean of the desired column.
- The dataset isn't Gaussian or has several outliers: use the median of the desired column. Large outliers bias the data, so the mean is no longer meaningful or relevant.
- In the general case, we use the median.

Age: values between 27 and 50, normal distribution. Replace the MD by the mean:
MD(Age) = 38.777778
If the Age were between 30 and 50 with three persons who are 80 years old, we would use the median.

Salary: values between 48000 € and 83000 €, no outliers and Gaussian distributed. Replace the MD by the mean of the Salary column:
MD(Salary) = 63777.77778
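The two replacement values can be checked with Pandas, which skips NaN by default when computing statistics. This is a small sketch: the series below are typed by hand from the table, and the three 80-year-old customers form a hypothetical variant used only to show why the median is preferred when there are outliers.

```python
import numpy as np
import pandas as pd

age = pd.Series([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salary = pd.Series([72000, 48000, 54000, 61000, np.nan,
                    58000, 52000, 79000, 83000, 67000])

# mean() and median() ignore the missing values
mean_age, median_age = age.mean(), age.median()
mean_salary = salary.mean()

# Hypothetical variant: three 80-year-old customers added as outliers
age_outliers = pd.concat([age, pd.Series([80, 80, 80])], ignore_index=True)

print(round(mean_age, 3), median_age, round(mean_salary, 3))
print(round(age_outliers.mean(), 3), age_outliers.median())
```

With the three outliers, the mean of Age moves from about 38.8 to about 49.1 while the median only moves from 38 to 42.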
In older versions of the library scikit-learn, the module preprocessing provides the class Imputer:
from sklearn.preprocessing import Imputer
We create an object 'imputer' of the Imputer class that allows us to replace the missing data by the mean. The class Imputer admits some parameters:
- missing_values='NaN'
- strategy='mean'
- axis=0 to replace the missing data by the mean of the column, axis=1 to replace it by the mean of the row
imputer=Imputer(missing_values='NaN',strategy='mean',axis=0)
Recent versions of scikit-learn no longer include Imputer. It is replaced by the class SimpleImputer of the module impute, which works column by column:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#Count the number of missing values for each column
print(dataset.isnull().sum())
We fit the object 'imputer' to X. We take all the rows and choose the indices of the columns that contain the missing data:
imputer.fit(X.iloc[:,[1,2]])
or, equivalently,
imputer.fit(X.iloc[:,1:3])
We then use the method 'transform':
X.iloc[:,1:3]=imputer.transform(X.iloc[:,1:3])
In summary:
#Taking care of missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer=imputer.fit(X.iloc[:,1:3])
X.iloc[:,1:3]=imputer.transform(X.iloc[:,1:3])
Result:

   Country  Age     Salary
0  France   44      72000
1  Spain    27      48000
2  Germany  30      54000
3  Spain    38      61000
4  Germany  40      63777.778
5  France   35      58000
6  Spain    38.778  52000
7  France   48      79000
8  Germany  50      83000
9  France   37      67000
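The imputation step can be run end to end as follows. This is a minimal sketch assuming a recent scikit-learn with SimpleImputer; the DataFrame is typed by hand from the table above, and the columns are selected by name rather than by index, purely for readability.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Count the number of missing values for each column
print(X.isnull().sum())

# Replace each missing value by the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[["Age", "Salary"]] = imputer.fit_transform(X[["Age", "Salary"]])

print(X)
```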
1.4. Categorical Data
The modalities of the variable Country are not numerical: France, Germany, Spain. Country is a Categorical variable.
The variable Purchased is also Categorical, with two categories: Yes and No.
The variables Age and Salary are not Categorical.

Categorical variables
We have to manage the Categorical variables. Machine Learning models are based on mathematical equations: if we keep the texts France, Germany, Spain, Yes and No, we cannot use them in these equations.
- Encode the Categorical variable Country, written as text, in numeric values.
- Encode the Categorical variable Purchased: for example, No becomes 0 and Yes becomes 1.
Country is nominal: there is no order between the categories (France, Germany and Spain). We encode this variable by the method of the DUMMY variable.
In statistics, and particularly in regression analysis, a dummy variable (an indicator variable, Boolean indicator or binary variable) is one that takes the value 0 or 1 to indicate the absence or the presence of some categorical effect. It is used when there is no order relationship between the modalities of the variable to be encoded.

DUMMY encoding:

Country  France  Germany  Spain
France   1       0        0
Spain    0       0        1
Germany  0       1        0
Spain    0       0        1
Germany  0       1        0
France   1       0        0
Spain    0       0        1
France   1       0        0
Germany  0       1        0
France   1       0        0

Encoding the countries by a single number instead (France 0, Germany 1, Spain 2) implies France < Germany < Spain:

Country  Encode
France   0
Spain    2
Germany  1
Spain    2
Germany  1
France   0
Spain    2
France   0
Germany  1
France   0

This is not good: our equations would take this order relation into account, which causes a bias because in reality this relation is false.
We therefore create three columns (one column for each country) with ones and zeros as values. This is the DUMMY variable or 'One hot encoding'.
For the variable Purchased we encode directly (Yes 1 and No 0).
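The difference between the two encodings can be seen in a few lines. This sketch uses Pandas helpers (category codes and pd.get_dummies) rather than the scikit-learn classes used in this chapter, only because they make the comparison compact.

```python
import pandas as pd

country = pd.Series(["France", "Spain", "Germany", "Spain", "Germany",
                     "France", "Spain", "France", "Germany", "France"],
                    name="Country")

# Integer codes impose an artificial order: France (0) < Germany (1) < Spain (2)
codes = country.astype("category").cat.codes

# Dummy variables: one 0/1 column per country, no order implied
dummies = pd.get_dummies(country).astype(int)

print(codes.tolist())
print(dummies)
```

Each row of dummies contains exactly one 1, in the column of its country.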

We import the classes 'LabelEncoder' and 'OneHotEncoder' at the same time from the module preprocessing of the library scikit-learn:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
We create an object of each class, starting with LabelEncoder. It takes no parameters.
labelencoder=LabelEncoder()
One such object is created for each of the two variables Country and Purchased. We start with Country:
labelencoder_X=LabelEncoder()
We fit the object to the column that we will transform, to recover its information (as we did for the object imputer). We take all the rows and the column of X indexed 0. The method fit_transform fits the object labelencoder_X to the column Country and transforms it in a single call; its argument is the column of X to encode.
If X is a NumPy array:
X[:,0] = labelencoder_X.fit_transform(X[:,0])
If X is a Pandas DataFrame:
X.iloc[:,0]=labelencoder_X.fit_transform(X.iloc[:,0])
The class LabelEncoder transforms the texts France, Germany and Spain into the numerical values 0, 1 and 2:

Country  Age      Salary
0        44.0     72000.0
2        27.0     48000.0
1        30.0     54000.0
2        38.0     61000.0
1        40.0     63777.7778
0        35.0     58000.0
2        38.7778  52000.0
0        48.0     79000.0
1        50.0     83000.0
0        37.0     67000.0

In older versions of scikit-learn this step is required because the class OneHotEncoder cannot be used directly on text.
Now, we encode the column Country using the class OneHotEncoder. The variable Purchased is a DV: we encode it directly with a LabelEncoder and do not use the object onehotencoder for it.
onehotencoder=OneHotEncoder()
In older versions of scikit-learn, we add the argument 'categorical_features'. It contains the index of the column that we will encode as a DUMMY variable. We only have the first column to encode, so we put 0 between square brackets:
onehotencoder=OneHotEncoder(categorical_features=[0])
We then connect the matrix X and the object onehotencoder with the method 'fit_transform', which fits the object to X by taking only the first column (since we specified the index 0) and then transforms it. This creates three columns, one for each country. We add toarray() to turn the result into an array:
X=onehotencoder.fit_transform(X).toarray()
The argument categorical_features has been removed from recent versions of scikit-learn (0.22 and later); the column index is passed to a ColumnTransformer instead.
# Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Older scikit-learn versions (categorical_features was later removed):
# labelEncoder_X=LabelEncoder()
# X[:,0]=labelEncoder_X.fit_transform(X[:,0])
# onehotencoder=OneHotEncoder(categorical_features=[0])
# X=onehotencoder.fit_transform(X).toarray()
# Recent scikit-learn versions: one-hot encode column 0 and keep the other columns
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
Country  France  Germany  Spain  Age    Salary
France   1.00    0.00     0.00   44.00  72000.00
Spain    0.00    0.00     1.00   27.00  48000.00
Germany  0.00    1.00     0.00   30.00  54000.00
Spain    0.00    0.00     1.00   38.00  61000.00
Germany  0.00    1.00     0.00   40.00  63777.77
France   1.00    0.00     0.00   35.00  58000.00
Spain    0.00    0.00     1.00   38.77  52000.00
France   1.00    0.00     0.00   48.00  79000.00
Germany  0.00    1.00     0.00   50.00  83000.00
France   1.00    0.00     0.00   37.00  67000.00

Now, the matrix X is ready to be integrated into equations to define some models of machine learning.
We transform the modalities of the DV Y into numerical values using the class LabelEncoder. We create a new object for the Dependent Variable y:
labelencoder_y=LabelEncoder()
y=labelencoder_y.fit_transform(y)

Purchased  Purchased (encoded)
No         0
Yes        1
No         0
No         0
Yes        1
Yes        1
No         0
Yes        1
No         0
Yes        1
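The whole encoding step can be sketched with the current scikit-learn API as follows. The data is typed by hand with the imputed values from section 1.3. The argument sparse_threshold=0 is an addition made here so that the result is always a dense array, and the transformer name "country" is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 38.778, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 63777.778,
               58000, 52000, 79000, 83000, 67000],
})
y = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

# One-hot encode column 0 (Country); pass Age and Salary through unchanged
ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)

# Encode the dependent variable: No -> 0, Yes -> 1
y_encoded = LabelEncoder().fit_transform(y)

print(X_encoded[:3])
print(y_encoded)
```

The dummy columns come first (France, Germany, Spain, in alphabetical order), followed by the columns that were passed through.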
1.5. Splitting the Dataset into Training Set and Test Set
Why?
When we build a Machine Learning model:
- The model learns correlations between the Independent Variables and the Dependent Variable.
- This is done on a subset of the original dataset, which represents 70% to 80% of it: that is the training set.
- To verify that there is no over-learning (i.e., that the model did not simply memorize the correlations), we must have a test set.
- The test set is smaller than the training set: it represents 20% to 30% of the original dataset.
- The model must be tested on new observations (i.e., observations outside those on which it learned the correlations).
How does it work?
- We build our ML model on the training set (it learns the correlations).
- The model makes predictions on the same observations of the training set: for each observation, it predicts whether the corresponding client buys the product or not, so it predicts 1 or 0.
- We count the number of errors committed by the model.
- We divide the number of errors by the total number of observations of the training set, which gives the error rate. The Precision Coefficient CP(training set) is the complementary proportion of correct predictions.
- We make new predictions on the test set.
- We calculate the Precision Coefficient of the model on the test set, CP(test set).
- Finally, we compare the two precisions.
If CP(training set) and CP(test set) are similar (very close):
- The model was trained well, i.e., it established good correlations.
- The model generalizes well.
If CP(training set) and CP(test set) are not similar (CP(training set) > CP(test set)):
- There was over-learning on the training set (i.e., on the correlations of the training set).
- The model is unable to predict new observations correctly.
- The model memorized the correlations instead of learning them.
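The comparison of the two precision coefficients can be sketched as below. The helper precision_coefficient, the labels and predictions, and the 0.1 gap used to flag over-learning are all hypothetical and chosen only for illustration.

```python
import numpy as np


def precision_coefficient(y_true, y_pred):
    """Share of observations predicted correctly (1 - error rate)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    errors = np.sum(y_true != y_pred)
    return 1 - errors / len(y_true)


# Hypothetical labels and predictions, for illustration only
y_train = [1, 1, 1, 0, 1, 0, 0, 1]
y_train_pred = [1, 1, 1, 0, 1, 0, 0, 1]  # no error on the training set
y_test = [0, 0]
y_test_pred = [1, 0]                     # one error on the test set

cp_train = precision_coefficient(y_train, y_train_pred)
cp_test = precision_coefficient(y_test, y_test_pred)

# Arbitrary threshold (assumption): a gap above 0.1 is flagged
gap = cp_train - cp_test
print(cp_train, cp_test)
print("possible over-learning" if gap > 0.1 else "similar precision")
```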
We import 'train_test_split' from the module 'model_selection':
from model_selection import train_test_split
In fact, train_test_split isn't a class. It is a function, which is why we are not going to create an object.
train_test_split separates the data into four entities:
- X_train: the matrix of the Independent Variables of the training set
- X_test: the matrix of the Independent Variables of the test set
- y_train: the vector of the Dependent Variable of the training set
- y_test: the vector of the Dependent Variable of the test set
To obtain these four entities, we write
X_train,X_test,y_train,y_test=train_test_split()
We put the arguments of the function train_test_split:
- The matrix of the Independent Variables X
- The vector of the Dependent Variable y
- train_size or test_size (test_size is the proportion of observations of the dataset that we want to put in the test set; in our case we choose 20%)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
It doesn't work! The import fails because we forgot the library scikit-learn:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
We add a last argument, random_state, which fixes the random choice of the training set and the test set so that the split is the same at every run:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=0)
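A runnable sketch of the split on the ten encoded observations follows. The arrays are typed by hand from the tables of this chapter; the second call, with the names X_train_again and so on, is added only to show that a fixed random_state gives the same split again.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Encoded observations: France, Germany, Spain, Age, Salary
X = np.array([
    [1, 0, 0, 44, 72000], [0, 0, 1, 27, 48000], [0, 1, 0, 30, 54000],
    [0, 0, 1, 38, 61000], [0, 1, 0, 40, 63777.778], [1, 0, 0, 35, 58000],
    [0, 0, 1, 38.778, 52000], [1, 0, 0, 48, 79000], [0, 1, 0, 50, 83000],
    [1, 0, 0, 37, 67000],
])
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # 8 training and 2 test observations
print(X_test)

# Same random_state, same split
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_test, X_test_again))
```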
X_train and y_train

France  Germany  Spain  Age      Salary     Purchased
0       1        0      40       63777.778  1
1       0        0      37       67000      1
0       0        1      27       48000      1
0       0        1      38.7778  52000      0
1       0        0      48       79000      1
0       0        1      38       61000      0
1       0        0      44       72000      0
1       0        0      35       58000      1

X_test and y_test

France  Germany  Spain  Age  Salary  Purchased
0       1        0      30   54000   0
0       1        0      50   83000   0

Initial data

France  Germany  Spain  Age     Salary     Purchased
1       0        0      44      72000      No
0       0        1      27      48000      Yes
0       1        0      30      54000      No
0       0        1      38      61000      No
0       1        0      40      63777.778  Yes
1       0        0      35      58000      Yes
0       0        1      38.778  52000      No
1       0        0      48      79000      Yes
0       1        0      50      83000      No
1       0        0      37      67000      Yes
The training set is composed of the matrix of the Independent Variables X_train and the vector of the Dependent Variable y_train. It comprises 8 observations, which represent 80% of the dataset.
- Our Machine Learning model will learn the correlations on the training set in order to predict the behavior of the client (whether or not he will buy the product).
- We will test whether these correlations are good on the test set.
The test set is composed of the matrix of the Independent Variables X_test and the vector of the Dependent Variable y_test. It comprises 2 observations, which represent 20% of the dataset.
- We will predict whether or not the two clients in the test set bought the product. If, for example, the model predicts that these two clients did not buy the product, then we have a precision of 100%.
Here we use a small dataset with 10 observations in order to understand how it works. In reality, we work with much larger datasets, with many more than 8 observations in the training set and 2 observations in the test set.
1.6. Feature Scaling
This is the final step of the Data Preprocessing. It consists of putting the variables on the same scale.

France  Germany  Spain  Age     Salary
1       0        0      44      72000
0       0        1      27      48000
0       1        0      30      54000
0       0        1      38      61000
0       1        0      40      63777.778
1       0        0      35      58000
0       0        1      38.778  52000
1       0        0      48      79000
0       1        0      50      83000
1       0        0      37      67000

Why? So that one variable does not crush the others in the Machine Learning model.
In our dataset, the IV Age does not have the same scale at all as the IV Salary: the Age takes values between 27 and 50 and the Salary takes values between 48000 and 83000.
The Salary can dominate and even crush the variable Age. The latter would then not be taken into account in the model, although it may have an impact on the DV Purchased.
For example, compare the squared differences between the extreme values of each variable, as they would appear in a Euclidean distance between two observations:
(83000-48000)² = 1225 × 10⁶
(50-27)² = 529
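The two squared differences can be checked directly. This is a small sketch; the variable names are introduced here for illustration.

```python
# Extreme values of each variable in the dataset
age_min, age_max = 27, 50
salary_min, salary_max = 48000, 83000

age_term = (age_max - age_min) ** 2
salary_term = (salary_max - salary_min) ** 2

# Share of the squared Euclidean distance that comes from the Salary
salary_share = salary_term / (age_term + salary_term)

print(age_term)
print(salary_term)
print(salary_share)
```

Without scaling, almost all of the distance between the two extreme observations comes from the Salary.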
For the feature scaling we use the class StandardScaler of the module preprocessing of the library scikit-learn:
from sklearn.preprocessing import StandardScaler

Standardisation vs Normalisation
Standardisation rescales data to have a mean equal to 0 and a standard deviation equal to 1 (i.e., a centered reduced variable). It conserves the outliers.
Normalisation rescales the values into the range [0,1]. This might be useful in some cases where all values need the same positive scale. However, the outliers from the data are lost.

Standardisation                            Normalisation
Range mostly [-1;1], some values outside   Range [0;1]
Normal histogram                           Uniform histogram
Conserves the outliers                     Loses the outliers

For most applications, standardisation is recommended.
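The two rescalings can be compared on the Age and Salary columns. This is a sketch: MinMaxScaler is scikit-learn's class for rescaling to [0,1] and is used here only for the comparison; the chapter itself relies on StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and Salary after imputation
data = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, 63777.778],
    [35, 58000], [38.778, 52000], [48, 79000], [50, 83000], [37, 67000],
])

standardised = StandardScaler().fit_transform(data)  # mean 0, std 1 per column
normalised = MinMaxScaler().fit_transform(data)      # min 0, max 1 per column

print(standardised.mean(axis=0).round(6), standardised.std(axis=0).round(6))
print(normalised.min(axis=0), normalised.max(axis=0))
print(standardised.min(axis=0).round(3), standardised.max(axis=0).round(3))
```

The last line shows that standardised values are not confined to [-1;1]: the extreme ages and salaries fall outside that interval.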
How to create a histogram

import matplotlib.pyplot as plt
plt.hist(Y)
plt.title('Histogram',fontsize=10)
plt.show()
Now, we create an object of the class StandardScaler that we call 'sc':
sc=StandardScaler()
Here we have no arguments to enter; we keep the default values.
We must now link our object to the data on which we want to apply feature scaling, that is, the matrix of Independent Variables X. In the previous step, we created the training set and the test set, so:
- The feature scaling is fitted on the matrix of independent variables X_train: our object 'sc' is linked to X_train.
- We compute the mean and the standard deviation of each independent variable of X_train in order to use the standardisation method.
- We apply the transformation method to X_train and to X_test.
Why X_test as well? The training set and the test set have a similar distribution:
Mean(X_train) ≈ Mean(X_test) and Standard Deviation(X_train) ≈ Standard Deviation(X_test)

        Mean(X_train)  Mean(X_test)  Std(X_train)  Std(X_test)
Age     38.472         40            5.8075        10
Salary  62597.22       68500         9534.85       14500

The means are obtained with
np.mean(X_train[:,3:5],axis=0)
or
np.mean(X_train[:,3]) and np.mean(X_train[:,4])
and the standard deviations with
np.sqrt(np.var(X_train[:,3]))
or
np.std(X_train[:,3]) and np.std(X_train[:,4])
These functions, like StandardScaler, divide by the number of observations n. Dividing by n-1 instead gives 6.2085 and 10193.18 for the training set.
X_train before and after scaling:

France  Germany  Spain  Age     Salary     |  France  Germany    Spain      Age        Salary
0       1        0      40      63777.778  |  -1      2.64575    -0.774597  0.263068   0.123815
1       0        0      37      67000      |  1       -0.377964  -0.774597  -0.253501  0.461756
0       0        1      27      48000      |  -1      -0.377964  1.29099    -1.9754    -1.53093
0       0        1      38.778  52000      |  -1      -0.377964  1.29099    0.0526135  -1.11142
1       0        0      48      79000      |  1       -0.377964  -0.774597  1.64059    1.7203
0       0        1      38      61000      |  -1      -0.377964  1.29099    -0.081311  -0.167514
1       0        0      44      72000      |  1       -0.377964  -0.774597  0.951826   0.986148
1       0        0      35      58000      |  1       -0.377964  -0.774597  -0.597881  -0.482149

Means of X_train:                0.5  0.125   0.375   38.472  62597.22
Standard deviations of X_train:  0.5  0.3307  0.4841  5.8075  9534.85

Even the DUMMY variables are scaled. However, you can choose not to scale these variables: they are already on a scale very close to the one obtained by StandardScaler.
The new scale of the Age and the new scale of the Salary are similar.
Most values of each variable now fall within or close to the range [-1,1].
There is no longer a risk that one variable dominates the others in the ML models.
For X_test, neither the fit_transform method nor the fit method is used: the training set and the test set have a similar distribution and our object 'sc' has already been fitted to the training set, so we transform the test set directly using the mean and the standard deviation of the training set.
X_test=sc.transform(X_test)
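The fit-on-train, transform-on-test rule can be sketched on the Age and Salary columns of the split shown in this chapter. The arrays are typed by hand; X_test_refit is a hypothetical counter-example added only to show what refitting on the test set would do.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns of the training and test sets of this chapter
X_train = np.array([
    [40, 63777.778], [37, 67000], [27, 48000], [38.7778, 52000],
    [48, 79000], [38, 61000], [44, 72000], [35, 58000],
])
X_test = np.array([[30, 54000], [50, 83000]], dtype=float)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean_ and scale_ on X_train
X_test_scaled = sc.transform(X_test)        # reuses the training mean and std

print(sc.mean_, sc.scale_)
print(X_test_scaled)

# For contrast only: refitting on the two test rows maps them to -1 and +1
X_test_refit = StandardScaler().fit_transform(X_test)
print(X_test_refit)
```

Refitting on the test set would erase the information that the two test customers lie below and above the training average by different amounts.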
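X_test before and after scaling with the means and standard deviations of the training set:

France  Germany  Spain  Age  Salary  |  France  Germany  Spain      Age       Salary
0       1        0      30   54000   |  -1      2.64575  -0.774597  -1.45883  -0.901663
0       1        0      50   83000   |  -1      2.64575  -0.774597  1.98496   2.13981

For comparison, scaling the test set with its own means (0 1 0 40 68500) and standard deviations (0 1 0 10 14500) would give:

France  Germany  Spain  Age  Salary
..      0        ..     -1   -1
..      0        ..     1    1

All is under the same scale.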
We did not apply feature scaling to the Dependent Variable Y because it takes the values 0 and 1, which are already on a scale comparable to that of the scaled independent variables.
However, in the general case, if the Dependent Variable Y takes very large values, we must apply feature scaling to bring Y to the same scale as the Independent Variables.
Our data is now ready for modeling.

Data Preprocessing Template
# 1) Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# 2) Importing the dataset
dataset=pd.read_csv('Data.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,3].values
# 3) Taking Care of Missing Data
# Older scikit-learn versions:
# from sklearn.preprocessing import Imputer
# imputer=Imputer(missing_values='NaN',strategy='mean',axis=0)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])
# 4) Encoding the Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Older scikit-learn versions:
# labelEncoder_X=LabelEncoder()
# X[:,0]=labelEncoder_X.fit_transform(X[:,0])
# onehotencoder=OneHotEncoder(categorical_features=[0])
# X=onehotencoder.fit_transform(X).toarray()
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
labelencoder_Y=LabelEncoder()
Y=labelencoder_Y.fit_transform(Y)
# 5) Splitting the dataset into Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)
# 6) Feature Scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
Training Set

France  Germany  Spain      Age     Salary   Purchased
-1      2.645    -0.774597  0.2630  0.12381  1
1       -0.377   -0.774597  -0.253  0.46175  1
-1      -0.377   1.29099    -1.975  -1.5309  1
-1      -0.377   1.29099    0.0526  -1.1114  0
1       -0.377   -0.774597  1.6405  1.7203   1
-1      -0.377   1.29099    -0.081  -0.1675  0
1       -0.377   -0.774597  0.9518  0.98614  0
1       -0.377   -0.774597  -0.597  -0.4821  1

Test Set

France  Germany  Spain     Age       Salary   Purchased
-1      2.645    -0.77459  -1.45883  -0.9016  0
-1      2.645    -0.77459  1.98496   2.1398   0