
Data Science

Mahdi Louati
3 GLID
September, 19th 2022
Contents

0. Welcome to Machine Learning
   0.1. Why Machine Learning is the Future
   0.2. What is Machine Learning
   0.3. Installing Python and Anaconda

1. Data Preprocessing
   1.1. Importing the Libraries
   1.2. Importing the Dataset
   1.3. Missing Data
   1.4. Categorical Data
   1.5. Training Set and Test Set
   1.6. Feature Scaling

2. Regression Models
   2.1. Simple Linear Regression (SLR)
   2.2. Multiple Linear Regression (MLR)
   2.3. Polynomial Regression
   2.4. Support Vector Regression (SVR)
   2.5. Decision Tree Regression
   2.6. Random Forest Regression
   2.7. Evaluating Regression Models

3. Classification Models
   3.1. Logistic Regression
   3.2. K-Nearest Neighbors
   3.3. Support Vector Machine (SVM)
   3.4. Kernel SVM
   3.5. Naïve Bayes
   3.6. Decision Tree Classification

4. Clustering
   4.1. K-Means Clustering
   4.2. Hierarchical Clustering

5. Dimensionality Reduction
   5.1. Principal Component Analysis (PCA)
   5.2. Linear Discriminant Analysis (LDA)
   5.3. Kernel PCA

6. Reinforcement Learning
   6.1. Upper Confidence Bound (UCB)
   6.2. Thompson Sampling

7. Natural Language Processing (NLP)

8. Deep Learning
   8.1. Artificial Neural Networks
   8.2. Convolutional Neural Networks
Section 1

01 Data Preprocessing
1.1. Importing the Libraries
1.2. Importing the Dataset
1.3. Missing Data
1.4. Categorical Data
1.5. Training Set and Test Set
1.6. Feature Scaling
How to prepare your dataset so that your future model of Machine Learning will learn in the best
conditions?

Import the Libraries → Import the Dataset → Missing Data → Categorical Data → Training Set and Test Set → Feature Scaling

Categorical Data: encoding the variables (nominal variable, ordinal variable)
Feature Scaling: put the variables under the same scale
Spyder opens with a temp.py file: open a new file and delete the temp.py file.
Save the new file in the same folder as the dataset.
Remove the part containing the date, the author…
Put a hash sign (#) in front of the title of this part (it is a comment: it is not executed).

# Importing the Libraries


1. Data Preprocessing

1.1. Importing the Libraries


NumPy is the fundamental package for scientific computing with Python. It is an extension of
the Python programming language, intended to manipulate matrices or multidimensional arrays
as well as mathematical functions operating on these arrays.

Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. Its module pyplot makes plotting easy by providing features to control line styles, font properties, axis formatting, etc.

Pandas is one of the most popular Python libraries for Data Science. It is the “SQL of Python”.

Why? Pandas helps you to manage two-dimensional (or more) data tables in Python.
To import the libraries:
 - use the command import,
 - put the name of the library,
 - put a shortcut name for the library after the keyword as.

# Import the Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
1. Data Preprocessing

1.2. Importing the dataset


To import the dataset:
 - use the Pandas library (through the shortcut name of this library: pd),
 - use the import function 'read_csv' (or 'read_excel' if you have an Excel file),
 - insert the name of the data file,
 - give a variable name to the result.

dataset = pd.read_csv('Data.csv')
dataset = pd.read_excel("test1.xls")

Execute this line of code.
See the 'variable explorer' and double-click on dataset. We get the dataset:

Country   Age    Salary   Purchased
France    44     72000    No
Spain     27     48000    Yes
Germany   30     54000    No
Spain     38     61000    No
Germany   40     ------   Yes
France    35     58000    Yes
Spain     ----   52000    No
France    48     79000    Yes
Germany   50     83000    No
France    37     67000    Yes

Three Independent Variables (Country, Age and Salary)
One Dependent Variable (Purchased)
Goal: predict whether the customer purchases the product or not.
The Business Scenario

The company has the information to know whether each customer has bought the product or not.

The company tries to establish correlations between the Country, the Age, the Salary and the decision to buy the product or not.

Independent Variables (predictive variables): they are used to predict the client's decision.

Dependent Variable (variable to predict): the decision of the customer.
Create the matrix of the Independent Variables
It contains the values of the first three variables for all the lines.
Use the Pandas indexer 'iloc' and specify the indices to recover:
 - first the observation lines: the indices of the lines of the dataset,
 - then the observation columns: the indices of the columns of the dataset.

X=dataset.iloc[:,:-1]

The first ':' chooses all the lines; ':-1' chooses all the columns except the last one.

The same matrix can be written with explicit column indices. The upper bound of a slice is excluded, so 0:2 recovers only the first two columns and 0:3 recovers the first three:

X=dataset.iloc[:,0:2]
X=dataset.iloc[:,0:3]

Create the vector of the Dependent Variable
It contains the response of the customer ('Yes' or 'No') for each observation.

y=dataset.iloc[:,3]

y=dataset.iloc[:,-1]

Add .values to take the values of the rows and columns as a NumPy array (for example, if Y is a numerical variable):

y=dataset.iloc[:,-1].values
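The slicing rules above can be checked on a small table. This is only a sketch: the DataFrame below is a hypothetical in-memory copy of the first four rows of the dataset (so no Data.csv file is needed), and the names X_two and X_three are illustrative.

```python
import pandas as pd

# Hypothetical in-memory copy of the first rows of the dataset (no file needed).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Salary": [72000, 48000, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "No"],
})

X = dataset.iloc[:, :-1]        # all lines, all columns except the last one
X_two = dataset.iloc[:, 0:2]    # upper bound excluded: only Country and Age
X_three = dataset.iloc[:, 0:3]  # Country, Age and Salary (same content as X)
y = dataset.iloc[:, -1].values  # last column, as a NumPy array

print(list(X.columns))
print(list(X_two.columns))
print(y)
```

Printing the column names makes the excluded upper bound visible: 0:2 stops before the column Salary.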
1. Data Preprocessing
1.3. Missing values

The dataset has two missing values (MD: missing data):

    Country   Age    Salary   Purchased
0   France    44     72000    No
1   Spain     27     48000    Yes
2   Germany   30     54000    No
3   Spain     38     61000    No
4   Germany   40     ------   Yes
5   France    35     58000    Yes
6   Spain     ----   52000    No
7   France    48     79000    Yes
8   Germany   50     83000    No
9   France    37     67000    Yes

Which value replaces the missing data?
 - The dataset is Gaussian distributed without outliers: the mean of the desired column.
 - The dataset isn't Gaussian or has several outliers: the median of the desired column (large outliers cause a bias in the data, so the mean is not significant and is irrelevant).
 - In the general case, we use the median.

Age: between 27 and 50, normal distribution. Replace the MD by the mean: MD(Age) = 38.777778.
If the Age were between 30 and 50 with three persons who are 80 years old, we would use the median.

Salary: between 48000 € and 83000 €, no outliers and Gaussian distributed. Replace this MD by the mean of the Salary column: MD(Salary) = 63777.77778.

    Country   Age      Salary      Purchased
0   France    44       72000       No
1   Spain     27       48000       Yes
2   Germany   30       54000       No
3   Spain     38       61000       No
4   Germany   40       63777.778   Yes
5   France    35       58000       Yes
6   Spain     38.778   52000       No
7   France    48       79000       Yes
8   Germany   50       83000       No
9   France    37       67000       Yes
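The choice between mean and median can be illustrated numerically. The sketch below uses the nine known ages of the dataset; the three 80-year-old customers are a hypothetical addition that mimics the outlier case described above.

```python
import numpy as np

# Ages of the dataset; np.nan marks the missing value.
ages = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])

mean_age = np.nanmean(ages)      # mean of the 9 known ages
median_age = np.nanmedian(ages)  # median of the 9 known ages

# Hypothetical variant: three 80-year-old customers join the dataset.
ages_with_outliers = np.append(ages, [80, 80, 80])
mean_out = np.nanmean(ages_with_outliers)
median_out = np.nanmedian(ages_with_outliers)

print(mean_age, median_age)
print(mean_out, median_out)
```

Without outliers the mean and the median are close. Once the three outliers are added, the mean moves much further than the median, which is why the median is preferred in that case.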
The library scikit-learn → the module preprocessing → the class Imputer

from sklearn.preprocessing import Imputer

Create an object 'imputer' of the Imputer class that allows us to replace the missing data by the mean.

The class Imputer admits some parameters:
 - missing_values='NaN'
 - strategy='mean'
 - axis=0 (replace the missing data by the mean of the column) or axis=1 (replace the missing data by the mean of the line)

imputer=Imputer(missing_values='NaN',strategy='mean',axis=0)

From scikit-learn 0.22 on, the class Imputer no longer exists. Use the class SimpleImputer of the module impute, which always works column by column:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

#Count the number of missing values for each column
print(dataset.isnull().sum())

We fit the object 'imputer' to X: take all the lines and choose the indices of the columns that contain the missing data.

imputer.fit(X.iloc[:,[1,2]])
or
imputer.fit(X.iloc[:,1:3])

We use the method 'transform':

X.iloc[:,1:3]=imputer.transform(X.iloc[:,1:3])

#Taking care of missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer=imputer.fit(X.iloc[:,1:3])
X.iloc[:,1:3]=imputer.transform(X.iloc[:,1:3])

Result:

    Country   Age      Salary
0   France    44       72000
1   Spain     27       48000
2   Germany   30       54000
3   Spain     38       61000
4   Germany   40       63777.778
5   France    35       58000
6   Spain     38.778   52000
7   France    48       79000
8   Germany   50       83000
9   France    37       67000
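The fit and transform steps can be run end to end as follows. This is a sketch, assuming a recent scikit-learn where SimpleImputer is available; the DataFrame is an in-memory copy of the independent variables rather than the result of reading Data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# In-memory copy of the independent variables; np.nan marks the missing data.
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Count the number of missing values for each column.
print(X.isnull().sum())

# Fit on the two numerical columns and replace each NaN by the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X.iloc[:, 1:3] = imputer.fit_transform(X.iloc[:, 1:3])

print(X)
```

Here fit_transform combines the two calls fit and transform shown in the slides; the imputed values are the means of the known ages and of the known salaries.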
1. Data Preprocessing
1.4. Categorical data

    Country   Age      Salary      Purchased
0   France    44       72000       No
1   Spain     27       48000       Yes
2   Germany   30       54000       No
3   Spain     38       61000       No
4   Germany   40       63777.778   Yes
5   France    35       58000       Yes
6   Spain     38.778   52000       No
7   France    48       79000       Yes
8   Germany   50       83000       No
9   France    37       67000       Yes

The modalities of the variable Country are not numerical: France, Germany, Spain. Country is a Categorical variable.

The variable Purchased is also Categorical, with two categories: Yes and No.

The variables Age and Salary are not Categorical.
Categorical variables
We have to manage the Categorical variables.

Machine learning models are based on mathematical equations. If we keep the texts France, Germany, Spain, Yes and No, we cannot put them into mathematical equations.

 - Encode the Categorical variable Country, written as text, in numeric values.
 - Encode the Categorical variable Purchased: for example, No becomes 0 and Yes becomes 1.

Country is nominal: there is no order between the categories (France, Germany and Spain). Encode this variable by the method of the DUMMY variable.

In statistics and particularly in regression analysis, a dummy variable (an indicator variable, Boolean indicator or binary variable) is one that takes the value 0 or 1 to indicate the absence or the presence of some categorical effect. It is used when there is no order relationship between the modalities of the variable to be encoded.

DUMMY Encoding

Country   France   Germany   Spain
France    1        0         0
Spain     0        0         1
Germany   0        1         0
Spain     0        0         1
Germany   0        1         0
France    1        0         0
Spain     0        0         1
France    1        0         0
Germany   0        1         0
France    1        0         0
Encoding Country with a single number per category (France 0, Germany 1, Spain 2) implies the order France < Germany < Spain:

Country   Encoded
France    0
Spain     2
Germany   1
Spain     2
Germany   1
France    0
Spain     2
France    0
Germany   1
France    0

This is not good because this order relation will be taken into account in our equations and will cause a bias, since in reality this relation is false.

We create three columns (one column for each country) with ones and zeros as values. This is the DUMMY variable or the 'one-hot encoding'.

For the variable Purchased we encode directly (Yes 1 and No 0).
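The difference between the two encodings can be seen with pandas alone. Note that this sketch swaps in pandas helpers (category codes and get_dummies) for illustration; it is not the scikit-learn workflow of this course.

```python
import pandas as pd

country = pd.Series(
    ["France", "Spain", "Germany", "Spain", "Germany",
     "France", "Spain", "France", "Germany", "France"],
    name="Country",
)

# One integer per category (alphabetical order): this implies a false order.
codes = country.astype("category").cat.codes

# One column of zeros and ones per category: no order between the countries.
dummies = pd.get_dummies(country, dtype=int)

print(codes.tolist())
print(dummies)
```

Each line of dummies contains exactly one 1, in the column of its country, so no country is "greater" than another.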


We import the classes 'LabelEncoder' and 'OneHotEncoder' at the same time from the module preprocessing of the library scikit-learn.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

We create an object of each class, starting with LabelEncoder. There are no parameters:

labelencoder=LabelEncoder()

One such object is created for each of the two variables Country and Purchased. We start with Country:

labelencoder_X=LabelEncoder()

We fit the object to the column that we will transform, to recover the information (as we did for the object imputer).

We take all the lines and the desired column of the matrix X, which is indexed 0 (i.e., X[:,0]):

X[:,0] = labelencoder_X.fit_transform(X[:,0])

In fit_transform, we put the column of X that we will encode.

The method fit_transform does both steps at the same time: it fits the object labelencoder_X to the column Country and then transforms it into the numerical values 0, 1 and 2.
from sklearn.preprocessing import LabelEncoder
labelEncoder_X=LabelEncoder()
X.iloc[:,0]=labelEncoder_X.fit_transform(X.iloc[:,0])

The class LabelEncoder transforms the text France, Germany and Spain into the numerical values 0, 1 and 2:

Country   Age       Salary
0         44.0      72000.0
2         27.0      48000.0
1         30.0      54000.0
2         38.0      61000.0
1         40.0      63777.7778
0         35.0      58000.0
2         38.7778   52000.0
0         48.0      79000.0
1         50.0      83000.0
0         37.0      67000.0

We must consider this step because, in older versions of scikit-learn, the class OneHotEncoder cannot be used directly on text.

Now, we encode the column Country using the class OneHotEncoder.

The variable Purchased is a DV, so we encode it directly: we don't use the object onehotencoder for it.
We don't specify X when we create the object onehotencoder.

onehotencoder=OneHotEncoder()

We need to add the argument 'categorical_features'.

The argument categorical_features contains the index of the column that we will encode as a DUMMY variable. We only have the first column to encode, so we put 0 between square brackets:

onehotencoder=OneHotEncoder(categorical_features=[0])

Now, we make the connection between the matrix X and the object onehotencoder. We use the method 'fit_transform', which fits our object onehotencoder to X by taking only the first column (since we specified the index 0) and then transforms it. This creates the three columns, one for each country:

X=onehotencoder.fit_transform(X).toarray()

We add toarray() to get the result as an array (a table).
# Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Either (scikit-learn before 0.22): label encoding, then one-hot encoding
labelEncoder_X=LabelEncoder()
X[:,0]=labelEncoder_X.fit_transform(X[:,0])
onehotencoder=OneHotEncoder(categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
# Or (recent versions, where categorical_features no longer exists): ColumnTransformer
ct = ColumnTransformer([("State", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
Country   France   Germany   Spain   Age     Salary
France    1.00     0.00      0.00    44.00   72000.00
Spain     0.00     0.00      1.00    27.00   48000.00
Germany   0.00     1.00      0.00    30.00   54000.00
Spain     0.00     0.00      1.00    38.00   61000.00
Germany   0.00     1.00      0.00    40.00   63777.77
France    1.00     0.00      0.00    35.00   58000.00
Spain     0.00     0.00      1.00    38.77   52000.00
France    1.00     0.00      0.00    48.00   79000.00
Germany   0.00     1.00      0.00    50.00   83000.00
France    1.00     0.00      0.00    37.00   67000.00

Now, the matrix X is ready to be integrated into equations to define some models of machine learning.
We transform the modalities of the DV y into numerical values using the class LabelEncoder.

labelencoder_y=LabelEncoder()

It is a new object created for the Dependent Variable y.

y=labelencoder_y.fit_transform(y)

Purchased   Encoded
No          0
Yes         1
No          0
No          0
Yes         1
Yes         1
No          0
Yes         1
No          0
Yes         1
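The encoding of X and y can be sketched on three observations, assuming a recent scikit-learn in which OneHotEncoder accepts text and is applied through ColumnTransformer. The transformer name "country" and the conversion to a float array are choices made for this illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# First three observations of the dataset (Country, Age, Salary).
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)
y = np.array(["No", "Yes", "No"])

# One-hot encode column 0 and pass Age and Salary through unchanged.
ct = ColumnTransformer([("country", OneHotEncoder(), [0])],
                       remainder="passthrough")
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)

# A separate LabelEncoder object for the dependent variable: No -> 0, Yes -> 1.
labelencoder_y = LabelEncoder()
y_encoded = labelencoder_y.fit_transform(y)

print(X_encoded)
print(y_encoded)
```

The three dummy columns come first, in alphabetical order of the categories (France, Germany, Spain), followed by Age and Salary.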
1. Data Preprocessing

1.5. Splitting the Dataset into Training Set and Test Set

    Country   Age     Salary     Purchased
0   France    44      72000      No
1   Spain     27      48000      Yes
2   Germany   30      54000      No
3   Spain     38      61000      No
4   Germany   40      63777.78   Yes
5   France    35      58000      Yes
6   Spain     38.78   52000      No
7   France    48      79000      Yes
8   Germany   50      83000      No
9   France    37      67000      Yes

Why split the dataset into a training set and a test set?
When we build a Machine Learning model:
 • The model learns correlations between the Independent Variables and the Dependent Variable.
 • This is done on a subset of the original dataset: the training set. It represents 70% to 80% of the original dataset.
 • To verify that there is no over-learning (i.e., that the model did not simply memorize the correlations), we must have a test set.
 • The test set is smaller than the training set. It represents 20% to 30% of the original dataset.
 • The model must be tested on new observations (i.e., observations outside the correlations that the model has learned).

How does it work?
 • We build our ML model on the training set (it learns the correlations).
 • The model makes predictions on the same observations of the training set (i.e., for each observation of the training set, it predicts whether the corresponding client buys the product or not, so it predicts 1 or 0).
 • We count the number of errors committed by the model.
 • We divide the number of errors by the total number of observations of the training set. This is the error rate, and one minus the error rate gives the Precision Coefficient (CP) of the training set: CP(training set).
 • We establish new predictions on the test set.
 • We calculate the Precision Coefficient of the model on the test set: CP(test set).
 • Finally we compare the two precisions.

 • If CP(training set) and CP(test set) are similar (very close):
   There was a good training of the model, i.e., the model established good correlations.
   The model generalizes well.

 • If CP(training set) and CP(test set) are not similar (CP(training set) > CP(test set)):
   There was an over-learning on the training set (i.e., on the correlations of the training set).
   The model is unable to predict new observations correctly.
   The model memorized the training set instead of learning the correlations.
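The comparison of the two precisions can be sketched as follows. The function name precision_coefficient and the predictions are hypothetical; the coefficient is taken here as the share of correct predictions, i.e., one minus the number of errors divided by the number of observations.

```python
import numpy as np

def precision_coefficient(y_true, y_pred):
    """Share of correct predictions: 1 - (number of errors / number of observations)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    errors = np.sum(y_true != y_pred)
    return 1.0 - errors / len(y_true)

# Hypothetical predictions of a model on the training set (no error)...
y_train = [1, 1, 1, 0, 1, 0, 0, 1]
pred_train = [1, 1, 1, 0, 1, 0, 0, 1]
# ...and on the test set (one error out of two observations).
y_test = [0, 0]
pred_test = [1, 0]

cp_train = precision_coefficient(y_train, pred_train)
cp_test = precision_coefficient(y_test, pred_test)

print(cp_train, cp_test)
if cp_train - cp_test > 0.1:  # illustrative threshold, not a standard value
    print("CP(training set) > CP(test set): possible over-learning")
```

With only two test observations the gap is not statistically meaningful; the point is only to show how the two coefficients are computed and compared.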
We import 'train_test_split' from the module 'model_selection':

from model_selection import train_test_split

In fact, train_test_split isn't a class. It is a function, which is why we are not going to create objects.

train_test_split splits the data into four entities:
 - X_train: the matrix of the Independent Variables of the training set
 - X_test: the matrix of the Independent Variables of the test set
 - y_train: the vector of the Dependent Variable of the training set
 - y_test: the vector of the Dependent Variable of the test set

To obtain these four entities, we write
X_train,X_test,y_train,y_test=train_test_split()

We put the arguments of the function train_test_split:
 • The matrix of the Independent Variables X
 • The vector of the Dependent Variable y
 • train_size or test_size (test_size is the proportion of observations of the dataset that we want to put in the test set; in our case we choose 20%)

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

from model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=0)

It doesn't work!!!
We forgot the library scikit-learn:

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

We add a last argument, random_state, to fix the random choice of the training set and the test set, so that the split is the same at each execution:

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=0)
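The effect of test_size and random_state can be checked on placeholder data. In this sketch X is a hypothetical 10 × 2 matrix (not the encoded dataset) and y reuses the encoded values of Purchased.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder matrix of 10 observations and the encoded Purchased vector.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# 20% of the 10 observations go to the test set: 8 for training, 2 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Same random_state, same split: the choice of observations is reproducible.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```

Without random_state, two executions would usually put different observations in the test set.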
Initial Data

Country (France Germany Spain)   Age      Salary      Purchased
1 0 0                            44       72000       No
0 0 1                            27       48000       Yes
0 1 0                            30       54000       No
0 0 1                            38       61000       No
0 1 0                            40       63777.778   Yes
1 0 0                            35       58000       Yes
0 0 1                            38.778   52000       No
1 0 0                            48       79000       Yes
0 1 0                            50       83000       No
1 0 0                            37       67000       Yes

X_train and Y_train

Country (France Germany Spain)   Age       Salary      Purchased
0 1 0                            40        63777.778   1
1 0 0                            37        67000       1
0 0 1                            27        48000       1
0 0 1                            38.7778   52000       0
1 0 0                            48        79000       1
0 0 1                            38        61000       0
1 0 0                            44        72000       0
1 0 0                            35        58000       1

X_test and Y_test

Country (France Germany Spain)   Age   Salary   Purchased
0 1 0                            30    54000    0
0 1 0                            50    83000    0
The training set is composed of the matrix of the Independent Variables X_train and the vector of the Dependent Variable y_train. It comprises 8 observations, which represent 80% of the dataset.

 • Our Machine Learning model will learn the correlations on the training set in order to make predictions on the behavior of the client (whether he will buy the product or not).
 • We will test whether they are good correlations on the test set.

The test set is composed of the matrix of the Independent Variables X_test and the vector of the Dependent Variable y_test. It comprises 2 observations, which represent 20% of the dataset.

 • We will predict whether the two clients composing the test set have bought the product or not. If, for example, the model predicts that these two clients did not buy the product, then we will have a precision of 100%.

Here we propose a small dataset with 10 observations in order to understand how it works. In reality, we will work with much larger datasets. We will have more than 8 observations in the training set and more than 2 observations in the test set.
1. Data Preprocessing
1.6. Feature Scaling

Final step in the Data Preprocessing. It consists in putting the variables on the same scale.

Country (France Germany Spain)   Age      Salary
1 0 0                            44       72000
0 0 1                            27       48000
0 1 0                            30       54000
0 0 1                            38       61000
0 1 0                            40       63777.778
1 0 0                            35       58000
0 0 1                            38.778   52000
1 0 0                            48       79000
0 1 0                            50       83000
1 0 0                            37       67000

Why? So that one variable does not crush the others in the Machine Learning model.

In our dataset, the IV Age does not have the same scale at all as the IV Salary: the Age takes values between 27 and 50 and the Salary takes values between 48000 and 83000. It is not the same scale.

The Salary can dominate and even crush the variable Age. The latter would then not be taken into account in the model, although it may have an impact on the DV Purchased.
Age       Salary
44        72000
27        48000
30        54000
38        61000
40        63777.7778
35        58000
38.7778   52000
48        79000
50        83000
37        67000

Squared difference between the largest and the smallest value of each variable:
(83000 − 48000)² = 1225 × 10⁶
(50 − 27)² = 529
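One way to read these squared differences is as the two terms of a squared Euclidean distance between two customers. This reading is an assumption of the sketch below, since the slides do not name a distance; the two customers are the rows (27, 48000) and (50, 83000) of the dataset.

```python
import numpy as np

# Two observations of the dataset, as (Age, Salary).
customer_a = np.array([27.0, 48000.0])
customer_b = np.array([50.0, 83000.0])

# Squared difference per variable: the two terms of the squared distance.
squared_diff = (customer_b - customer_a) ** 2
age_term, salary_term = squared_diff

# Share of the squared distance that comes from the Salary alone.
salary_share = salary_term / squared_diff.sum()

print(age_term, salary_term)
print(salary_share)
```

Without feature scaling, almost the entire distance comes from the Salary, so the Age has practically no influence on a model that relies on such distances.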
For the feature scaling we will use: the library scikit-learn → the module preprocessing → the class StandardScaler

from sklearn.preprocessing import StandardScaler

Standardisation vs Normalisation

Standardisation rescales data to have a mean equal to 0 and a standard deviation equal to 1 (i.e., a reduced centered variable). However, we conserve the outliers.

Normalisation rescales the values into the range [0,1]. This might be useful in some cases where all values need the same positive scale. However, the outliers from the data are lost.

             Standardisation                       Normalisation
Range        mostly [-1;1], some values outside    [0;1]
Histogram    Normal                                Uniform
Outliers     Conserved                             Lost

For most applications standardisation is recommended.
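The two rescalings can be written directly with NumPy. This is a sketch on the nine known salaries of the dataset, using the usual formulas: subtract the mean and divide by the standard deviation for standardisation, subtract the minimum and divide by the range for normalisation.

```python
import numpy as np

# The nine known salaries of the dataset.
salary = np.array([72000, 48000, 54000, 61000, 58000,
                   52000, 79000, 83000, 67000], dtype=float)

# Standardisation: mean 0 and standard deviation 1.
standardised = (salary - salary.mean()) / salary.std()

# Normalisation (min-max): values in the range [0, 1].
normalised = (salary - salary.min()) / (salary.max() - salary.min())

print(standardised.round(3))
print(normalised.round(3))
```

The standardised values are centred on 0 and some of them fall outside [-1, 1], while the normalised values all stay between 0 and 1.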

How to create a histogram

import matplotlib.pyplot as plt
plt.hist(Y)
plt.title('Histogram',fontsize=10)
plt.show()
Now, we create an object of the class StandardScaler that we call 'sc':

sc=StandardScaler()

Here we have no arguments to enter; we keep the default values.

We must now link our object to the data on which we want to apply feature scaling, that is, the matrix of Independent Variables X.

But in the previous step, we split X into the training set and the test set.

The feature scaling will be applied to the matrix of independent variables X_train (i.e., it is on X_train that our object 'sc' will be fitted).

We compute the mean and the standard deviation of each independent variable of X_train in order to use the standardisation method. Our object 'sc' is thus linked to X_train.

We apply the transformation method to X_train and to X_test.

Why X_test as well?
The training set and the test set have a similar distribution:
Mean(X_train) ≈ Mean(X_test) and Standard Deviation(X_train) ≈ Standard Deviation(X_test)

         Mean(X_train)   Mean(X_test)   Std(X_train)         Std(X_test)
Age      38.472          40             6.2085308246         10
Salary   62597.22        68500          10193.183997162025   14500

Means:
np.mean(X_train[:,3:5],axis=0)
or
np.mean(X_train[:,3]) and np.mean(X_train[:,4])

Standard deviations:
np.sqrt(np.var(X_train[:,3]))
or
np.std(X_train[:,3]) and np.std(X_train[:,4])

X_train before standardisation (Country, Age, Salary):

0   1   0   40       63777.778
1   0   0   37       67000
0   0   1   27       48000
0   0   1   38.778   52000
1   0   0   48       79000
0   0   1   38       61000
1   0   0   44       72000
1   0   0   35       58000

Means:                0.5   0.125                 0.375                38.472         62597.22
Standard deviations:  0.5   0.33071891388307384   0.4841229182759271   6.2085308246   10193.183997162025

X_train after standardisation (Country, Age, Salary):

-1    2.64575    -0.774597    0.263068    0.123815
 1   -0.377964   -0.774597   -0.253501    0.461756
-1   -0.377964    1.29099    -1.9754     -1.53093
-1   -0.377964    1.29099     0.0526135  -1.11142
 1   -0.377964   -0.774597    1.64059     1.7203
-1   -0.377964    1.29099    -0.081311   -0.167514
 1   -0.377964   -0.774597    0.951826    0.986148
 1   -0.377964   -0.774597   -0.597881   -0.482149
Even the DUMMY variables are scaled.
However, you can choose not to scale these variables. Anyway, they are already on a scale very close to that obtained by StandardScaler.

The new scale of the Age and the new scale of the Salary are similar.

Most of the values of each variable are now in the range [-1,1].

No variable risks dominating the others in the ML models.

For X_test, neither the fit_transform method nor the fit method is used: the training set and the test set have a similar distribution and our object 'sc' has already been fitted to the training set, so we transform the test set directly using the mean and the standard deviation of the training set.

X_test=sc.transform(X_test)
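The rule "fit on the training set, transform both sets" can be sketched on the Age and Salary columns only. The dummy columns are left out here for brevity, which is a simplification of this illustration, so the printed values are not meant to reproduce the tables of the slides.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the 8 training observations and of the 2 test observations.
X_train = np.array([
    [40.0, 63777.778], [37.0, 67000.0], [27.0, 48000.0], [38.778, 52000.0],
    [48.0, 79000.0], [38.0, 61000.0], [44.0, 72000.0], [35.0, 58000.0],
])
X_test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns the means and standard deviations
X_test_scaled = sc.transform(X_test)        # reuses them, no new fit

print(sc.mean_)   # means of the training set
print(sc.scale_)  # standard deviations of the training set
print(X_test_scaled)
```

Calling fit_transform on X_test instead would scale the test set with its own mean and standard deviation, so the two sets would no longer be on the same scale.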

X_test (Country, Age, Salary):

0   1   0   30   54000
0   1   0   50   83000

Scaled with the means and standard deviations of the training set:

-1   2.64575   -0.774597   -1.45883   -0.901663
-1   2.64575   -0.774597    1.98496    2.13981

Scaled with the means (0 1 0 40 68500) and standard deviations (0 1 0 10 14500) of the test set itself:

..   0   ..   -1   -1
..   0   ..    1    1

All is under the same scale.
We did not apply feature scaling to the Dependent Variable Y because it takes the values 0 and 1, which are already on the same scale as those of the independent variables.

However, in the general case, if the Dependent Variable Y takes very large values, we must apply feature scaling to bring Y to the same scale as the Independent Variables.

Our data is now ready for modeling.


Data Preprocessing Template

1) Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2) Importing the dataset
dataset=pd.read_csv('Data.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,3].values

3) Taking Care of Missing Data
# scikit-learn before 0.22:
#   from sklearn.preprocessing import Imputer
#   imputer=Imputer(missing_values='NaN',strategy='mean',axis=0)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])

4) Encoding the Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# scikit-learn before 0.22:
#   labelEncoder_X=LabelEncoder()
#   X[:,0]=labelEncoder_X.fit_transform(X[:,0])
#   onehotencoder=OneHotEncoder(categorical_features=[0])
#   X=onehotencoder.fit_transform(X).toarray()
# Country is the column of index 0 in this dataset
ct = ColumnTransformer([("State", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
labelencoder_Y=LabelEncoder()
Y=labelencoder_Y.fit_transform(Y)

5) Splitting the dataset into Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)

6) Feature Scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
# transform only: the test set is scaled with the mean and standard deviation of the training set
X_test=sc.transform(X_test)
Original dataset

    Country   Age    Salary   Purchased
0   France    44     72000    No
1   Spain     27     48000    Yes
2   Germany   30     54000    No
3   Spain     38     61000    No
4   Germany   40     -----    Yes
5   France    35     58000    Yes
6   Spain     ---    52000    No
7   France    48     79000    Yes
8   Germany   50     83000    No
9   France    37     67000    Yes

Training Set

Country (three columns)       Age       Salary    Purchased
-1    2.645   -0.774597        0.2630    0.12381   1
 1   -0.377   -0.774597       -0.253     0.46175   1
-1   -0.377    1.29099        -1.975    -1.5309    1
-1   -0.377    1.29099         0.0526   -1.1114    0
 1   -0.377   -0.774597        1.6405    1.7203    1
-1   -0.377    1.29099        -0.081    -0.1675    0
 1   -0.377   -0.774597        0.9518    0.98614   0
 1   -0.377   -0.774597       -0.597    -0.4821    1

Test Set

Country (three columns)       Age        Salary    Purchased
-1    2.645   -0.77459        -1.45883   -0.9016   0
-1    2.645   -0.77459         1.98496    2.1398   0
