
Data Science

Mahdi Louati
3 GLID
October, 11th 2022
Contents
0. Welcome to Machine Learning
0.1. Why Machine Learning is the Future
0.2. What is Machine Learning
0.3. Installing Python and Anaconda

1. Data Preprocessing
1.1. Importing the Libraries
1.2. Importing the Dataset
1.3. Missing Data
1.4. Categorical Data
1.5. Training Set and Test Set
1.6. Feature Scaling

2. Regression Models
2.1. Simple Linear Regression (SLR)
2.2. Multiple Linear Regression (MLR)
2.3. Polynomial Regression
2.4. Support Vector Regression (SVR)
2.5. Decision Tree Regression
2.6. Random Forest Regression
2.7. Evaluating Regression Models

3. Classification Models
3.1. Logistic Regression
3.2. K-Nearest Neighbors
3.3. Support Vector Machine (SVM)
3.4. Kernel SVM
3.5. Naïve Bayes
3.6. Decision Tree Classification

4. Clustering
4.1. K-Means Clustering
4.2. Hierarchical Clustering

5. Dimensionality Reduction
5.1. Principal Component Analysis (PCA)
5.2. Linear Discriminant Analysis (LDA)
5.3. Kernel PCA

6. Reinforcement Learning
6.1. Upper Confidence Bound (UCB)
6.2. Thompson Sampling

7. Natural Language Processing (NLP)

8. Deep Learning
8.1. Artificial Neural Networks
8.2. Convolutional Neural Networks
Section 2

02 Regression Models
2.1. Simple Linear Regression (SLR)
2.2. Multiple Linear Regression (MLR)
2.3. Polynomial Regression
2.4. Support Vector Regression (SVR)
2.5. Decision Tree Regression
Regression models are used to predict a continuous real value, such as a salary, the price of an apartment, etc. If your independent variable is time, then your model predicts future values; otherwise, your model predicts unknown values.
2. Regression Models
2.1. Simple Linear Regression (SLR)
Linear regression is the simplest and most preferred predictive model when it works, because it is easily interpretable, unlike nonlinear models that keep their secrets if we stick to their coefficients.

We will discover the basics of the Simple Linear Regression (SLR) model.

In the SLR model, we predict the value of one variable Y based on another variable X.

X is called the Independent Variable (IV) and Y is the Dependent Variable (DV).

Why is it called simple? Because it examines the relationship between only two variables.

Why is it called linear? Because the relationship between X and Y is modeled by a straight line.
The Salary_Data.csv dataset (years of experience and salary):

    Experience   Salary
0   1.1          39343
1   1.3          46205
2   2.0          43525
3   2.2          39891
4   2.9          56642
5   3.0          60150
6   3.7          57189
7   4.5          61111
8   5.1          66029
9   6.0          93940
10  7.1          98273
11  9.0          1055
12  9.5          11696

Importing the dataset:

import pandas as pd
dataset=pd.read_csv('Salary_Data.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

Visualizing the data points:

import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Salary_Data.csv')
X=dataset.iloc[:,:-1]
Y=dataset.iloc[:,-1]
plt.scatter(X,Y,color='red')
plt.title('Salary vs Experience')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.show()
To obtain a relationship between the years of experience and the salary, i.e. a mathematical model, we need to find the straight line that best represents the data points.

Dependent Variable (DV)        Independent Variable (IV)

Simple Linear Regression:   Y = b₀ + b₁X

b₀ is the constant and b₁ is the coefficient: b₁ is the slope of the line and b₀ is the Y-intercept. Once the coefficients b₀ and b₁ are determined, you have obtained the SLR model.

Salary = b₀ + b₁ × Experience
Grade of a student = b₀ + b₁ × Number of working hours

There is a correlation between the IV X and the DV Y.
b₀ is the value of Y when X = 0.
The Experience is on the x-axis and the Salary is on the y-axis.

Salary = b₀ + b₁ × Experience

The coefficient b₁ is the slope of the line.

How does the salary evolve according to the number of years of experience?

The SLR model is used to model the salary progression as a function of the number of years of work. It is based on real observations that we want to integrate into our model.

We look for the line closest to the observations.


Ordinary Least Squares Method

How does the SLR model find the best line that approaches all the observations?

S = Σᵢ (Yᵢ − Ŷᵢ)²    where Yᵢ is the observed value and Ŷᵢ = b₀ + b₁Xᵢ is the predicted value

Min(S)

Linear regression considers all the possible lines and keeps the one that minimizes S; this is the best line (that is the ordinary least squares method).
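As an illustration, here is a minimal sketch (not part of the original slides, using the first rows of the table above) showing that the least squares line returned by np.polyfit gives a smaller S than any other candidate line:

import numpy as np

x = np.array([1.1, 1.3, 2.0, 2.2, 2.9, 3.0])
y = np.array([39343, 46205, 43525, 39891, 56642, 60150])

def S(b1, b0):
    # sum of the squared differences between observations and predictions
    return np.sum((y - (b0 + b1 * x)) ** 2)

b1_ols, b0_ols = np.polyfit(x, y, 1)   # least squares fit of a straight line
print(S(b1_ols, b0_ols))               # minimal value of S
print(S(b1_ols + 500, b0_ols))         # any other line gives a larger S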
Implementation

We reuse the data preprocessing template and adapt it to the Salary_Data.csv dataset (columns Experience and Salary).

Importing the Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset (we replace 'Data.csv' by 'Salary_Data.csv')

dataset=pd.read_csv('Salary_Data.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

Taking care of missing values (kept from the template; Salary_Data has no missing values, so this section is not used here; in recent scikit-learn versions, Imputer is replaced by SimpleImputer from sklearn.impute)

from sklearn.preprocessing import Imputer
imputer=Imputer(missing_values='NaN',strategy='mean',axis=0)
imputer=imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])

Encoding categorical data (kept from the template; Salary_Data has no categorical variables, so this section is not used here)

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,0]=labelEncoder_X.fit_transform(X[:,0])
onehotencoder=OneHotEncoder(categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
labelencoder_Y=LabelEncoder()
Y=labelencoder_Y.fit_transform(Y)

Splitting the dataset into Training set and Test set (we replace test_size=0.2 by test_size=1/3)

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=1/3,random_state=0)

Feature scaling

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

The Dependent Variable expresses itself as a linear combination of the Independent Variables, so even if the variables are not on the same scale, the quantities (coefficient × IV) will be on the same scale: feature scaling is not needed here.
Library: scikit-learn    Module: linear_model    Class: LinearRegression

from sklearn.linear_model import LinearRegression

Create an object 'regressor' of this class, without parameters because it is the simplest model:

regressor=LinearRegression()

Fit our object 'regressor' to the training set using the fit method:

regressor.fit(X_train,Y_train)

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,Y_train)

The ordinary least squares method runs to find the optimal coefficients, so that the regression line is closest to all points in terms of the minimum of the sum of the squared distances between the observation points and the prediction points.

We can give predictions

We can visualize the results with matplotlib


Predictions
The test set is composed of new observations.

We take our 'regressor' and we use a new method, 'predict':

y_pred=regressor.predict(X_test)

y_pred ≈ Y_test

We can also make new predictions on observations outside the test set, for example for 15 years of experience:

regressor.predict([[15]])    (the predict method expects a 2-D array, hence the double brackets)
Visualization

Comparison, on the same graph, of the observation points and the prediction line.

Visualize the result of the Simple Linear Regression and verify that our model is a good linear model.

We use the library matplotlib, the module pyplot and the function 'scatter':

plt.scatter(X_train,Y_train,color='red')

For the predictions, we would like to draw the line and not the points, so we don't use the function 'scatter' but the function 'plot':

plt.plot(X_train,regressor.predict(X_train),color='blue')
We can give the graph a title:

plt.title('Salary vs Experience')

We can add axis labels:

plt.xlabel('Experience')

plt.ylabel('Salary')

To display the graph:

plt.show()

plt.scatter(X_train,Y_train,color='red')
plt.plot(X_train,regressor.predict(X_train),color='blue')
plt.title('Salary vs Experience')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()
Errors of Prediction

from sklearn import metrics
MAE= metrics.mean_absolute_error(Y_test,y_pred)
MSE= metrics.mean_squared_error(Y_test,y_pred)
RMSE= metrics.mean_squared_error(Y_test,y_pred)**0.5

MAE = 3737.4178618788987
MSE = 23370078.800832972
RMSE = 4834.260936361728
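These three metrics follow directly from their definitions; here is a minimal sketch (not part of the original slides), reusing the Y_test and y_pred arrays obtained above:

import numpy as np

errors = Y_test - y_pred
MAE  = np.mean(np.abs(errors))    # mean absolute error
MSE  = np.mean(errors ** 2)       # mean squared error
RMSE = np.sqrt(MSE)               # root mean squared error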
Is it a significant model?

Criteria

C1: X is known without errors (X_real = X_measured).

C2: The error in question is related to the variable Y (Y_measured = Y_real + ε, where ε is the error).

C3: The error is independent of the variable X, i.e. V(ε) is constant and does not depend on X (property of homoscedasticity).

C4: There is an average linear relation between X and Y, i.e. knowing X = x, E(Y) = αx + β.
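A possible visual check of C3 (not part of the original slides): plot the residuals of the fitted regressor against the fitted values; under homoscedasticity they should show roughly constant spread with no clear pattern. This sketch assumes the regressor, X_train and Y_train defined above.

import matplotlib.pyplot as plt

fitted = regressor.predict(X_train)
residuals = Y_train - fitted
plt.scatter(fitted, residuals, color='red')    # residuals vs fitted values
plt.axhline(0, color='blue')                   # reference line at zero
plt.title('Residuals vs fitted values')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()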
Examples

The correlation coefficient is a measure of the linear relationship between two variables:

r(X,Y) = Cov(X,Y) / (σ_X σ_Y) = (E[XY] − E[X]E[Y]) / (σ_X σ_Y)

The values range between -1.0 and 1.0.

The correlation coefficient quantifies the strength of the linear relationship between X and Y.

A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship between the two variables.

x=dataset.iloc[:,0].values
r=(np.mean(x*Y)-np.mean(x)*np.mean(Y))/(np.std(x)*np.std(Y))
sigma=np.corrcoef(x,Y)

r(Experience, Salary) = 0.9782416184887603    Strong linear correlation

Y = b₀ + b₁X

b₁ = Cov(X,Y) / Var(X) = (mean(xY) − mean(x)·mean(Y)) / Var(x)    and    b₀ = Ȳ − b₁·x̄

b₁ = 9449.962321455081

b₀ = 25792.200198668666
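A minimal check of these formulas (a sketch, assuming x, Y, np and the fitted regressor from the previous slides):

b1 = (np.mean(x*Y) - np.mean(x)*np.mean(Y)) / np.var(x)   # Cov(X,Y)/Var(X)
b0 = np.mean(Y) - b1*np.mean(x)                           # Y-intercept
print(b1, b0)                                  # ≈ 9449.96 and ≈ 25792.20 on the full dataset
print(regressor.coef_, regressor.intercept_)   # fitted on the training set only, so slightly different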
Coefficient of Determination

The R-Squared (R²), also called the coefficient of determination, is a useful and important parameter in statistics.

What is the R² coefficient? How do we use the R² coefficient?

The ordinary least squares method minimizes the sum of squared errors, and the total variability decomposes as

SST = SSR + SSE

SST: Total Sum of Squares = Σᵢ (Yᵢ − Ȳ)²
SSR: Regression Sum of Squares = Σᵢ (Ŷᵢ − Ȳ)²
SSE: Error Sum of Squares = Σᵢ (Yᵢ − Ŷᵢ)²    (also called the Residual Sum of Squares)

Yᵢ: observed values of the dependent variable Y
Ȳ: average value of the dependent variable Y
Ŷᵢ: predicted value of Y given the value of X


SST > 0

If the SLR model is perfect, then Ŷᵢ = Yᵢ for all 1 ≤ i ≤ n, so SSE = 0 and SST = SSR.

R² = SSR/SST = 1 − SSE/SST

The SLR model is good when SSE ≈ 0, i.e. SST ≈ SSR, i.e. R² ≈ 1.

R² measures the quality of the SLR model compared to the average of the observations.

0 ≤ R² ≤ 1

from sklearn.metrics import r2_score

r2_score(Y, Y_pred) = 0.9569566641435086
2. Regression Models
2.2. Multiple Linear Regression (MLR)
In the original version of linear regression, we had a single feature X, the size of the house, and we wanted to use it to predict Y, the price of the house; this was the form of our hypothesis. But now imagine that we know not only the size of the house but also the number of bedrooms, the number of floors and the age of the home in years. It seems we would have more information with which to predict the price correctly. In this section we show how simple linear regression can be extended to accommodate multiple input features, and we discuss best practices for implementing this new model.
Dependent Variable (DV)        Independent Variables (IVs)

Simple Linear Regression:    Y = b₀ + b₁X          (b₀: constant, b₁: coefficient)

Multiple Linear Regression:  Y = b₀ + b₁X₁ + b₂X₂ + … + bₙXₙ          (b₀: constant, b₁, …, bₙ: coefficients)

Example: the Dependent Variable Salary can depend on several Independent Variables such as the Years of Experience, the Level of Study, the Years in the Company and the Number of Children.

b₀ is the value of Y when X₁ = X₂ = … = Xₙ = 0.

Grade of a student in an exam = b₀ + b₁ × (number of hours spent studying) + b₂ × (number of hours of sleep just before the exam) + b₃ × (number of revised chapters)
Warning

Assumptions of the Linear Regression

If for one of your projects you need to build a robust linear regression model, you have to check these five properties:

 Exogeneity: the IVs X are not dependent on the DV Y. This does not mean that there is no connection: since Y is the dependent variable, it still depends on the IVs X and on the error term.

 Lack of multicollinearity of the IVs.

 Homoscedasticity: the variance of the error term is the same across all values of the IVs.

 Multivariate normality of the errors.

 Independence of the errors.

Linear regression is not our last stop; it is just an intermediate step before we start creating powerful new models.
The DUMMY variables

    Profit       R&D Spend     Admin        Marketing    State
    192,261.83   165,349.201   136,897.80   471,784.10   New York
    191,792.06   162,597.70    151,377.59   443,898.53   California
    191,050.39   153,441.51    101,145.55   407,934.54   California
    182,901.99   144,372.41    118,671.85   383,199.62   New York
    166,187.94   142,107.34     91,391.77   366,168.42   California

The DV Y is the Profit; the Independent Variables X₁ (R&D Spend), X₂ (Admin), X₃ (Marketing) and X₄ (State) are used to predict it.

The categorical variable State is replaced by dummy variables, one per category:

    State        New York (D₁)   California (D₂)
    New York     1               0
    California   0               1
    California   0               1
    New York     1               0
    California   0               1

Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + b₄D₁ + b₅D₂ ?

Since D₁ + D₂ = 1, the two dummy variables are linearly dependent: always omit one dummy variable and keep, for example,

Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + b₄D₁
What happens if we introduce into our model the two dummy variables New York and California?

We duplicate a variable, which creates a problem of multicollinearity: the model cannot distinguish the effect of D₁ from the effect of D₂.

Mathematically, you cannot have the constant, D₁ and D₂ in the model at the same time.

That is the dummy variable trap: always omit one dummy variable.
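As an aside, here is a minimal sketch (not the course's code) of the same idea with pandas: get_dummies creates the dummy variables and drop_first=True drops one of them automatically, which avoids the dummy variable trap.

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'California', 'New York']})
dummies = pd.get_dummies(df['State'], drop_first=True)   # keeps only the 'New York' column
print(dummies)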
Building a Model (Step-By-Step)

[Diagram: the Dependent Variable Y surrounded by candidate Independent Variables X₁, X₂, X₃, X₄, X₅, …, X₇]
Why do we need to get rid of some IVs? Why don't we use them all in the model?

Garbage in, garbage out (GIGO): if you throw a lot of irrelevant variables into your model, it will not be reliable and may not do what it is supposed to do, or at least it is less likely to work correctly.

If you include all the independent variables in your model, you must explain the impact of each variable on Y, which is very difficult and expensive to do when the number of variables is large.

Keep only the most important variables, those that have the most impact on the Dependent Variable.
How to build a model

There are five methods of building models:

All-in

Backward Elimination

Forward Selection

Bidirectional Elimination

Score Comparison

(Backward Elimination, Forward Selection and Bidirectional Elimination are sometimes grouped under the name Stepwise Regression.)
All-in

Include all the independent variables you have in the multiple regression model. You do this when:

 Prior knowledge: you already have knowledge of your model;

 You have no choice but to build the model with all these variables (e.g. an expert opinion or a requirement);

 You are preparing for the Backward Elimination method.
The P-Value

In statistics, every conjecture concerning an unknown distribution, parameter, etc. is called a statistical hypothesis.

The methods of verifying a statistical hypothesis are called statistical tests. A statistical test is a decision: accept or reject the hypothesis (i.e., the aim of a statistical test is to verify whether this hypothesis is false).

The P-value is a statistical tool developed by the statistician Ronald Fisher (1890-1962), used to quantify statistical significance under a null hypothesis. The general idea is to show that the null hypothesis is not verified because, if it were true, the observed result would be highly improbable; it is therefore an extension of the principle of proof by contradiction. In statistics, a result is said to be statistically significant when the P-value is lower than α, the probability of rejecting the null hypothesis when it is true. The α level is usually 0.05, but it may vary by study and context.

The P-value is the probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis is true.
Backward Elimination

STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05).

STEP 2: Fit the full model with all possible predictors.

STEP 3: Consider the predictor with the highest P-value. If P > SL, go to STEP 4; otherwise go to FIN.

STEP 4: Remove that predictor.

STEP 5: Fit the model without this variable and go back to STEP 3.

FIN: Your model is ready.

Your model is ready when, in STEP 3, you cannot find an independent variable with a P-value higher than the significance level.
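A minimal sketch (not the official course code) of this procedure with statsmodels, assuming X already contains the column of ones for the constant:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, Y, SL=0.05):
    X_opt = np.array(X, dtype=float)
    while True:
        model = sm.OLS(Y, X_opt).fit()           # STEP 2 / STEP 5: fit the current model
        p_values = model.pvalues
        if p_values.max() <= SL:                 # STEP 3: no P-value above SL
            return X_opt, model                  # FIN: the model is ready
        worst = int(np.argmax(p_values))         # predictor with the highest P-value
        X_opt = np.delete(X_opt, worst, axis=1)  # STEP 4: remove it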
Forward Selection

STEP 1: Select a significance level to enter the model (e.g. SL = 0.05 = 5%).

STEP 2: Fit all simple regression models Y ~ Xᵢ and select the one with the lowest P-value.

STEP 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have.

STEP 4: Consider the predictor with the lowest P-value. If P < SL, go to STEP 3; otherwise go to FIN.

FIN: Keep the previous model.

Your model is ready when, in STEP 4, you cannot find an independent variable with a P-value less than the significance level. Our model is the previous one and not the last one.
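A minimal sketch of forward selection along the same lines (again not the official course code; column 0 of X is assumed to be the constant):

import numpy as np
import statsmodels.api as sm

def forward_selection(X, Y, SL=0.05):
    X = np.array(X, dtype=float)
    n, p = X.shape
    selected = [0]                          # start with the constant only
    while True:
        best_p, best_j = 1.0, None
        for j in range(1, p):               # try every predictor not yet selected
            if j in selected:
                continue
            model = sm.OLS(Y, X[:, selected + [j]]).fit()
            p_value = model.pvalues[-1]     # P-value of the newly added predictor
            if p_value < best_p:
                best_p, best_j = p_value, j
        if best_j is None or best_p >= SL:
            return selected                 # FIN: keep the previous model
        selected.append(best_j)             # keep the best new predictor and repeat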
Bidirectional Elimination (Stepwise Regression)

STEP 1: Select a significance level to enter and a significance level to stay in the model (e.g. SLENTER = 0.05 = 5% and SLSTAY = 0.05 = 5%).

STEP 2: Perform the next step of Forward Selection (a new variable must have P < SLENTER to enter).

STEP 3: Perform all steps of Backward Elimination (old variables must have P < SLSTAY to stay).

STEP 4: When no new variable can enter and no old variable can exit, go to FIN.

FIN: Your model is ready.

Your model is ready when you can no longer add a new independent variable in STEP 2 nor delete an old independent variable in STEP 3.
All Possible Models

STEP 1: Select a criterion of goodness of fit, e.g. the Akaike criterion AIC = 2k − 2·ln(L), where k is the number of parameters to estimate and L is the maximum of the likelihood function of the model.

STEP 2: Construct all possible linear regression models: 2ᴺ − 1 total combinations for N candidate variables.

STEP 3: Select the one with the best criterion.

FIN: Your model is ready. For example, 10 columns means 2¹⁰ − 1 = 1023 models.

It is the best method, but it is also the most resource-intensive one. If you have a lot of variables, this method is not recommended since it will take a lot of time and resources.
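A minimal sketch (not the course's code) of this exhaustive search with statsmodels, assuming column 0 of X is the constant:

from itertools import combinations
import numpy as np
import statsmodels.api as sm

def best_subset_by_aic(X, Y):
    X = np.array(X, dtype=float)
    p = X.shape[1]
    best_aic, best_cols = float('inf'), None
    for k in range(1, p):                          # subset sizes 1 .. p-1
        for cols in combinations(range(1, p), k):  # 2**(p-1) - 1 candidate models
            model = sm.OLS(Y, X[:, [0] + list(cols)]).fit()
            if model.aic < best_aic:               # statsmodels reports the AIC of each fit
                best_aic, best_cols = model.aic, cols
    return best_cols, best_aic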
Implementation

We will focus on the second method, the Backward Elimination method, because it is the fastest and most efficient one.

If the Dependent Variable Y is numerical, we use a Regression model; if it is categorical, we use a Classification model.

In this section, we deal with a business problem where we have to predict the profit of start-ups based on several available pieces of information. This information will be our independent variables. We have 50 observations.

Y: Profit
X₁: R&D Spend
X₂: Administration Spend
X₃: Marketing Spend
X₄: State (New York, California and Florida)

Choose the right working directory: it contains the file 50_Startups.csv.

Save your Python file in the same working directory.

Take care of the data preprocessing phase:

We go back to the file "data_preprocessing.py", take all the sections of code and run them one by one.

We import the libraries that we use every time: numpy, pandas and matplotlib.pyplot.

We change the dataset to '50_Startups.csv'.

We import the dataset and create the matrix of IVs X; the DV Y is the last column.

The 50_Startups dataset contains the columns Profit, R&D Spend, Administration, Marketing and State.

We imagine that we are working for a foundation that wonders which start-up to invest in; to help the investors choose, we will build a Multiple Linear Regression model that will be able to understand the correlations between these data (the IVs and the DV).

There is no missing data, so we delete the missing-data section from the code.

Management of the categorical data: State is composed of three categories.

We take care of the index of the State column: we replace 0 by 3 in the initial code.

It is a nominal categorical variable: there is no order relation between the different categories, so we create three dummy variables, one for each category.

LabelEncoder transforms the text into the numerical values 0, 1 and 2.

OneHotEncoder creates three dummy variables corresponding to New York, California and Florida.

The DV Y is numerical, not categorical, so we do not encode it.
Initial matrix X of IVs before transformation, and the State variable after transformation into dummy variables (California, Florida, New York):

    R&D Spend    Admin        Marketing    State       Cal  Fl  NY
    165,349.201  136,897.80   471,784.10   New York    0    0   1
    162,597.70   151,377.59   443,898.53   California  1    0   0
    153,441.51   101,145.55   407,934.54   Florida     0    1   0
    144,372.41   118,671.85   383,199.62   New York    0    0   1
    142,107.34    91,391.77   366,168.42   Florida     0    1   0
    131,876.9     99,814.71   362,861.36   New York    0    0   1
    134,615.46   147,198.87   127,716.82   California  1    0   0
    130,298.13   145,530.06   323,876.68   Florida     0    1   0
    120,542.52   148,718.95   311,613.29   New York    0    0   1
    123,334.88   108,679.17   304,981.62   California  1    0   0
    101,913.08   110,594.11   229,160.95   Florida     0    1   0

After encoding, the matrix X of IVs contains six columns: the three dummy variables corresponding to the State variable are the first three columns of X, followed by R&D Spend, Admin and Marketing.

California corresponds to the first column, Florida to the second column and New York to the third column.

We must avoid the dummy variable trap, so we have to remove one dummy variable. We add this line in the encoding section:

X=X[:,1:]

We remove the first column, i.e. we keep the columns of X from the second one onwards (we could have chosen to drop any one of the three dummy columns).

    Fl  NY   R&D Spend    Admin        Marketing
    0   1    165,349.201  136,897.80   471,784.10
    0   0    162,597.70   151,377.59   443,898.53
    1   0    153,441.51   101,145.55   407,934.54
    0   1    144,372.41   118,671.85   383,199.62
    1   0    142,107.34    91,391.77   366,168.42
    0   1    131,876.9     99,814.71   362,861.36
    0   0    134,615.46   147,198.87   127,716.82
    1   0    130,298.13   145,530.06   323,876.68
    0   1    120,542.52   148,718.95   311,613.29
    0   0    123,334.88   108,679.17   304,981.62
    1   0    101,913.08   110,594.11   229,160.95

After this removal, Florida is the first column and New York is the second column.
We create the training set and the test set: we choose 20% of the observations for the test set and 80% for the training set (i.e., 10 observations for the test set and 40 observations for the training set).

We build the model on the correlations in the 40 observations of the training set and we will make new predictions on the 10 observations of the test set.

The training set (40 observations) contains X_train and Y_train. The test set (10 observations) is composed of X_test and Y_test.

For the 'Feature Scaling' section, as in the Simple Linear Regression model, the coefficients can adapt their scale so that the products (coefficient × IV) are on the same scale.

We don't need the 'Feature Scaling' section.

The data preprocessing phase is completed.
1) Importing the Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2) Importing the Dataset

dataset=pd.read_csv('50_Startups.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,4].values

3) Encoding Categorical Data (the old OneHotEncoder(categorical_features=[3]) API is replaced by ColumnTransformer in recent scikit-learn versions)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,3]=labelEncoder_X.fit_transform(X[:,3])
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
X=X[:,1:]

4) Splitting the dataset into the Training Set and Test Set

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2, random_state=0)
In order to build the Multiple Linear Regression model, we only need to copy what we did in the Simple Linear Regression model for the part where we build the model. For the part where we make new predictions, we do almost the same work, up to a small modification.

From the class LinearRegression we create the object 'regressor', then we link the regressor to the training set (X_train and Y_train).

5) Fitting Multiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
In order to establish new predictions, we take the object ‘regressor’, then we use the method
predict which is applied to the test set
6) Predicting the Test set results

Y_pred = regressor.predict(X_test)

    Fl  NY   R&D        Admin      Marketing   Y_test     Y_pred
    1   0     66051.5    182646    118148      103282     103015
    0   0    100672       91790.6  249745      144259     132582
    1   0    101913      110594    229161      146122     132448
    1   0     27892.9     84710.8  164471       77798.8    71976.1
    1   0    153442      101146    407935      191050     178537
    0   1     72107.6    127865    353184      105008     116161
    0   1     20229.6     65947.9  185265       81229.1    67851.7
    0   1     61136.4    152702     88218.2     97483.6    98791.7
    1   0     73994.6    122783    303319      110352     113969
    1   0    142107       91391.8  366168      166188     167921

Y_pred ≈ Y_test: there seem to be good linear correlations between the Profit and the IVs. The Multiple Linear Regression model is significant.

The Mean Squared Error (MSE) assesses the quality of a predictor (the model).

The Root Mean Squared Error (RMSE) measures the differences between the values predicted by the model and the observed values. It is the square root of the MSE.

MSE = (1/n) Σᵢ (Yᵢ − Ŷᵢ)²

7) Model Evaluation

from sklearn.metrics import mean_squared_error
MSE=mean_squared_error(Y_test, Y_pred)
RMSE=np.sqrt(MSE)

RMSE = 9137.990152794797    The model is significant.

How to make a new prediction?

For the SLR model we predicted from a single value; here we have several Independent Variables, so we will make a change.

We have to enter (in order) the values of the Independent Variables corresponding to the information of the new start-up.

Take R&D Spend = 130,000, Admin = 140,000, Marketing = 300,000 and State = New York.

To enter the state, we are not going to type New York as a string of characters, because the predict method expects something like X_test (i.e., the state variable is not in a single column but in dummy variables, so you have to enter the 0/1 combination corresponding to New York).

After removing the first dummy column, the remaining dummy columns are Florida and New York, so New York corresponds to 0 (Florida) and 1 (New York).

regressor.predict(0,1,130000,140000,300000)
It doesn't work.

When we have several Independent Variables, we must enter the new information in tabular form, because this sequence of numbers (0,1,130000,140000,300000) has no meaning for Python: the method expects the new information as a row vector. We use the NumPy library (np).

regressor.predict(np.array(0,1,130000,140000,300000))
It doesn't work.

regressor.predict(np.array([[0,1,130000,140000,300000]]))
≈ 158,691.75

The information of the new observation is a row vector (a 2-D array with one row), not a column vector.
Coefficient of Determination: Adjusted R²

Does adding a new variable improve the model or not?

R² = SSR/SST = 1 − SSE/SST

Compare the R² of the model Y = b₀ + b₁X₁ + b₂X₂ with the R² obtained after adding a third variable: Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃.

Example: b₀ is the initial salary, X₁ the years of experience in the company, X₂ the years of study and X₃ the years of experience before joining the company.

The ordinary least squares method minimizes SSE, so R², which measures the quality of the prediction, never decreases when a variable is added.


If you add a new variable to your model, it will influence it in some way: either adding a third variable (for example) helps to minimize SSE (i.e., the Multiple Linear Regression model will find a coefficient that further minimizes SSE), or SSE remains the same and does not change (i.e., the model will take the associated coefficient ≈ 0). So R² will either increase or stay (almost) the same.

Take the example of the salary of an employee in a company, where X₁ is the years of experience, X₂ is the number of years of study and X₃ is the last digit of their phone number.

Even if the variable X₃ has nothing to do with the salary, there will be a slight correlation with it, due to chance, and the model will take it into account. Since the real correlation is essentially zero, the coefficient associated with X₃ will be almost zero; after adding X₃, the model will either minimize SSE a little more or keep its value.

By adding a new variable, R² will still increase, even if only very slightly. That is why we need a new measure that takes into account the number of independent variables and measures the quality of the model: the adjusted R-squared, denoted Adj R².
Penalty Coefficient

R² = SSR/SST = 1 − SSE/SST

Adj R² = 1 − (1 − R²) × (n − 1)/(n − p − 1)

p: the number of independent variables
n: the size of the sample

Sum of Squares    Definition           Notation   Degrees of Freedom
Total             Σᵢ (Yᵢ − Ȳ)²        SST        n − 1
Regression        Σᵢ (Ŷᵢ − Ȳ)²        SSR        p
Error             Σᵢ (Yᵢ − Ŷᵢ)²       SSE        n − p − 1

Constraint: (n − 1) = p + (n − p − 1), just as SST = SSR + SSE.
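As a check of the formula, with the 50_Startups model built later in this section (n = 50 observations, p = 5 independent variables) and R² = 0.94852, we get Adj R² = 1 − (1 − 0.94852) × 49/44 ≈ 0.9427, which is exactly the value reported below.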
The Adj R² penalizes you when you add a variable that is not correlated with Y.

The number of independent variables appears in the denominator of the penalty factor (n − 1)/(n − p − 1): when you add an independent variable, this factor increases, so the quantity (1 − R²) × (n − 1)/(n − p − 1) tends to increase and the Adj R² decreases.

On the other hand, when the number of independent variables increases, R² increases, so the quantity (1 − R²) decreases and pushes the Adj R² up.

The penalty factor counterbalances this increase by reducing the Adj R².

If you add a regressor that really improves your model (strongly correlated with Y), then R² increases considerably and the Adj R² increases as well, even though the penalty factor also grows.

Adj R² is an excellent metric. It is a very powerful tool to measure the quality of your model.
Encoding the Independent Variable

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,3]=labelEncoder_X.fit_transform(X[:,3])
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
X=X[:,1:]

Fitting Multiple Linear Regression (here on the whole dataset)

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X,Y)

Predicting and computing R² and Adj R²

Y_pred=regressor.predict(X)
from sklearn.metrics import r2_score
r2=r2_score(Y, Y_pred)
SSE=sum((Y-Y_pred)**2)
SST=sum((Y-np.mean(Y))**2)
r_squared=1-SSE/SST
adjusted_r_squared=1-(1-r_squared)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
Adj_r2=1-(1-r2)*((len(Y)-1)/(len(Y)-X.shape[1]-1))

R² = 0.9485223547171563
Adj R² = 0.9426726222986513
We now add a new variable S that has nothing to do with the profit:

S=np.array([1,2,1,0,1,2,1,0,1,2,1,0,1,2,1,2,1,2,1,0,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,0,1,2,1,2,1,2,1,2,1,2,1,2,1,0])

HT=np.corrcoef(S,Y)[0,1]    # HT = -0.09997243: weak correlation between S and Y

U=np.c_[X,S]

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(U,Y)
Y_pred=regressor.predict(U)
from sklearn.metrics import r2_score
r2=r2_score(Y, Y_pred)
n, p = 50, 6
adjr2=1-(1-r2)*((n-1)/(n-p-1))

Without S:   R² = 0.9485223547171563    Adj R² = 0.9426726222986513
With S:      R² = 0.9507757894286009    Adj R² = 0.9401821291363964

R² increases slightly, but Adj R² decreases: the variable S does not improve the model.
Coefficients of the MLR and Evaluation of the model

Y = b₀ + b₁X₁ + … + b₅X₅

Constant=regressor.intercept_
Coefficients=regressor.coef_

    Coefficient   Value
    b₀            42554.16761772438
    b₁            -9.59284160e+02
    b₂             6.99369053e+02
    b₃             7.73467193e-01
    b₄             3.28845975e-02
    b₅             3.66100259e-02

from sklearn import metrics
MAE= metrics.mean_absolute_error(Y_test,Y_pred)
MSE= metrics.mean_squared_error(Y_test,Y_pred)
RMSE= metrics.mean_squared_error(Y_test,Y_pred)**0.5

MSE = 83502864.03257468
RMSE = 9137.990152794797
Matrix form of the Multiple Linear Regression model

Y = b₀ + b₁X₁ + … + bₚXₚ + ε

For each observation i:
Y(i) = b₀ + b₁X₁(i) + … + bₚXₚ(i) + ε(i)

Stacking the n observations, Y = AB + ε, where Y is the vector of observations, A is the matrix whose i-th row is (1, X₁(i), …, Xₚ(i)), B = (b₀, b₁, …, bₚ)ᵀ is the vector of coefficients and ε is the vector of errors. The ordinary least squares estimator is B̂ = (AᵀA)⁻¹AᵀY.

Example: with the observations (X₁, X₂, Y) = (-1,-1,-4), (-1,1,2), (1,-1,0), (1,1,2):

A = | 1  -1  -1 |        Y = | -4 |
    | 1  -1   1 |            |  2 |
    | 1   1  -1 |            |  0 |
    | 1   1   1 |            |  2 |

AᵀA = | 4  0  0 |        AᵀY = | 0 |
      | 0  4  0 |              | 4 |
      | 0  0  4 |              | 8 |

B̂ = (AᵀA)⁻¹AᵀY = (0, 1, 2)ᵀ, so the fitted model is Ŷ = X₁ + 2X₂.
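A quick NumPy check of this worked example (a minimal sketch, not part of the original slides):

import numpy as np

A = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1]])
Y = np.array([-4, 2, 0, 2])
B = np.linalg.inv(A.T @ A) @ A.T @ Y   # least squares estimator (A^T A)^(-1) A^T Y
print(A.T @ A)                         # 4 times the identity matrix
print(B)                               # [0. 1. 2.]  ->  Y_hat = X1 + 2*X2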
Template Multiple Linear Regression

1) Importing the Libraries
import numpy as np
import pandas as pd

2) Importing the Dataset
dataset=pd.read_csv('50_Startups.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,4].values

3) Encoding Categorical Data
3.1) Encoding the Independent Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,3]=labelEncoder_X.fit_transform(X[:,3])
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
3.2) Avoiding the Dummy Variable Trap
X=X[:,1:]

4) Splitting the dataset into the Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

5) Fitting the MLR Model to the Training Set
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,Y_train)

6) Predicting the Test Set Results
Y_pred=regressor.predict(X_test)
The Backward Elimination method

To display the full matrix X in the console:

np.set_printoptions(threshold=np.nan)    (in recent NumPy versions, use threshold=sys.maxsize instead)

The matrix X now starts with the two dummy columns followed by R&D Spend, Admin and Marketing, for example:

    0  1  165,349.201  136,897.80  471,784.10
    0  0  162,597.70   151,377.59  443,898.53
    1  0  153,441.51   101,145.55  407,934.54
    0  1  144,372.41   118,671.85  383,199.62
    .  .  .            .           .
    1  0  1315.46      115816      297114
    0  0  0            135427      0
    0  1  542.05       51743.2     0
    0  0  0            116984      45173.1
We have implemented the Multiple Linear Regression model and fitted it to the training set.

Do we have an optimal model?

No: when we built the model, we used all the Independent Variables.

What happens if, among these variables, some are highly statistically significant (that is, variables that have a great impact on the Dependent Variable Profit) and some others have no influence on the Dependent Variable?

If we remove these latter variables (without influence), we will certainly have a much more significant model.

We want to find an optimal team of IVs such that each one has a great impact on the DV. This effect can be positive (i.e., when the variable increases, the profit Y increases) or negative (i.e., when it increases, the DV Y decreases).
For the Backward Elimination, we need to import the statsmodels library:

import statsmodels.api as sm

Multiple Linear Regression:   Y = b₀ + b₁X₁ + … + bₙXₙ

The constant b₀ is not associated with any Independent Variable, and statsmodels does not add it automatically, so we need to add a column vector of ones (with 50 lines) to the matrix of features X. The equation then becomes

Y = b₀X₀ + b₁X₁ + … + bₙXₙ,   with X₀ = 1

X=np.append(arr=X,values=np.ones((50,1)).astype(int), axis=1)

(without .astype(int), a data type error can occur)

This adds the column at the end of the matrix X, because axis=1 appends a column (with axis=0 we would append a line). In our case, we need to add the column at the beginning of the matrix X, so we swap the arguments:

X=np.append(arr=np.ones((50,1)).astype(int), values=X,axis=1)

The Backward Elimination consists of including all the Independent Variables at first and then removing, one by one, the Independent Variables that are not statistically significant.
We start by taking all the lines and all the columns:

X_opt=X[:,[0,1,2,3,4,5]]

In STEP 2 of the Backward Elimination, we need to fit the full model. Since we have introduced a new library, we fit X_opt with a new regressor related to that library: we create a 'regressor_OLS' object of the OLS class.

There are two parameters for the OLS class: endog for the Dependent Variable and exog for the Independent Variables X_opt.

regressor_OLS=sm.OLS(endog=Y,exog=X_opt).fit()

which is equivalent to

regressor_OLS=sm.OLS(Y,X_opt).fit()

STEP 2 of the Backward Elimination is done. Now, we look at the P-values of the IVs:

regressor_OLS.summary()

This command gives a table of the statistical information of the model (IVs, R-squared, Adjusted R-squared, AIC, BIC, P-values, …).

The constant always has a near-zero P-value here.

We look for the highest P-value. If it is above the significance level 0.05, we go to STEP 4: we remove the associated Independent Variable and then fit the model without this variable.

We repeat this action until the highest P-value is no longer above the significance level 0.05; then the model is ready.
6) Building the Optimal Model Using Backward Elimination

import statsmodels.api as sm

X=np.append(arr=np.ones((50,1)).astype(int), values=X,axis=1)

X_opt=X[:,[0,1,2,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()

The matrix X_opt now looks like:

    1  0  1  165349   136898    471784
    1  0  0  162598   151378    443899
    1  1  0  153442   101146    407935
    1  0  1  144372   118672    383200
    1  1  0  142107   91391.8   366168
    1  0  1  131877   99814.7   362861
    .  .  .  .        .         .
    1  0  1  542.05   51743.2   0
    1  0  0  0        116984    45173.1
OLS Regression Results
R-squared: 0.951
Adj. R-squared: 0.945
F-statistic: 169.9
Prob (F-statistic): 1.34e-27
Log-Likelihood: -525.38
AIC: 1063.
BIC: 1074.

Coef std err t P>|t| [0.025 0.975]


-----------------------------------------------------------------------------------------
Const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
X1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
X2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
X3 0.8060 0.046 17.369 0.000 0.712 0.900
X4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
X5 0.0270 0.017 1.574 0.123 -0.008 0.062
X_opt=X[:,[0,1,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()

OLS Regression Results


R-squared: 0.951
Adj. R-squared: 0.946
F-statistic: 217.2
Prob (F-statistic): 8.49e-29
Log-Likelihood: -525.38
AIC: 1061.
BIC: 1070.
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Const 5.011e+04 6647.870 7.537 0.000 3.67e+04 6.35e+04
X1 220.1585 2900.536 0.076 0.940 -5621.821 6062.138
X2 0.8060 0.046 17.606 0.000 0.714 0.898
X3 -0.0270 0.052 -0.523 0.604 -0.131 0.077
X4 0.0270 0.017 1.592 0.118 -0.007 0.061
X_opt=X[:,[0,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
OLS Regression Results
R-squared: 0.951
Adj. R-squared: 0.948
F-statistic: 296.0
Prob (F-statistic): 4.53e-30
Log-Likelihood: -525.39
AIC: 1059
BIC: 1066

coef std err t P>|t| [0.025 0.975]


------------------------------------------------------------------------------
Const 5.012e+04 6572.353 7.626 0.000 3.69e+04 6.34e+04
X1 0.8057 0.045 17.846 0.000 0.715 0.897
X2 -0.0268 0.051 -0.526 0.602 -0.130 0.076
X3 0.0272 0.016 1.655 0.105 -0.006 0.060
X_opt=X[:,[0,3,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
OLS Regression Results
R-squared: 0.950
Adj. R-squared: 0.948
F-statistic: 450.8
Prob (F-statistic): 2.16e-31
Log-Likelihood: -525.54
AIC: 1057.
BIC: 1063.

coef std err t P>|t| [0.025 0.975]


------------------------------------------------------------------------------
Const 4.698e+04 2689.933 17.464 0.000 4.16e+04 5.24e+04
X1 0.7966 0.041 19.266 0.000 0.713 0.880
X2 0.0299 0.016 1.927 0.060 -0.001 0.061
X_opt=X[:,[0,3]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()

OLS Regression Results


R-squared: 0.947
Adj. R-squared: 0.945
F-statistic: 849.8
Prob (F-statistic): 3.50e-32
Log-Likelihood: -527.44
AIC: 1059.
BIC: 1063.

coef std err t P>|t| [0.025 0.975]


------------------------------------------------------------------------------
Const 4.903e+04 2537.897 19.320 0.000 4.39e+04 5.41e+04
X1 0.8543 0.029 29.151 0.000 0.795 0.913
6) Building the Optimal Model Using Backward Elimination

import statsmodels.api as sm
X=np.append(arr=np.ones((50,1)).astype(int), values=X,axis=1)
X_opt=X[:,[0,1,2,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,1,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,3,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,3]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
