
Data Science

Mahdi Louati
3 GLID
October, 11th 2022
Contents
0. Welcome to Machine Learning
0.1. Why Machine Learning is the Future
0.2. What is Machine Learning
0.3. Installing Python and Anaconda

1. Data Preprocessing
1.1. Importing the Libraries
1.2. Importing the Dataset
1.3. Missing Data
1.4. Categorical Data
1.5. Training Set and Test Set
1.6. Feature Scaling

2. Regression Models
2.1. Simple Linear Regression (SLR)
2.2. Multiple Linear Regression (MLR)
2.3. Polynomial Regression
2.4. Support Vector Regression (SVR)
2.5. Decision Tree Regression
2.6. Random Forest Regression
2.7. Evaluating Regression Models

3. Classification Models
3.1. Logistic Regression
3.2. K-Nearest Neighbors
3.3. Support Vector Machine (SVM)
3.4. Kernel SVM
3.5. Naïve Bayes
3.6. Decision Tree Classification

4. Clustering
4.1. K-Means Clustering
4.2. Hierarchical Clustering

5. Dimensionality Reduction
5.1. Principal Component Analysis (PCA)
5.2. Linear Discriminant Analysis (LDA)
5.3. Kernel PCA

6. Reinforcement Learning
6.1. Upper Confidence Bound (UCB)
6.2. Thompson Sampling

7. Natural Language Processing (NLP)

8. Deep Learning
8.1. Artificial Neural Networks
8.2. Convolutional Neural Networks
Section 2

02 Regression Models
2.1. Simple Linear Regression (SLR)
2.2. Multiple Linear Regression (MLR)
2.3. Polynomial Regression
2.4. Support Vector Regression (SVR)
2.5. Decision Tree Regression
Regression models are used to predict a continuous real value, such as a salary, the price of an apartment, etc. If your independent variable is time, then your model predicts future values; otherwise, your model predicts unknown values.
2. Regression Models
2.1. Simple Linear Regression (SLR)
Linear regression is the simplest and most preferred predictive model when it works, because it is easily interpretable, unlike nonlinear models that keep their secrets if we stick to their coefficients.

We will discover the basics of the Simple Linear Regression (SLR) model.

In the SLR model, we predict the value of one variable Y based on another variable X.

X is called the Independent Variable (IV) and Y is the Dependent Variable (DV).

Why is it called simple? Because it examines the relationship between only two variables.

Why is it called linear? Because the relationship between X and Y is modeled by a straight line.
The Salary_Data.csv dataset (years of experience and salary):

    Experience   Salary
0   1.1          39343
1   1.3          46205
2   2.0          43525
3   2.2          39891
4   2.9          56642
5   3.0          60150
6   3.7          57189
7   4.5          61111
8   5.1          66029
9   6.0          93940
10  7.1          98273
11  9.0          1055
12  9.5          11696

Importing the dataset:

import pandas as pd
dataset=pd.read_csv('Salary_Data.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

Visualizing the data points:

import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Salary_Data.csv')
X=dataset.iloc[:,:-1]
Y=dataset.iloc[:,-1]
plt.scatter(X,Y,color='red')
plt.title('Salary vs Experience')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.show()
To obtain a relationship between the years of experience and the salary, i.e. a mathematical model, we need to find the straight line that best represents the data points.

Dependent Variable (DV)        Independent Variable (IV)

Simple Linear Regression:   Y = b₀ + b₁X

b₀ is the constant and b₁ is the coefficient: b₁ is the slope of the line and b₀ is the Y-intercept. Once the coefficients b₀ and b₁ are determined, you have obtained the SLR model.

Salary = b₀ + b₁ × Experience
Grade of a student = b₀ + b₁ × Number of working hours

There is a correlation between the IV X and the DV Y.
b₀ is the value of Y when X = 0.
The Experience is on the x-axis and the Salary is on the y-axis.

Salary = b₀ + b₁ × Experience

The coefficient b₁ is the slope of the line.

How does the salary evolve according to the number of years of experience?

The SLR model is used to model the salary progression as a function of the number of years of work. It is based on real observations that we want to integrate into our model.

We look for the line closest to the observations.


Ordinary Least Squares Method

How does the SLR model find the best line that approaches all the observations?

S = Σᵢ (Yᵢ − Ŷᵢ)²    where Yᵢ is the observed value and Ŷᵢ = b₀ + b₁Xᵢ is the predicted value

Min(S)

Linear regression considers all the possible lines and keeps the one that minimizes S; this is the best line (that is the ordinary least squares method).
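As an illustration, here is a minimal sketch (not part of the original slides, using the first rows of the table above) showing that the least squares line returned by np.polyfit gives a smaller S than any other candidate line:

import numpy as np

x = np.array([1.1, 1.3, 2.0, 2.2, 2.9, 3.0])
y = np.array([39343, 46205, 43525, 39891, 56642, 60150])

def S(b1, b0):
    # sum of the squared differences between observations and predictions
    return np.sum((y - (b0 + b1 * x)) ** 2)

b1_ols, b0_ols = np.polyfit(x, y, 1)   # least squares fit of a straight line
print(S(b1_ols, b0_ols))               # minimal value of S
print(S(b1_ols + 500, b0_ols))         # any other line gives a larger S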
Implementation

We reuse the data preprocessing template and adapt it to the Salary_Data.csv dataset (columns Experience and Salary).

Importing the Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset (we replace 'Data.csv' by 'Salary_Data.csv')

dataset=pd.read_csv('Salary_Data.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

Taking care of missing values (kept from the template; Salary_Data has no missing values, so this section is not used here; in recent scikit-learn versions, Imputer is replaced by SimpleImputer from sklearn.impute)

from sklearn.preprocessing import Imputer
imputer=Imputer(missing_values='NaN',strategy='mean',axis=0)
imputer=imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])

Encoding categorical data (kept from the template; Salary_Data has no categorical variables, so this section is not used here)

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,0]=labelEncoder_X.fit_transform(X[:,0])
onehotencoder=OneHotEncoder(categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
labelencoder_Y=LabelEncoder()
Y=labelencoder_Y.fit_transform(Y)

Splitting the dataset into Training set and Test set (we replace test_size=0.2 by test_size=1/3)

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=1/3,random_state=0)

Feature scaling

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

The Dependent Variable expresses itself as a linear combination of the Independent Variables, so even if the variables are not on the same scale, the quantities (coefficient × IV) will be on the same scale: feature scaling is not needed here.
Library: scikit-learn    Module: linear_model    Class: LinearRegression

from sklearn.linear_model import LinearRegression

Create an object 'regressor' of this class, without parameters because it is the simplest model:

regressor=LinearRegression()

Fit our object 'regressor' to the training set using the fit method:

regressor.fit(X_train,Y_train)

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,Y_train)

The ordinary least squares method runs to find the optimal coefficients, so that the regression line is closest to all points in terms of the minimum of the sum of the squared distances between the observation points and the prediction points.

We can give predictions

We can visualize the results with matplotlib


Predictions
The test set is composed of new observations.

We take our 'regressor' and we use a new method, 'predict':

y_pred=regressor.predict(X_test)

y_pred ≈ Y_test

We can also make new predictions on observations outside the test set, for example for 15 years of experience:

regressor.predict([[15]])    (the predict method expects a 2-D array, hence the double brackets)
Visualization

Comparison, on the same graph, of the observation points and the prediction line.

Visualize the result of the Simple Linear Regression and verify that our model is a good linear model.

We use the library matplotlib, the module pyplot and the function 'scatter':

plt.scatter(X_train,Y_train,color='red')

For the predictions, we would like to draw the line and not the points, so we don't use the function 'scatter' but the function 'plot':

plt.plot(X_train,regressor.predict(X_train),color='blue')
We can give the graph a title:

plt.title('Salary vs Experience')

We can add axis labels:

plt.xlabel('Experience')

plt.ylabel('Salary')

To display the graph:

plt.show()

plt.scatter(X_train,Y_train,color='red')
plt.plot(X_train,regressor.predict(X_train),color='blue')
plt.title('Salary vs Experience')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()
Errors of Prediction

from sklearn import metrics
MAE= metrics.mean_absolute_error(Y_test,y_pred)
MSE= metrics.mean_squared_error(Y_test,y_pred)
RMSE= metrics.mean_squared_error(Y_test,y_pred)**0.5

MAE = 3737.4178618788987
MSE = 23370078.800832972
RMSE = 4834.260936361728
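These three metrics follow directly from their definitions; here is a minimal sketch (not part of the original slides), reusing the Y_test and y_pred arrays obtained above:

import numpy as np

errors = Y_test - y_pred
MAE  = np.mean(np.abs(errors))    # mean absolute error
MSE  = np.mean(errors ** 2)       # mean squared error
RMSE = np.sqrt(MSE)               # root mean squared error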
Is it a significant model?

Criteria

C1: X is known without errors (X_real = X_measured).

C2: The error in question is related to the variable Y (Y_measured = Y_real + ε, where ε is the error).

C3: The error is independent of the variable X, i.e. V(ε) is constant and does not depend on X (property of homoscedasticity).

C4: There is an average linear relation between X and Y, i.e. knowing X = x, E(Y) = αx + β.
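A possible visual check of C3 (not part of the original slides): plot the residuals of the fitted regressor against the fitted values; under homoscedasticity they should show roughly constant spread with no clear pattern. This sketch assumes the regressor, X_train and Y_train defined above.

import matplotlib.pyplot as plt

fitted = regressor.predict(X_train)
residuals = Y_train - fitted
plt.scatter(fitted, residuals, color='red')    # residuals vs fitted values
plt.axhline(0, color='blue')                   # reference line at zero
plt.title('Residuals vs fitted values')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()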
Examples

The correlation coefficient is a measure of the linear relationship between two variables:

r(X,Y) = Cov(X,Y) / (σ_X σ_Y) = (E[XY] − E[X]E[Y]) / (σ_X σ_Y)

The values range between -1.0 and 1.0.

The correlation coefficient quantifies the strength of the linear relationship between X and Y.

A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship between the two variables.

x=dataset.iloc[:,0].values
r=(np.mean(x*Y)-np.mean(x)*np.mean(Y))/(np.std(x)*np.std(Y))
sigma=np.corrcoef(x,Y)

r(Experience, Salary) = 0.9782416184887603    Strong linear correlation

Y = b₀ + b₁X

b₁ = Cov(X,Y) / Var(X) = (mean(xY) − mean(x)·mean(Y)) / Var(x)    and    b₀ = Ȳ − b₁·x̄

b₁ = 9449.962321455081

b₀ = 25792.200198668666
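A minimal check of these formulas (a sketch, assuming x, Y, np and the fitted regressor from the previous slides):

b1 = (np.mean(x*Y) - np.mean(x)*np.mean(Y)) / np.var(x)   # Cov(X,Y)/Var(X)
b0 = np.mean(Y) - b1*np.mean(x)                           # Y-intercept
print(b1, b0)                                  # ≈ 9449.96 and ≈ 25792.20 on the full dataset
print(regressor.coef_, regressor.intercept_)   # fitted on the training set only, so slightly different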
Coefficient of Determination

The R-Squared (R²), also called the coefficient of determination, is a useful and important parameter in statistics.

What is the R² coefficient? How do we use the R² coefficient?

The ordinary least squares method minimizes the sum of squared errors, and the total variability decomposes as

SST = SSR + SSE

SST: Total Sum of Squares = Σᵢ (Yᵢ − Ȳ)²
SSR: Regression Sum of Squares = Σᵢ (Ŷᵢ − Ȳ)²
SSE: Error Sum of Squares = Σᵢ (Yᵢ − Ŷᵢ)²    (also called the Residual Sum of Squares)

Yᵢ: observed values of the dependent variable Y
Ȳ: average value of the dependent variable Y
Ŷᵢ: predicted value of Y given the value of X


SST > 0

If the SLR model is perfect, then Ŷᵢ = Yᵢ for all 1 ≤ i ≤ n, so SSE = 0 and SST = SSR.

R² = SSR/SST = 1 − SSE/SST

The SLR model is good when SSE ≈ 0, i.e. SST ≈ SSR, i.e. R² ≈ 1.

R² measures the quality of the SLR model compared to the average of the observations.

0 ≤ R² ≤ 1

from sklearn.metrics import r2_score

r2_score(Y, Y_pred) = 0.9569566641435086
2. Regression Models
2.2. Multiple Linear Regression (MLR)
In the original version of linear regression, we had a single feature X, the size of the house, and we wanted to use it to predict Y, the price of the house; this was the form of our hypothesis. But now imagine that we know not only the size of the house but also the number of bedrooms, the number of floors and the age of the home in years. It seems we would have more information with which to predict the price correctly. In this section we show how simple linear regression can be extended to accommodate multiple input features, and we discuss best practices for implementing this new model.
Dependent Variable (DV)        Independent Variables (IVs)

Simple Linear Regression:    Y = b₀ + b₁X          (b₀: constant, b₁: coefficient)

Multiple Linear Regression:  Y = b₀ + b₁X₁ + b₂X₂ + … + bₙXₙ          (b₀: constant, b₁, …, bₙ: coefficients)

Example: the Dependent Variable Salary can depend on several Independent Variables such as the Years of Experience, the Level of Study, the Years in the Company and the Number of Children.

b₀ is the value of Y when X₁ = X₂ = … = Xₙ = 0.

Grade of a student in an exam = b₀ + b₁ × (number of hours spent studying) + b₂ × (number of hours of sleep just before the exam) + b₃ × (number of revised chapters)
Warning

Assumptions of the Linear Regression

If for one of your projects you need to build a robust linear regression model, you have to check these five properties:

 Exogeneity: the IVs X are not dependent on the DV Y. This does not mean that there is no connection: since Y is the dependent variable, it still depends on the IVs X and on the error term.

 Lack of multicollinearity of the IVs.

 Homoscedasticity: the variance of the error term is the same across all values of the IVs.

 Multivariate normality of the errors.

 Independence of the errors.

Linear regression is not our last stop; it is just an intermediate step before we start creating powerful new models.
The DUMMY variables

    Profit       R&D Spend     Admin        Marketing    State
    192,261.83   165,349.201   136,897.80   471,784.10   New York
    191,792.06   162,597.70    151,377.59   443,898.53   California
    191,050.39   153,441.51    101,145.55   407,934.54   California
    182,901.99   144,372.41    118,671.85   383,199.62   New York
    166,187.94   142,107.34     91,391.77   366,168.42   California

The DV Y is the Profit; the Independent Variables X₁ (R&D Spend), X₂ (Admin), X₃ (Marketing) and X₄ (State) are used to predict it.

The categorical variable State is replaced by dummy variables, one per category:

    State        New York (D₁)   California (D₂)
    New York     1               0
    California   0               1
    California   0               1
    New York     1               0
    California   0               1

Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + b₄D₁ + b₅D₂ ?

Since D₁ + D₂ = 1, the two dummy variables are linearly dependent: always omit one dummy variable and keep, for example,

Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + b₄D₁
What happens if we introduce into our model the two dummy variables New York and California?

We duplicate a variable, which creates a problem of multicollinearity: the model cannot distinguish the effect of D₁ from the effect of D₂.

Mathematically, you cannot have the constant, D₁ and D₂ in the model at the same time.

That is the dummy variable trap: always omit one dummy variable.
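As an aside, here is a minimal sketch (not the course's code) of the same idea with pandas: get_dummies creates the dummy variables and drop_first=True drops one of them automatically, which avoids the dummy variable trap.

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'California', 'New York']})
dummies = pd.get_dummies(df['State'], drop_first=True)   # keeps only the 'New York' column
print(dummies)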
Building a Model (Step-By-Step)

[Diagram: the Dependent Variable Y surrounded by candidate Independent Variables X₁, X₂, X₃, X₄, X₅, …, X₇]
Why do we need to get rid of some IVs? Why don't we use them all in the model?

Garbage in, garbage out (GIGO): if you throw a lot of irrelevant variables into your model, it will not be reliable and may not do what it is supposed to do, or at least it is less likely to work correctly.

If you include all the independent variables in your model, you must explain the impact of each variable on Y, which is very difficult and expensive to do when the number of variables is large.

Keep only the most important variables, those that have the most impact on the Dependent Variable.
How to build a model

There are five methods of building models:

All-in

Backward Elimination

Forward Selection

Bidirectional Elimination

Score Comparison

(Backward Elimination, Forward Selection and Bidirectional Elimination are sometimes grouped under the name Stepwise Regression.)
All-in

Include all the independent variables you have in the multiple regression model. You do this when:

 Prior knowledge: you already have knowledge of your model;

 You have no choice but to build the model with all these variables (e.g. an expert opinion or a requirement);

 You are preparing for the Backward Elimination method.
The P-Value

In statistics, every conjecture concerning an unknown distribution, parameter, etc. is called a statistical hypothesis.

The methods of verifying a statistical hypothesis are called statistical tests. A statistical test is a decision: accept or reject the hypothesis (i.e., the aim of a statistical test is to verify whether this hypothesis is false).

The P-value is a statistical tool developed by the statistician Ronald Fisher (1890-1962), used to quantify statistical significance under a null hypothesis. The general idea is to show that the null hypothesis is not verified because, if it were true, the observed result would be highly improbable; it is therefore an extension of the principle of proof by contradiction. In statistics, a result is said to be statistically significant when the P-value is lower than α, the probability of rejecting the null hypothesis when it is true. The α level is usually 0.05, but it may vary by study and context.

The P-value is the probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis is true.
Backward Elimination

STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05).

STEP 2: Fit the full model with all possible predictors.

STEP 3: Consider the predictor with the highest P-value. If P > SL, go to STEP 4; otherwise go to FIN.

STEP 4: Remove that predictor.

STEP 5: Fit the model without this variable and go back to STEP 3.

FIN: Your model is ready.

Your model is ready when, in STEP 3, you cannot find an independent variable with a P-value higher than the significance level.
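A minimal sketch (not the official course code) of this procedure with statsmodels, assuming X already contains the column of ones for the constant:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, Y, SL=0.05):
    X_opt = np.array(X, dtype=float)
    while True:
        model = sm.OLS(Y, X_opt).fit()           # STEP 2 / STEP 5: fit the current model
        p_values = model.pvalues
        if p_values.max() <= SL:                 # STEP 3: no P-value above SL
            return X_opt, model                  # FIN: the model is ready
        worst = int(np.argmax(p_values))         # predictor with the highest P-value
        X_opt = np.delete(X_opt, worst, axis=1)  # STEP 4: remove it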
Forward Selection

STEP 1: Select a significance level to enter the model (e.g. SL = 0.05 = 5%).

STEP 2: Fit all simple regression models Y ~ Xᵢ and select the one with the lowest P-value.

STEP 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have.

STEP 4: Consider the predictor with the lowest P-value. If P < SL, go to STEP 3; otherwise go to FIN.

FIN: Keep the previous model.

Your model is ready when, in STEP 4, you cannot find an independent variable with a P-value less than the significance level. Our model is the previous one and not the last one.
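A minimal sketch of forward selection along the same lines (again not the official course code; column 0 of X is assumed to be the constant):

import numpy as np
import statsmodels.api as sm

def forward_selection(X, Y, SL=0.05):
    X = np.array(X, dtype=float)
    n, p = X.shape
    selected = [0]                          # start with the constant only
    while True:
        best_p, best_j = 1.0, None
        for j in range(1, p):               # try every predictor not yet selected
            if j in selected:
                continue
            model = sm.OLS(Y, X[:, selected + [j]]).fit()
            p_value = model.pvalues[-1]     # P-value of the newly added predictor
            if p_value < best_p:
                best_p, best_j = p_value, j
        if best_j is None or best_p >= SL:
            return selected                 # FIN: keep the previous model
        selected.append(best_j)             # keep the best new predictor and repeat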
Bidirectional Elimination (Stepwise Regression)

STEP 1: Select a significance level to enter and a significance level to stay in the model (e.g. SLENTER = 0.05 = 5% and SLSTAY = 0.05 = 5%).

STEP 2: Perform the next step of Forward Selection (a new variable must have P < SLENTER to enter).

STEP 3: Perform all steps of Backward Elimination (old variables must have P < SLSTAY to stay).

STEP 4: When no new variable can enter and no old variable can exit, go to FIN.

FIN: Your model is ready.

Your model is ready when you can no longer add a new independent variable in STEP 2 nor delete an old independent variable in STEP 3.
All Possible Models

STEP 1: Select a criterion of goodness of fit, e.g. the Akaike criterion AIC = 2k − 2·ln(L), where k is the number of parameters to estimate and L is the maximum of the likelihood function of the model.

STEP 2: Construct all possible linear regression models: 2ᴺ − 1 total combinations for N candidate variables.

STEP 3: Select the one with the best criterion.

FIN: Your model is ready. For example, 10 columns means 2¹⁰ − 1 = 1023 models.

It is the best method, but it is also the most resource-intensive one. If you have a lot of variables, this method is not recommended since it will take a lot of time and resources.
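A minimal sketch (not the course's code) of this exhaustive search with statsmodels, assuming column 0 of X is the constant:

from itertools import combinations
import numpy as np
import statsmodels.api as sm

def best_subset_by_aic(X, Y):
    X = np.array(X, dtype=float)
    p = X.shape[1]
    best_aic, best_cols = float('inf'), None
    for k in range(1, p):                          # subset sizes 1 .. p-1
        for cols in combinations(range(1, p), k):  # 2**(p-1) - 1 candidate models
            model = sm.OLS(Y, X[:, [0] + list(cols)]).fit()
            if model.aic < best_aic:               # statsmodels reports the AIC of each fit
                best_aic, best_cols = model.aic, cols
    return best_cols, best_aic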
Implementation

We will focus on the second method, the Backward Elimination method, because it is the fastest and most efficient one.

If the Dependent Variable Y is numerical, we use a Regression model; if it is categorical, we use a Classification model.

In this section, we deal with a business problem where we have to predict the profit of start-ups based on several available pieces of information. This information will be our independent variables. We have 50 observations.

Y: Profit
X₁: R&D Spend
X₂: Administration Spend
X₃: Marketing Spend
X₄: State (New York, California and Florida)

Choose the right working directory: it contains the file 50_Startups.csv.

Save your Python file in the same working directory.

Take care of the data preprocessing phase:

We go back to the file "data_preprocessing.py", take all the sections of code and run them one by one.

We import the libraries that we use every time: numpy, pandas and matplotlib.pyplot.

We change the dataset to '50_Startups.csv'.

We import the dataset and create the matrix of IVs X; the DV Y is the last column.

The 50_Startups dataset contains the columns Profit, R&D Spend, Administration, Marketing and State.

We imagine that we are working for a foundation that wonders which start-up to invest in; to help the investors choose, we will build a Multiple Linear Regression model that will be able to understand the correlations between these data (the IVs and the DV).

There is no missing data, so we delete the missing-data section from the code.

Management of the categorical data: State is composed of three categories.

We take care of the index of the State column: we replace 0 by 3 in the initial code.

It is a nominal categorical variable: there is no order relation between the different categories, so we create three dummy variables, one for each category.

LabelEncoder transforms the text into the numerical values 0, 1 and 2.

OneHotEncoder creates three dummy variables corresponding to New York, California and Florida.

The DV Y is numerical, not categorical, so we do not encode it.
Initial matrix X of IVs before transformation, and the State variable after transformation into dummy variables (California, Florida, New York):

    R&D Spend    Admin        Marketing    State       Cal  Fl  NY
    165,349.201  136,897.80   471,784.10   New York    0    0   1
    162,597.70   151,377.59   443,898.53   California  1    0   0
    153,441.51   101,145.55   407,934.54   Florida     0    1   0
    144,372.41   118,671.85   383,199.62   New York    0    0   1
    142,107.34    91,391.77   366,168.42   Florida     0    1   0
    131,876.9     99,814.71   362,861.36   New York    0    0   1
    134,615.46   147,198.87   127,716.82   California  1    0   0
    130,298.13   145,530.06   323,876.68   Florida     0    1   0
    120,542.52   148,718.95   311,613.29   New York    0    0   1
    123,334.88   108,679.17   304,981.62   California  1    0   0
    101,913.08   110,594.11   229,160.95   Florida     0    1   0

After encoding, the matrix X of IVs contains six columns: the three dummy variables corresponding to the State variable are the first three columns of X, followed by R&D Spend, Admin and Marketing.

California corresponds to the first column, Florida to the second column and New York to the third column.

We must avoid the dummy variable trap, so we have to remove one dummy variable. We add this line in the encoding section:

X=X[:,1:]

We remove the first column, i.e. we keep the columns of X from the second one onwards (we could have chosen to drop any one of the three dummy columns).

    Fl  NY   R&D Spend    Admin        Marketing
    0   1    165,349.201  136,897.80   471,784.10
    0   0    162,597.70   151,377.59   443,898.53
    1   0    153,441.51   101,145.55   407,934.54
    0   1    144,372.41   118,671.85   383,199.62
    1   0    142,107.34    91,391.77   366,168.42
    0   1    131,876.9     99,814.71   362,861.36
    0   0    134,615.46   147,198.87   127,716.82
    1   0    130,298.13   145,530.06   323,876.68
    0   1    120,542.52   148,718.95   311,613.29
    0   0    123,334.88   108,679.17   304,981.62
    1   0    101,913.08   110,594.11   229,160.95

After this removal, Florida is the first column and New York is the second column.
We create the training set and the test set: we choose 20% of the observations for the test set and 80% for the training set (i.e., 10 observations for the test set and 40 observations for the training set).

We build the model on the correlations in the 40 observations of the training set and we will make new predictions on the 10 observations of the test set.

The training set (40 observations) contains X_train and Y_train. The test set (10 observations) is composed of X_test and Y_test.

For the 'Feature Scaling' section, as in the Simple Linear Regression model, the coefficients can adapt their scale so that the products (coefficient × IV) are on the same scale.

We don't need the 'Feature Scaling' section.

The data preprocessing phase is completed.
1) Importing the Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2) Importing the Dataset

dataset=pd.read_csv('50_Startups.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,4].values

3) Encoding Categorical Data (the old OneHotEncoder(categorical_features=[3]) API is replaced by ColumnTransformer in recent scikit-learn versions)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,3]=labelEncoder_X.fit_transform(X[:,3])
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
X=X[:,1:]

4) Splitting the dataset into the Training Set and Test Set

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2, random_state=0)
In order to build the Multiple Linear Regression model, we only need to copy what we did in the Simple Linear Regression model for the part where we build the model. For the part where we make new predictions, we do almost the same work, up to a small modification.

From the class LinearRegression we create the object 'regressor', then we link the regressor to the training set (X_train and Y_train).

5) Fitting Multiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
In order to establish new predictions, we take the object ‘regressor’, then we use the method
predict which is applied to the test set
6) Predicting the Test set results

Y_pred = regressor.predict(X_test)

    Fl  NY   R&D        Admin      Marketing   Y_test     Y_pred
    1   0     66051.5    182646    118148      103282     103015
    0   0    100672       91790.6  249745      144259     132582
    1   0    101913      110594    229161      146122     132448
    1   0     27892.9     84710.8  164471       77798.8    71976.1
    1   0    153442      101146    407935      191050     178537
    0   1     72107.6    127865    353184      105008     116161
    0   1     20229.6     65947.9  185265       81229.1    67851.7
    0   1     61136.4    152702     88218.2     97483.6    98791.7
    1   0     73994.6    122783    303319      110352     113969
    1   0    142107       91391.8  366168      166188     167921

Y_pred ≈ Y_test: there seem to be good linear correlations between the Profit and the IVs. The Multiple Linear Regression model is significant.

The Mean Squared Error (MSE) assesses the quality of a predictor (the model).

The Root Mean Squared Error (RMSE) measures the differences between the values predicted by the model and the observed values. It is the square root of the MSE.

MSE = (1/n) Σᵢ (Yᵢ − Ŷᵢ)²

7) Model Evaluation

from sklearn.metrics import mean_squared_error
MSE=mean_squared_error(Y_test, Y_pred)
RMSE=np.sqrt(MSE)

RMSE = 9137.990152794797    The model is significant.

How to make a new prediction?

For the SLR model we predicted from a single value; here we have several Independent Variables, so we will make a change.

We have to enter (in order) the values of the Independent Variables corresponding to the information of the new start-up.

Take R&D Spend = 130,000, Admin = 140,000, Marketing = 300,000 and State = New York.

To enter the state, we are not going to type New York as a string of characters, because the predict method expects something like X_test (i.e., the state variable is not in a single column but in dummy variables, so you have to enter the 0/1 combination corresponding to New York).

After removing the first dummy column, the remaining dummy columns are Florida and New York, so New York corresponds to 0 (Florida) and 1 (New York).

regressor.predict(0,1,130000,140000,300000)
It doesn't work.

When we have several Independent Variables, we must enter the new information in tabular form, because this sequence of numbers (0,1,130000,140000,300000) has no meaning for Python: the method expects the new information as a row vector. We use the NumPy library (np).

regressor.predict(np.array(0,1,130000,140000,300000))
It doesn't work.

regressor.predict(np.array([[0,1,130000,140000,300000]]))
≈ 158,691.75

The information of the new observation is a row vector (a 2-D array with one row), not a column vector.
Coefficient of Determination: Adjusted R²

Does adding a new variable improve the model or not?

R² = SSR/SST = 1 − SSE/SST

Compare the R² of the model Y = b₀ + b₁X₁ + b₂X₂ with the R² obtained after adding a third variable: Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃.

Example: b₀ is the initial salary, X₁ the years of experience in the company, X₂ the years of study and X₃ the years of experience before joining the company.

The ordinary least squares method minimizes SSE, so R², which measures the quality of the prediction, never decreases when a variable is added.


If you add a new variable to your model, it will influence it in some way: either adding a third variable (for example) helps to minimize SSE (i.e., the Multiple Linear Regression model will find a coefficient that further minimizes SSE), or SSE remains the same and does not change (i.e., the model will take the associated coefficient ≈ 0). So R² will either increase or stay (almost) the same.

Take the example of the salary of an employee in a company, where X₁ is the years of experience, X₂ is the number of years of study and X₃ is the last digit of their phone number.

Even if the variable X₃ has nothing to do with the salary, there will be a slight correlation with it, due to chance, and the model will take it into account. Since the real correlation is essentially zero, the coefficient associated with X₃ will be almost zero; after adding X₃, the model will either minimize SSE a little more or keep its value.

By adding a new variable, R² will still increase, even if only very slightly. That is why we need a new measure that takes into account the number of independent variables and measures the quality of the model: the adjusted R-squared, denoted Adj R².
Penalty Coefficient

R² = SSR/SST = 1 − SSE/SST

Adj R² = 1 − (1 − R²) × (n − 1)/(n − p − 1)

p: the number of independent variables
n: the size of the sample

Sum of Squares    Definition           Notation   Degrees of Freedom
Total             Σᵢ (Yᵢ − Ȳ)²        SST        n − 1
Regression        Σᵢ (Ŷᵢ − Ȳ)²        SSR        p
Error             Σᵢ (Yᵢ − Ŷᵢ)²       SSE        n − p − 1

Constraint: (n − 1) = p + (n − p − 1), just as SST = SSR + SSE.
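As a check of the formula, with the 50_Startups model built later in this section (n = 50 observations, p = 5 independent variables) and R² = 0.94852, we get Adj R² = 1 − (1 − 0.94852) × 49/44 ≈ 0.9427, which is exactly the value reported below.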
The Adj R² penalizes you when you add a variable that is not correlated with Y.

The number of independent variables appears in the denominator of the penalty factor (n − 1)/(n − p − 1): when you add an independent variable, this factor increases, so the quantity (1 − R²) × (n − 1)/(n − p − 1) tends to increase and the Adj R² decreases.

On the other hand, when the number of independent variables increases, R² increases, so the quantity (1 − R²) decreases and pushes the Adj R² up.

The penalty factor counterbalances this increase by reducing the Adj R².

If you add a regressor that really improves your model (strongly correlated with Y), then R² increases considerably and the Adj R² increases as well, even though the penalty factor also grows.

Adj R² is an excellent metric. It is a very powerful tool to measure the quality of your model.
Encoding the Independent Variable

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,3]=labelEncoder_X.fit_transform(X[:,3])
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
X=X[:,1:]

Fitting Multiple Linear Regression (here on the whole dataset)

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X,Y)

Predicting and computing R² and Adj R²

Y_pred=regressor.predict(X)
from sklearn.metrics import r2_score
r2=r2_score(Y, Y_pred)
SSE=sum((Y-Y_pred)**2)
SST=sum((Y-np.mean(Y))**2)
r_squared=1-SSE/SST
adjusted_r_squared=1-(1-r_squared)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
Adj_r2=1-(1-r2)*((len(Y)-1)/(len(Y)-X.shape[1]-1))

R² = 0.9485223547171563
Adj R² = 0.9426726222986513
We now add a new variable S that has nothing to do with the profit:

S=np.array([1,2,1,0,1,2,1,0,1,2,1,0,1,2,1,2,1,2,1,0,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,0,1,2,1,2,1,2,1,2,1,2,1,2,1,0])

HT=np.corrcoef(S,Y)[0,1]    # HT = -0.09997243: weak correlation between S and Y

U=np.c_[X,S]

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(U,Y)
Y_pred=regressor.predict(U)
from sklearn.metrics import r2_score
r2=r2_score(Y, Y_pred)
n, p = 50, 6
adjr2=1-(1-r2)*((n-1)/(n-p-1))

Without S:   R² = 0.9485223547171563    Adj R² = 0.9426726222986513
With S:      R² = 0.9507757894286009    Adj R² = 0.9401821291363964

R² increases slightly, but Adj R² decreases: the variable S does not improve the model.
Coefficients of the MLR and Evaluation of the model

Y = b₀ + b₁X₁ + … + b₅X₅

Constant=regressor.intercept_
Coefficients=regressor.coef_

    Coefficient   Value
    b₀            42554.16761772438
    b₁            -9.59284160e+02
    b₂             6.99369053e+02
    b₃             7.73467193e-01
    b₄             3.28845975e-02
    b₅             3.66100259e-02

from sklearn import metrics
MAE= metrics.mean_absolute_error(Y_test,Y_pred)
MSE= metrics.mean_squared_error(Y_test,Y_pred)
RMSE= metrics.mean_squared_error(Y_test,Y_pred)**0.5

MSE = 83502864.03257468
RMSE = 9137.990152794797
Matrix form of the Multiple Linear Regression model

Y = b₀ + b₁X₁ + … + bₚXₚ + ε

For each observation i:
Y(i) = b₀ + b₁X₁(i) + … + bₚXₚ(i) + ε(i)

Stacking the n observations, Y = AB + ε, where Y is the vector of observations, A is the matrix whose i-th row is (1, X₁(i), …, Xₚ(i)), B = (b₀, b₁, …, bₚ)ᵀ is the vector of coefficients and ε is the vector of errors. The ordinary least squares estimator is B̂ = (AᵀA)⁻¹AᵀY.

Example: with the observations (X₁, X₂, Y) = (-1,-1,-4), (-1,1,2), (1,-1,0), (1,1,2):

A = | 1  -1  -1 |        Y = | -4 |
    | 1  -1   1 |            |  2 |
    | 1   1  -1 |            |  0 |
    | 1   1   1 |            |  2 |

AᵀA = | 4  0  0 |        AᵀY = | 0 |
      | 0  4  0 |              | 4 |
      | 0  0  4 |              | 8 |

B̂ = (AᵀA)⁻¹AᵀY = (0, 1, 2)ᵀ, so the fitted model is Ŷ = X₁ + 2X₂.
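A quick NumPy check of this worked example (a minimal sketch, not part of the original slides):

import numpy as np

A = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1]])
Y = np.array([-4, 2, 0, 2])
B = np.linalg.inv(A.T @ A) @ A.T @ Y   # least squares estimator (A^T A)^(-1) A^T Y
print(A.T @ A)                         # 4 times the identity matrix
print(B)                               # [0. 1. 2.]  ->  Y_hat = X1 + 2*X2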
Template Multiple Linear Regression

1) Importing the Libraries
import numpy as np
import pandas as pd

2) Importing the Dataset
dataset=pd.read_csv('50_Startups.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,4].values

3) Encoding Categorical Data
3.1) Encoding the Independent Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X=LabelEncoder()
X[:,3]=labelEncoder_X.fit_transform(X[:,3])
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
3.2) Avoiding the Dummy Variable Trap
X=X[:,1:]

4) Splitting the dataset into the Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

5) Fitting the MLR Model to the Training Set
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,Y_train)

6) Predicting the Test Set Results
Y_pred=regressor.predict(X_test)
The Backward Elimination method

To display the full matrix X in the console:

np.set_printoptions(threshold=np.nan)    (in recent NumPy versions, use threshold=sys.maxsize instead)

The matrix X now starts with the two dummy columns followed by R&D Spend, Admin and Marketing, for example:

    0  1  165,349.201  136,897.80  471,784.10
    0  0  162,597.70   151,377.59  443,898.53
    1  0  153,441.51   101,145.55  407,934.54
    0  1  144,372.41   118,671.85  383,199.62
    .  .  .            .           .
    1  0  1315.46      115816      297114
    0  0  0            135427      0
    0  1  542.05       51743.2     0
    0  0  0            116984      45173.1
We have implemented the Multiple Linear Regression model and fitted it to the training set.

Do we have an optimal model?

No: when we built the model, we used all the Independent Variables.

What happens if, among these variables, some are highly statistically significant (that is, variables that have a great impact on the Dependent Variable Profit) and some others have no influence on the Dependent Variable?

If we remove these latter variables (without influence), we will certainly have a much more significant model.

We want to find an optimal team of IVs such that each one has a great impact on the DV. This effect can be positive (i.e., when the variable increases, the profit Y increases) or negative (i.e., when it increases, the DV Y decreases).
For the Backward Elimination, we need to import the statsmodels library:

import statsmodels.api as sm

Multiple Linear Regression:   Y = b₀ + b₁X₁ + … + bₙXₙ

The constant b₀ is not associated with any Independent Variable, and statsmodels does not add it automatically, so we need to add a column vector of ones (with 50 lines) to the matrix of features X. The equation then becomes

Y = b₀X₀ + b₁X₁ + … + bₙXₙ,   with X₀ = 1

X=np.append(arr=X,values=np.ones((50,1)).astype(int), axis=1)

(without .astype(int), a data type error can occur)

This adds the column at the end of the matrix X, because axis=1 appends a column (with axis=0 we would append a line). In our case, we need to add the column at the beginning of the matrix X, so we swap the arguments:

X=np.append(arr=np.ones((50,1)).astype(int), values=X,axis=1)

The Backward Elimination consists of including all the Independent Variables at first and then removing, one by one, the Independent Variables that are not statistically significant.
We start by taking all the lines and all the columns:

X_opt=X[:,[0,1,2,3,4,5]]

In STEP 2 of the Backward Elimination, we need to fit the full model. Since we have introduced a new library, we fit X_opt with a new regressor related to that library: we create a 'regressor_OLS' object of the OLS class.

There are two parameters for the OLS class: endog for the Dependent Variable and exog for the Independent Variables X_opt.

regressor_OLS=sm.OLS(endog=Y,exog=X_opt).fit()

which is equivalent to

regressor_OLS=sm.OLS(Y,X_opt).fit()

STEP 2 of the Backward Elimination is done. Now, we look at the P-values of the IVs:

regressor_OLS.summary()

This command gives a table of the statistical information of the model (IVs, R-squared, Adjusted R-squared, AIC, BIC, P-values, …).

The constant always has a near-zero P-value here.

We look for the highest P-value. If it is above the significance level 0.05, we go to STEP 4: we remove the associated Independent Variable and then fit the model without this variable.

We repeat this action until the highest P-value is no longer above the significance level 0.05; then the model is ready.
6) Building the Optimal Model Using Backward Elimination

import statsmodels.api as sm

X=np.append(arr=np.ones((50,1)).astype(int), values=X,axis=1)

X_opt=X[:,[0,1,2,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()

The matrix X_opt now looks like:

    1  0  1  165349   136898    471784
    1  0  0  162598   151378    443899
    1  1  0  153442   101146    407935
    1  0  1  144372   118672    383200
    1  1  0  142107   91391.8   366168
    1  0  1  131877   99814.7   362861
    .  .  .  .        .         .
    1  0  1  542.05   51743.2   0
    1  0  0  0        116984    45173.1
OLS Regression Results
R-squared: 0.951
Adj. R-squared: 0.945
F-statistic: 169.9
Prob (F-statistic): 1.34e-27
Log-Likelihood: -525.38
AIC: 1063.
BIC: 1074.

Coef std err t P>|t| [0.025 0.975]


-----------------------------------------------------------------------------------------
Const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
X1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
X2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
X3 0.8060 0.046 17.369 0.000 0.712 0.900
X4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
X5 0.0270 0.017 1.574 0.123 -0.008 0.062
X_opt=X[:,[0,1,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()

OLS Regression Results


R-squared: 0.951
Adj. R-squared: 0.946
F-statistic: 217.2
Prob (F-statistic): 8.49e-29
Log-Likelihood: -525.38
AIC: 1061.
BIC: 1070.
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Const 5.011e+04 6647.870 7.537 0.000 3.67e+04 6.35e+04
X1 220.1585 2900.536 0.076 0.940 -5621.821 6062.138
X2 0.8060 0.046 17.606 0.000 0.714 0.898
X3 -0.0270 0.052 -0.523 0.604 -0.131 0.077
X4 0.0270 0.017 1.592 0.118 -0.007 0.061
X_opt=X[:,[0,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
OLS Regression Results
R-squared: 0.951
Adj. R-squared: 0.948
F-statistic: 296.0
Prob (F-statistic): 4.53e-30
Log-Likelihood: -525.39
AIC: 1059
BIC: 1066

coef std err t P>|t| [0.025 0.975]


------------------------------------------------------------------------------
Const 5.012e+04 6572.353 7.626 0.000 3.69e+04 6.34e+04
X1 0.8057 0.045 17.846 0.000 0.715 0.897
X2 -0.0268 0.051 -0.526 0.602 -0.130 0.076
X3 0.0272 0.016 1.655 0.105 -0.006 0.060
X_opt=X[:,[0,3,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
OLS Regression Results
R-squared: 0.950
Adj. R-squared: 0.948
F-statistic: 450.8
Prob (F-statistic): 2.16e-31
Log-Likelihood: -525.54
AIC: 1057.
BIC: 1063.

coef std err t P>|t| [0.025 0.975]


------------------------------------------------------------------------------
Const 4.698e+04 2689.933 17.464 0.000 4.16e+04 5.24e+04
X1 0.7966 0.041 19.266 0.000 0.713 0.880
X2 0.0299 0.016 1.927 0.060 -0.001 0.061
X_opt=X[:,[0,3]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()

OLS Regression Results


R-squared: 0.947
Adj. R-squared: 0.945
F-statistic: 849.8
Prob (F-statistic): 3.50e-32
Log-Likelihood: -527.44
AIC: 1059.
BIC: 1063.

coef std err t P>|t| [0.025 0.975]


------------------------------------------------------------------------------
Const 4.903e+04 2537.897 19.320 0.000 4.39e+04 5.41e+04
X1 0.8543 0.029 29.151 0.000 0.795 0.913
6) Building the Optimal Model Using Backward Elimination

import statsmodels.api as sm
X=np.append(arr=np.ones((50,1)).astype(int), values=X,axis=1)
X_opt=X[:,[0,1,2,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,1,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,3,4,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,3,5]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
X_opt=X[:,[0,3]]
regressor_OLS=sm.OLS(endog = Y,exog = X_opt).fit()
regressor_OLS.summary()
