Mahdi Louati
3 GLID
October 11th, 2022
Contents

0. Welcome to Machine Learning
   0.1. Why Machine Learning is the Future
   0.2. What is Machine Learning
   0.3. Installing Python and Anaconda

1. Data Preprocessing
   1.1. Importing the Libraries
   1.2. Importing the Dataset
   1.3. Missing Data
   1.4. Categorical Data
   1.5. Training Set and Test Set
   1.6. Feature Scaling

2. Regression Models
   2.1. Simple Linear Regression (SLR)
   2.2. Multiple Linear Regression (MLR)
   2.3. Polynomial Regression
   2.4. Support Vector Regression (SVR)
   2.5. Decision Tree Regression
   2.6. Random Forest Regression
   2.7. Evaluating Regression Models
02. Regression Models

2.1. Simple Linear Regression (SLR)
2.2. Multiple Linear Regression (MLR)
2.3. Polynomial Regression
2.4. Support Vector Regression (SVR)
2.5. Decision Tree Regression
Regression models are used to predict a continuous real value, such as a salary, the price of an apartment, etc. If your independent variable is time, then your model predicts future values; otherwise, your model predicts unknown values.
2. Regression Models

2.1. Simple Linear Regression (SLR)
Linear regression is the simplest and, when it works, the most preferred predictive model, because it is easily interpretable, unlike nonlinear models that keep their secrets if we stick to their coefficients.
We will discover the basics of the Simple Linear Regression (SLR) model. In an SLR model, we predict the value of one variable Y based on another variable X. Why is it called simple? Because it examines the relationship between only two variables.
We read the dataset with pandas and separate the independent variable X (years of experience) from the dependent variable Y (salary):

import pandas as pd
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

A first look at the data with a scatter plot:

import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1]
Y = dataset.iloc[:, -1]
plt.scatter(X, Y, color='red')
plt.title('Salary vs Experience')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.show()

The first rows of Salary_Data.csv (Experience in years, Salary):

     Experience   Salary
0       1.1       39343
1       1.3       46205
2       2.0       43525
3       2.2       39891
4       2.9       56642
5       3.0       60150
6       3.7       57189
7       4.5       61111
8       5.1       66029
9       6.0       93940
10      7.1       98273
To obtain a relationship between the years of experience and the salary, we need a mathematical model: we look for the straight line that best represents the data points.
Y = b₀ + b₁ · X        (Salary = b₀ + b₁ · Experience)

where b₁ (the coefficient) is the slope of the line and b₀ (the constant) is the Y-intercept, i.e., the value of Y when X = 0. Once the coefficients b₀ and b₁ are determined, you have obtained the SLR model: there is a correlation between the IV X and the DV Y.

Another example: Grade of a student = b₀ + b₁ · Number of working hours.
The SLR is used to model the salary progression as a function of the number of years of work. It is based on real observations that have occurred and that we want to integrate into our model.

S = Σᵢ (yᵢ − ŷᵢ)²

The linear regression considers all the possible lines and keeps the one that minimizes S; it is actually the best line (that is the ordinary least squares method).
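As a minimal numerical sketch (using the first rows of the dataset preview above), we can compare the sum of squared errors S of an arbitrary line with the line returned by ordinary least squares; np.polyfit is used here only as a convenient way to obtain the minimizing line:

import numpy as np

# A few (experience, salary) points from the dataset preview above
x = np.array([1.1, 1.3, 2.0, 2.2, 2.9, 3.0, 3.7, 4.5, 5.1, 6.0, 7.1])
y = np.array([39343, 46205, 43525, 39891, 56642, 60150, 57189, 61111, 66029, 93940, 98273])

def S(b0, b1):
    # Sum of the squared differences between observed and predicted salaries
    return np.sum((y - (b0 + b1 * x)) ** 2)

print(S(30000, 5000))                  # S for an arbitrary candidate line
b1_opt, b0_opt = np.polyfit(x, y, 1)   # least squares slope and intercept
print(S(b0_opt, b1_opt))               # smaller than S for any other line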
Implementation
We reuse the data preprocessing template and adapt it to the salary problem.

Importing the Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

For the salary example, we replace 'Data.csv' by 'Salary_Data.csv' and take the last column as Y (Y = dataset.iloc[:, -1].values).

Taking care of missing values

from sklearn.preprocessing import Imputer  # replaced by sklearn.impute.SimpleImputer in scikit-learn >= 0.22
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Salary_Data.csv has no missing data, so this section is deleted for the salary example.
Encoding categorical data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X = LabelEncoder()
X[:, 0] = labelEncoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])  # categorical_features was removed in recent scikit-learn versions (use ColumnTransformer instead)
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

Salary_Data.csv has no categorical variable, so this section is also deleted for the salary example.
Splitting the dataset into Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)

(The template uses test_size=0.2; for the salary example we take test_size=1/3.)
Feature scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # use the scaler fitted on the training set

Feature scaling is not needed here: the Dependent Variable expresses itself as a linear combination of the Independent Variables, so even if the variables are not on the same scale, the quantities (coefficient × IV) will be on the same scale.
To build the model, we import the class LinearRegression from the module linear_model of the library scikit-learn, create the object 'regressor' from this class, and fit it to the training set using the fit method:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
The ordinary least squares method then finds the optimal coefficients, so that the regression line is as close as possible to all the points in the sense of the minimum sum of squared distances between the observation points and the prediction points.
Y_pred = regressor.predict(X_test)

Y_pred ≈ Y_test

We can also make new predictions on observations outside the test set; the predict method expects a 2-D array, for example for 15 years of experience:

regressor.predict([[15]])
Visualization

We compare, on the same graph, the observation points and the predictions. We use the module pyplot of the library matplotlib; for the observation points we use the function 'scatter':

plt.scatter(X_train, Y_train, color='red')

For the predictions, we want to draw the regression line and not the points, so we use the function 'plot' instead of 'scatter':

plt.plot(X_train, regressor.predict(X_train), color='blue')

We can give the graph a title:

plt.title('Salary vs experience')

We can add axis labels:

plt.xlabel('Experience')
plt.ylabel('Salary')

To display the graph:

plt.show()
plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Salary vs experience')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()
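A natural follow-up (a sketch, assuming the variables defined above) is to plot the test-set observations against the same regression line, which was fitted on the training set:

plt.scatter(X_test, Y_test, color='red')                      # real test observations
plt.plot(X_train, regressor.predict(X_train), color='blue')   # line learned on the training set
plt.title('Salary vs experience (Test set)')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()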
Errors of Prediction

from sklearn import metrics
MAE = metrics.mean_absolute_error(Y_test, Y_pred)
MSE = metrics.mean_squared_error(Y_test, Y_pred)
RMSE = metrics.mean_squared_error(Y_test, Y_pred)**0.5

MAE = 3737.4178618788987
MSE = 23370078.800832972
RMSE = 4834.260936361728
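These metrics can be cross-checked by hand with numpy (a sketch, assuming Y_test and Y_pred from above):

import numpy as np
errors = Y_test - Y_pred
MAE = np.mean(np.abs(errors))   # mean absolute error
MSE = np.mean(errors ** 2)      # mean squared error
RMSE = np.sqrt(MSE)             # root mean squared error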
Is it a significant model?

Criteria

C1: X is known without errors (X_real = X_measured).
C3: The error is independent of the variable X, i.e., V(ε) is constant and does not depend on X (property of homoscedasticity).
C4: There is an average linear relation between X and Y, i.e., knowing X = x, E(Y) = αx + β.
Examples
The correlation coefficient is a measure of the linear relationship between two variables:

r(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

It quantifies the strength of the linear relationship between X and Y. A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation.

For the model Y = b₀ + b₁ · X, the least squares estimates are

b₁ = Cov(X, Y) / Var(X)   and   b₀ = mean(Y) − b₁ · mean(X)

For the salary data: b₁ = 9449.962321455081 and b₀ = 25792.200198668666.
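These formulas translate directly into numpy (a sketch; x and y are assumed to be the experience and salary arrays, as in the earlier sketch, while the exact values 9449.96 and 25792.20 come from the full dataset):

import numpy as np
r = np.corrcoef(x, y)[0, 1]                      # correlation coefficient r(X, Y)
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope = Cov(X, Y) / Var(X)
b0 = y.mean() - b1 * x.mean()                    # intercept = mean(Y) - b1 * mean(X)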
Coefficient of Determination

The R-squared (R²), also called the coefficient of determination, is a useful and important parameter in statistics:

R² = 1 − SS_res / SS_tot

where SS_res = Σᵢ (yᵢ − ŷᵢ)² is the (minimized) sum of squared residuals and SS_tot = Σᵢ (yᵢ − ȳ)² is the total sum of squares around the mean of the observations. R² measures the quality of the SLR model compared to the average of the observations, and 0 ≤ R² ≤ 1.
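R² can be computed by hand and compared with scikit-learn's r2_score (a sketch, assuming Y_test and Y_pred from the SLR model above):

import numpy as np
from sklearn.metrics import r2_score
ss_res = np.sum((Y_test - Y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((Y_test - np.mean(Y_test)) ** 2)  # total sum of squares
R2 = 1 - ss_res / ss_tot
print(R2, r2_score(Y_test, Y_pred))               # the two values coincide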
Simple Linear Regression: Y = b₀ + b₁ · X
(b₀ is the constant, b₁ the coefficient)

Multiple Linear Regression: Y = b₀ + b₁X₁ + b₂X₂ + … + bₙXₙ
(b₀ is the constant, b₁, …, bₙ the coefficients)

b₀ is the value of Y when X₁ = X₂ = … = Xₙ = 0.

Example: Grade = b₀ + b₁ · (number of hours of sleep just before the exam) + b₂ · (number of revised chapters).
Warning

Assumptions of the Linear Regression
If for one of your projects you need to build a robust Linear Regression model, you have to check these properties:

Exogeneity: the IVs X do not depend on the DV Y. This does not mean that there is no connection: since Y is the dependent variable, it still depends on the IVs X and on the error term.
Lack of multicollinearity of the IVs.
Homoscedasticity: the variance of the error term is the same across all values of the IVs.
Multivariate normality of the errors.
The Linear Regression is not our last stop; it is just an intermediate step before we start creating powerful new models.
The DUMMY variables

The dataset columns are: Profit, R&D Spend, Admin, Marketing, State. The State column is categorical, so it is replaced by dummy variables, for example D₁ (New York) and D₂ (California), each equal to 1 for the corresponding state and 0 otherwise. The model would then be

Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + b₄D₁ + b₅D₂

But D₁ + D₂ = 1, so D₁ and D₂ are dependent.

What happens if we introduce in our model the two DUMMY variables New York and California? The model cannot distinguish the effect of D₁ from the effect of D₂: mathematically, you cannot keep the constant b₀ and both quantities b₄D₁ and b₅D₂ at the same time. That is the trap of dummy variables: always omit one dummy variable.
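A minimal sketch of one convenient way to apply this rule in practice (this is not the course's template, and the column names 'State' and 'Profit' are assumed from the table above): pandas can create the dummy columns and drop one of them directly.

import pandas as pd
dataset = pd.read_csv('50_Startups.csv')
# drop_first=True keeps n-1 dummy columns for a categorical variable with n levels,
# which avoids the dummy variable trap
X = pd.get_dummies(dataset.drop(columns='Profit'), columns=['State'], drop_first=True)
Y = dataset['Profit']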
Building a Model (Step-By-Step)

[Diagram: a set of candidate independent variables X₁, X₂, X₃, X₄, X₅, …, X₇ that could explain the dependent variable Y]
Why do we need to get rid of some IVs? Why not use them all in the model?

Garbage in, Garbage out (GIGO): if you throw a lot of garbage variables into your model, your model will not be reliable and may not be a good model, because it will not do what it is supposed to do, or at least it is less likely to work correctly.

If you include all the independent variables in your model, you must explain the impact of each variable on Y, which is very difficult and expensive to do when the number of variables is large.

Keep only the most important variables, those that have the most impact on the Dependent Variable.
How to build a model

There are five methods of building models:

All-in
Backward Elimination
Forward Selection
Bidirectional Elimination
Score Comparison
All-in

Include all the independent variables you have in the multiple regression model.
The methods for verifying a statistical hypothesis are called statistical tests. A statistical test is a decision procedure: accept or reject the hypothesis (i.e., the aim of a statistical test is to verify whether this hypothesis is false).

The P-value is a statistical tool developed by the statistician Ronald Fisher (1890-1962) and used to quantify statistical significance under null hypothesis testing. The general idea is to show that the null hypothesis is not verified, because if it were, the observed result would be highly improbable; it is therefore an extension of the principle of proof by contradiction. In statistics, a result is said to be statistically significant when the P-value is lower than α, the probability of rejecting the null hypothesis when it is true. The α probability is usually 0.05, but may vary by study and context. The P-value is the likelihood of the data assuming the null hypothesis.
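As a small illustration (a sketch using scipy, which is not part of the course code; the arrays reuse the salary preview values), pearsonr returns both the correlation coefficient and the P-value of the test whose null hypothesis is "no linear relation between X and Y":

import numpy as np
from scipy import stats
x = np.array([1.1, 1.3, 2.0, 2.2, 2.9, 3.0, 3.7, 4.5, 5.1, 6.0, 7.1])
y = np.array([39343, 46205, 43525, 39891, 56642, 60150, 57189, 61111, 66029, 93940, 98273])
r, p_value = stats.pearsonr(x, y)
print(r, p_value, p_value < 0.05)  # p_value < 0.05 -> statistically significant at the 5% level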
Backward Elimination

STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05).
STEP 2: Fit the model with all possible predictors.
STEP 3: Consider the predictor with the highest P-value. If P > SL, go to STEP 4; otherwise FIN (the model is ready).
STEP 4: Remove the predictor.
STEP 5: Fit the model without this variable, then go back to STEP 3.

Forward Selection

STEP 1: Select a significance level to enter the model (e.g. SL = 0.05).
STEP 2: Fit all simple regression models (Y ~ Xₙ). Select the one with the lowest P-value.
STEP 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have.
STEP 4: Consider the predictor with the lowest P-value. If P < SL, go to STEP 3; otherwise FIN.

Bidirectional Elimination

It combines the two previous methods; FIN when no new variables can enter and no old variables can exit.

Score Comparison

Construct all possible linear regression models (2ᴺ − 1 total combinations for N independent variables) and compare their scores. It is the best method but it is the most resource-intensive one: if you have a lot of variables this method is not recommended, since it will take a lot of time and resources.
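The Backward Elimination steps can be automated; this is only a rough sketch (not the course's code), assuming X is a numpy matrix of features that already contains the column of ones and Y is the profit vector:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, Y, sl=0.05):
    # Repeatedly fit the OLS model and drop the predictor with the highest P-value,
    # until every remaining predictor has a P-value below the significance level sl.
    # Note: this simple version treats the column of ones like any other column.
    X_opt = X.copy()
    while True:
        regressor_OLS = sm.OLS(endog=Y, exog=X_opt).fit()
        p_values = regressor_OLS.pvalues
        if p_values.max() <= sl:
            return X_opt, regressor_OLS
        X_opt = np.delete(X_opt, p_values.argmax(), axis=1)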
Implementation
We will focus on the second method, Backward Elimination, because it is the fastest and most efficient one.
In this section, we deal with a business problem where we have to predict the profit of start-ups based on several pieces of available information. This information will constitute our independent variables. We have 50 observations.

X₁: R&D Spend
X₂: Administration Spend
X₃: Marketing Spend
Y: Profit
We import the libraries that we use every time: numpy, pandas and matplotlib.pyplot. We change the dataset to '50_Startups.csv'.

We import the dataset and create the matrix of IVs; the DV Y is the last column.

There is no missing data, so we delete the missing-data section from the code.
After encoding the State variable and removing one dummy column, Florida corresponds to the first column and New York to the second one (for example, the row 1 0 101913.08 110594.11 229160.95 represents a Florida start-up).
We create the training set and the test set: we take 20% of the observations for the test set and 80% for the training set (i.e., 10 observations for the test set and 40 observations for the training set). We build the model on the 40 observations of the training set and then establish new predictions on the 10 observations of the test set. The training set (40 observations) contains X_train and Y_train; the test set (10 observations) is composed of X_test and Y_test.
For the 'Feature Scaling' section, as in the Simple Linear Regression model, no scaling is needed: the coefficients adapt their own scale so that the products (coefficient × IV) are on the same scale.

From the class 'LinearRegression' we create the object 'regressor' and we fit it to the training set (X_train and Y_train); it learns the correlations between the Profit and the IVs. We can then compare the predicted profits Y_pred with the actual profits Y_test of the test set.
The Root Mean Squared Error (RMSE) measures the differences between the values predicted by the model and the observed values. It is the square root of the MSE:

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²     RMSE = √MSE

7) Model Evaluation

from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(Y_test, Y_pred)
RMSE = np.sqrt(MSE)
To predict the profit of a new start-up, we have to enter (in order) the values of the Independent Variables corresponding to the information of this new start-up.

To enter the state, we are not going to type 'New York' as a string of characters, because the predict method expects something shaped like X_test (i.e., the state variable is not in a single column but in dummy variables, so we have to enter the 0/1 combination corresponding to New York). In the full dummy encoding New York is the third column, so its California and Florida dummies are both 0; after removing the first dummy column, New York corresponds to the combination 0 (Florida) and 1 (New York).

regressor.predict(0,1,130000,140000,300000)

It doesn't work. When we have several Independent Variables, we must enter the new information in tabular form, because the plain sequence of numbers (0,1,130000,140000,300000) has no meaning for Python: the predict method expects the new information as a row vector (a 2-D array). We use the library numpy (np).

regressor.predict(np.array(0,1,130000,140000,300000))

It doesn't work either.

regressor.predict(np.array([[0,1,130000,140000,300000]]))

This works.
Consider the R² of the model Y = b₀ + b₁X₁ and the R² obtained after adding a variable: Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃. The least squares method minimizes the SSE, and adding a variable can only decrease the SSE (or leave it unchanged), so R² never decreases when a variable is added.

Take the example of the salary of an employee in a company, where X₁ is the number of years of experience, X₂ is the number of years of study and X₃ is the last digit of his phone number. Even if the variable X₃ has nothing to do with the salary, there may be a slight correlation with it, linked only to chance, and the model will still use it. Since this correlation is essentially zero, the coefficient associated with X₃ will be almost zero: after the addition of the variable, the model will either decrease the SSE very slightly or keep its value.
By adding a new variable, R² will still increase, even if only very slightly. That is why we need a new measure that takes into account the number of independent variables and measures the quality of the model: the 'adjusted R-squared', denoted Adj R².

Adj R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

where p is the number of independent variables (regressors) and n is the number of observations; the ratio (n − 1)/(n − p − 1) plays the role of a penalty coefficient.
The Adj R² penalizes you when you add a variable that is not correlated with Y. The number of independent variables p appears in the denominator of the penalty coefficient, so when you add an independent variable the penalty coefficient increases; this tends to increase the quantity (1 − R²) · (n − 1)/(n − p − 1) and therefore to decrease the Adj R². On the other hand, adding an independent variable also increases R², so (1 − R²) decreases, which tends to increase the Adj R². The penalty coefficient counterbalances the automatic increase of R² by pulling the Adj R² down: if you add a regressor that really improves your model (strongly correlated with Y), then R² increases considerably and the Adj R² increases as well, even though the penalty works against it.
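The formula translates directly into a small helper (a sketch; the values used in the call are the R² of the full model, the 50 observations and the 5 predictors that appear later in this section):

def adjusted_r2(R2, n, p):
    # Adjusted R-squared from R-squared, n observations and p independent variables
    return 1 - (1 - R2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.951, 50, 5))  # about 0.945, consistent with the OLS summary at the end of the section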
To test the effect of adding a variable S (for example the last digit of the phone number), we can append it as an extra column of the matrix X:

U = np.c_[X, S]

For the fitted model Y = b₀ + b₁X₁ + b₂X₂ + …, the constant and the coefficients are read with:

Constant = regressor.intercept_
Coefficients = regressor.coef_

Here the constant is 42554.16761772438 and the coefficients are:
-9.59284160e+02
6.99369053e+02
7.73467193e-01
3.28845975e-02
3.66100259e-02

from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(Y_test, Y_pred)
RMSE = np.sqrt(MSE)
A small worked example of least squares in matrix form: take four observations with X₁ = (-1, -1, 1, 1), X₂ = (-1, 1, -1, 1) and Y = (-4, 2, 0, 2), and the design matrix (first column of ones for the constant)

A = [ 1 -1 -1 ]
    [ 1 -1  1 ]
    [ 1  1 -1 ]
    [ 1  1  1 ]

Then AᵀA = diag(4, 4, 4), so (AᵀA)⁻¹ = (1/4)·I, and the least squares coefficients are

b = (AᵀA)⁻¹ · AᵀY = (1/4) · (0, 4, 8)ᵀ = (0, 1, 2)ᵀ

hence the fitted model is Ŷ = X₁ + 2·X₂.
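The hand computation can be checked numerically with numpy (a sketch that simply rebuilds the matrix A and the vector Y as above):

import numpy as np
A = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1]], dtype=float)  # column of ones, then X1 and X2
Y = np.array([-4, 2, 0, 2], dtype=float)
b = np.linalg.inv(A.T @ A) @ A.T @ Y      # ordinary least squares: b = (AᵀA)⁻¹ AᵀY
print(A.T @ A)                            # diag(4, 4, 4)
print(b)                                  # [0. 1. 2.]  ->  Ŷ = X1 + 2·X2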
Template Multiple Linear Regression

1) Importing the Libraries
import numpy as np
import pandas as pd

2) Importing the Dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 4].values

3) Encoding Categorical Data
3.1) Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelEncoder_X = LabelEncoder()
X[:, 3] = labelEncoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features=[3])
X = onehotencoder.fit_transform(X).toarray()

3.2) Avoiding the Dummy Variable Trap
X = X[:, 1:]

4) Splitting the dataset into the Training Set and Test Set
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in recent versions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

5) Fitting MLR Model to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

6) Predicting the Test Set Result
Y_pred = regressor.predict(X_test)
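The template above uses the scikit-learn API of the time. As a hedged sketch (assuming the State column is at index 3, as in the template), the same pipeline can be written with the current API, where ColumnTransformer and OneHotEncoder(drop='first') replace the LabelEncoder/OneHotEncoder(categorical_features=...) combination and also avoid the dummy variable trap:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

# One-hot encode the State column (index 3); drop='first' removes one dummy column
ct = ColumnTransformer([('state', OneHotEncoder(drop='first'), [3])], remainder='passthrough')
X = ct.fit_transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, Y_train)
Y_pred = regressor.predict(X_test)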
The Backward Elimination method

To display the full matrix X (dummy State columns, then R&D Spend, Admin and Marketing):

np.set_printoptions(threshold=np.nan)  # with recent numpy versions, use threshold=sys.maxsize instead

A few rows of X look like:
1 0 1315.46 115816 297114
0 0 0 135427 0
0 1 542.05 51743.2 0
0 0 0 116984 45173.1
We have implemented the Multiple Linear Regression model and fitted it to the training set.

Do we have an optimal model?

No: when we built the model, we used all the Independent Variables. What happens if, among these variables, some are highly statistically significant (that is, variables that have a great impact on the Dependent Variable profit) and others have no influence on the Dependent Variable? If we remove these latter variables (without influence), we will certainly obtain a much more significant model.

The goal is to find an optimal team of IVs such that each one has a great impact on the DV. This effect can be positive (i.e., when the variable increases, the profit Y increases) or negative (i.e., when the variable increases, the DV Y decreases).
For the Backward Elimination, we need to import the library statsmodels:

import statsmodels.api as sm
Multiple Linear Regression: Y = b₀ + b₁X₁ + b₂X₂ + … + bₙXₙ

The constant b₀ is not associated with any Independent Variable, and the OLS class of statsmodels does not add it automatically, so we need to add to the matrix of features X a column vector (with 50 rows) whose coefficients are all equal to 1. Hence the equation becomes

Y = b₀X₀ + b₁X₁ + … + bₙXₙ, with X₀ = 1.

X = np.append(arr=X, values=np.ones((50,1)).astype(int), axis=1)

This adds the column at the end of the initial matrix X (axis=1 because we add a column; to add a row we would put axis=0). In our case, we need to add the column at the beginning of the matrix X:

X = np.append(arr=np.ones((50,1)).astype(int), values=X, axis=1)
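As an aside (a statsmodels convenience, not part of the original slides), the same column of ones can be added with add_constant, which prepends it by default:

import statsmodels.api as sm
X = sm.add_constant(X)  # equivalent to prepending a column of ones to X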
The Backward Elimination consists of including all the Independent Variables at first, and then removing one by one the Independent Variables that are not statistically significant.

We start by taking all the rows and all the columns:

X_opt = X[:, [0, 1, 2, 3, 4, 5]]
In Step 2 of the Backward Elimination we need to fit the model, but we have introduced a new library, so we fit X_opt with a new regressor related to this library: we create the object 'regressor_OLS' from the OLS class.

regressor_OLS = sm.OLS()

There are two parameters for the OLS class: endog for the Dependent Variable and exog for the Independent Variables X_opt.

regressor_OLS = sm.OLS(endog=Y, exog=X_opt).fit()

which is equivalent to

regressor_OLS = sm.OLS(Y, X_opt).fit()
Step 2 of the Backward Elimination is done. Now, we look at the P-values of the IVs:

regressor_OLS.summary()

This command prints the table of statistical information about the model (coefficients of the IVs, R-squared, Adjusted R-squared, AIC, BIC, P-values, …).
We look for the highest P-value: if it is above the significance level 0.05, we go to Step 4, remove the associated Independent Variable, and then fit the model without this variable. We repeat this procedure until the highest P-value is no longer above the significance level 0.05; at that point the model is ready.
6) Building the Optimal Model Using Backward Elimination

import statsmodels.api as sm  # older statsmodels versions exposed OLS through statsmodels.formula.api
X = np.append(arr=np.ones((50,1)).astype(int), values=X, axis=1)  # column of ones at the beginning of X
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=Y, exog=X_opt).fit()
regressor_OLS.summary()

After adding the column of ones, the rows of X look like:
1 0 1 542.05 51743.2 0
1 0 0 0 116984 45173.1
OLS Regression Results (first fit, with all the variables):

R-squared: 0.951
Adj. R-squared: 0.945
F-statistic: 169.9
Prob (F-statistic): 1.34e-27
Log-Likelihood: -525.38
AIC: 1063
BIC: 1074