13th LESSON (ANKUR - PROSCHOOL) - LINEAR REGRESSION CASE STU

LINEAR REGRESSION
Linear regression is a predictive modeling technique for predicting a numeric response variable
based on one or more explanatory variables. The term "regression" in predictive modeling
generally refers to any modeling task that involves predicting a real number (as opposed to
classification, which involves predicting a category or class). The term "linear" in the name linear
regression refers to the fact that the method models data with a linear combination of the
explanatory variables. A linear combination is an expression where one or more variables are
scaled by a constant factor and added together. In the case of linear regression with a single
explanatory variable, the linear combination can be expressed as:
response = intercept + constant ∗ explanatory
The right side of the equation defines a line with a certain y-intercept and a slope that multiplies the
explanatory variable. In other words, linear regression in its most basic form fits a straight line to
the response variable. The model is designed to fit the line that minimizes the sum of squared differences
between the observed and the predicted responses (these differences are also called errors or residuals).
We won't go into all the math behind how the model actually
minimizes the squared errors, but the end result is a line intended to give the "best fit" to the
data. Since linear regression fits data with a line, it is most effective in cases where the response
and explanatory variable have a linear relationship.
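As a minimal sketch of what "best fit" means with a single explanatory variable, the least-squares slope and intercept have a closed form. The snippet below applies it to the ten Square Feet / House Price rows used in this lesson. The function name fit_line is an illustrative choice and not part of any library.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    slope = np.sum(dx * dy) / np.sum(dx ** 2)   # covariance over variance
    intercept = y.mean() - slope * x.mean()     # the line passes through the means
    return intercept, slope

# The ten rows shown in the data overview of this lesson.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
house_price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

intercept, slope = fit_line(square_feet, house_price)
print(f"intercept = {intercept:.3f}, slope = {slope:.5f}")
```

This fit uses all ten rows and includes an intercept, so its slope is not expected to match the coefficients of the train-only models built later.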
First, let's load some libraries and the data, and then look at scatterplots of the variables to get a
sense of the shape of the data:
Load the following packages,
• Numpy – For numeric functions in python
• Pandas – Package to work with data preparations
• Matplotlib.Pyplot – For plots
• Sklearn – Scikit learn has several functions for machine learning algorithms
• Seaborn – For plots
• Statsmodels.formula.api – Linear regression function
Use Linear Regression to predict House Price based on Square Feet and City
In [1]: #importing packages
import numpy as np                    #For numeric functions in python
import pandas as pd                   #Package to work with data preparations
import matplotlib.pyplot as plt       #For plots
%matplotlib inline
import sklearn as sk                  #Scikit learn has several functions for machine learning algorithms
import seaborn as sns                 #For plots
import statsmodels.formula.api as sm  #Linear regression function (recent statsmodels releases expose OLS through statsmodels.api instead)
In [2]: #Load data

data = pd.read_csv("LR_11.csv", header=0)
OVERVIEW OF THE DATA

In [3]: data
Out[3]:
   House Price  Square Feet City
0          245         1400    N
1          312         1600    Y
2          279         1700    N
3          308         1875    Y
4          199         1100    N
5          219         1550    N
6          405         2350    Y
7          324         2450    Y
8          319         1425    Y
9          255         1700    N
In [4]: print("Data shape : ",data.shape)

print(data.head())
print(data.dtypes)
Data shape : (10, 3)

House Price Square Feet City
0 245 1400 N
1 312 1600 Y
2 279 1700 N
3 308 1875 Y
4 199 1100 N
House Price int64
Square Feet int64
City object
dtype: object
STEP 1: Get all categorical variables and create dummies
In [5]: obj = data.dtypes == object   #"np.object" was removed in NumPy 1.24; the builtin "object" is equivalent
print(obj)
House Price False

Square Feet False
City True
dtype: bool
In [6]: data.dtypes #"dtypes" returns the data type of each column; we use it to pick the columns to pass to "get_dummies",
#which converts the categories to dummy columns.
Out[6]: House Price int64

Square Feet int64
City object
dtype: object
DUMMY VARIABLE: In case we need to use categorical variables in our modelling, we need
to convert them into DUMMY variables.
We look at the unique values in a categorical variable. When the data is large, count the unique
values (for example with the pandas nunique method) instead of inspecting them by eye.
If a categorical variable has n unique values, we introduce (n-1) dummy variables.
Note that CITY is a categorical variable with 2 unique values, and hence one DUMMY VARIABLE will
be created.
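The (n-1) rule is easier to see with more than two categories. The sketch below uses a hypothetical three-city column (these city names are not in the lesson's data) and shows that three unique values give two dummy columns once the first category is dropped.

```python
import pandas as pd

# Hypothetical three-category column, used only to illustrate the (n-1) rule.
cities = pd.Series(["Delhi", "Mumbai", "Pune", "Mumbai", "Delhi"], name="City")

n_unique = cities.nunique()   # number of distinct categories
dummies = pd.get_dummies(cities, prefix="City", drop_first=True, dtype=int)

print(n_unique)
print(dummies)
```

A row with zeros in every dummy column belongs to the dropped (baseline) category, here Delhi.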
In [7]: data.columns[obj]
Out[7]: Index(['City'], dtype='object')
In [8]: dummydf = pd.DataFrame()
for i in data.columns[obj]:
    #"drop_first" drops the first category in order to avoid the multicollinearity problem
    #"prefix" adds a prefix to all the dummy variables created for a particular categorical variable
    #"dtype=int" keeps the 0/1 coding (newer pandas versions return True/False by default)
    dummy = pd.get_dummies(data[i], prefix=i, drop_first=True, dtype=int)
    #"pd.concat" combines all the dummy columns for all the categorical variables
    dummydf = pd.concat([dummydf, dummy], axis=1) # Concatenating Columns
print(dummydf)
City_Y
0 0
1 1
2 0
3 1
4 0
5 0
6 1
7 1
8 1
9 0
STEP 2: Merge the dummy data with the original data
In [10]: data1=data
data1=pd.concat([data1,dummydf], axis=1)
print("Head:\n", data1.head())
obj1=data1.dtypes==object
data1=data1.drop(data1.columns[obj1], axis=1) #"drop" removes the original column of the categorical variable
print("Head After Removal:\n", data1.head())
Head:
House Price Square Feet City City_Y
0 245 1400 N 0
1 312 1600 Y 1
2 279 1700 N 0
3 308 1875 Y 1
4 199 1100 N 0
Head After Removal:
House Price Square Feet City_Y
0 245 1400 0
1 312 1600 1
2 279 1700 0
3 308 1875 1
4 199 1100 0
STEP 3: SETTING VARIABLES
Separate the dependent & independent variables into two objects.
Scikit-Learn expects X to be a feature matrix (pandas DataFrame) and y to be a response vector
(pandas Series). Let's begin by separating our variables as below.
In [11]: #Declare the dependent variable and create your independent and dependent datasets
#Separate the dependent & independent variables into two objects.
dep='House Price'
X = data1.drop(dep, axis=1)
Y = data1[dep]
In [12]: X
Out[12]:
   Square Feet  City_Y
0         1400       0
1         1600       1
2         1700       0
3         1875       1
4         1100       0
5         1550       0
6         2350       1
7         2450       1
8         1425       1
9         1700       0
In [13]: Y
Out[13]: 0 245
1 312
2 279
3 308
4 199
5 219
6 405
7 324
8 319
9 255
Name: House Price, dtype: int64
Scatter plots can be used to check the relation of the variables
Using visualisation, you should be able to judge which variables have a linear relationship with y.
Start by using Seaborn's pairplot.
Additional parameters to use:
• height= : Allows you to manipulate the size of each panel of the rendered pairplot (this parameter was called size= in Seaborn versions before 0.9)
• kind='reg' : Adds a least-squares line of best fit (the line that minimizes the sum of squared errors) and a 95% confidence band to each scatter panel
In [15]: #Scatter plots

#Scatter plots can be used to check the relation of the variables
sns.pairplot(data1, kind='reg')
Out[15]: <seaborn.axisgrid.PairGrid at 0x24ea275cf98>
STEP 4: SPLITTING OUR DATA INTO TRAIN & TEST DATA
Split the data into train & test sets for building and validating the model.
• We use the model_selection module in the sklearn package to split the data into train & test
Splitting X & y into training and testing sets:
By passing our X and y variables into the train_test_split function, we capture the four pieces of
the split (X_train, X_test, Y_train, Y_test) by assigning 4 variables to the result.
In [17]: from sklearn.model_selection import train_test_split #Split into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=5)
In [18]: print('XTrain:\n', X_train)

print('XTest:\n', X_test)
print('YTrain:\n', Y_train)
print('YTest:\n', Y_test)
XTrain:
Square Feet City_Y
2 1700 0
4 1100 0
7 2450 1
1 1600 1
0 1400 0
8 1425 1
6 2350 1
3 1875 1
XTest:
Square Feet City_Y
9 1700 0
5 1550 0
YTrain:
2 279
4 199
7 324
1 312
0 245
8 319
6 405
3 308
YTest:
9 255
5 219
The "test_size" argument specifies the share of rows held out as test data, in our case 20% (2 of the 10 rows).
"random_state" fixes the seed of the random number generator, so the same rows are selected on
every run, which helps in replicating the results.
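To illustrate what a fixed seed buys, here is a NumPy-only sketch of a shuffled split. It is not scikit-learn's implementation and will not reproduce the exact rows that train_test_split selects. The helper name simple_split is hypothetical. It only shows that the same seed gives the same split, and that the test share is taken from the shuffled row positions.

```python
import numpy as np

def simple_split(n_rows, test_size, seed):
    """Shuffle row positions with a seeded generator and cut off a test share."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    return order[n_test:], order[:n_test]   # (train positions, test positions)

train_a, test_a = simple_split(10, 0.20, seed=5)
train_b, test_b = simple_split(10, 0.20, seed=5)

print(len(train_a), len(test_a))        # 8 training rows, 2 test rows
print(np.array_equal(test_a, test_b))   # same seed, same split
```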
STEP 5: MODEL BUILDING

Use the OLS function in statsmodels.formula.api to build a linear regression model
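In [19]: #Run model
#Ordinary Least-Squares (OLS) Regression Method
#Note: OLS does not add an intercept on its own; X_train has no constant column, so this model is fitted through the origin
lm = sm.OLS(Y_train, X_train).fit()
lm.summary() #To view the OLS regression results, we can call the .summary() method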
C:\Users\Shivani\Anaconda3\lib\site-packages\scipy\stats\stats.py:1390:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n
=8
"anyway, n=%i" % int(n))
Out[19]:
                         OLS Regression Results
Dep. Variable:        House Price   R-squared:                0.979
Model:                        OLS   Adj. R-squared:           0.973
Method:             Least Squares   F-statistic:              143.1
Date:            Thu, 06 Jun 2019   Prob (F-statistic):    8.65e-06
Time:                    15:41:42   Log-Likelihood:         -41.550
No. Observations:               8   AIC:                      87.10
Df Residuals:                   6   BIC:                      87.26
Df Model:                       2
Covariance Type:        nonrobust

                 coef   std err        t    P>|t|    [0.025    0.975]
Square Feet    0.1567     0.019    8.163    0.000     0.110     0.204
City_Y        29.5832    43.518    0.680    0.522   -76.900   136.067

Omnibus:                    6.866   Durbin-Watson:            2.239
Prob(Omnibus):              0.032   Jarque-Bera (JB):         1.994
Skew:                      -1.151   Prob(JB):                 0.369
Kurtosis:                   3.826   Cond. No.              4.38e+03
From our results, we see that:
The slope of Square Feet = 0.1567
The slope of City_Y = 29.5832
The positive slopes imply that both variables are associated with higher House Prices.
The p-value of 0.000 for Square Feet implies that the effect of Square Feet on House Prices is
statistically significant (using p < 0.05 as a rejection rule).
The p-value of 0.522 for City_Y implies that the effect of City on House Prices is statistically
insignificant for the current data set, and hence City_Y should be removed from the model (covered
below).
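The OLS call in In [19] receives X_train with no constant column, so the fitted line is forced through the origin. As a hedged sketch of the difference an intercept makes, the snippet below refits the training rows from Out[18] with NumPy's least-squares solver, once without and once with a column of ones. It uses NumPy rather than statsmodels, and the variable names are illustrative.

```python
import numpy as np

# Training rows as printed in Out[18]: Square Feet, City_Y and House Price.
sqft = np.array([1700, 1100, 2450, 1600, 1400, 1425, 2350, 1875], dtype=float)
city = np.array([0, 0, 1, 1, 0, 1, 1, 1], dtype=float)
y = np.array([279, 199, 324, 312, 245, 319, 405, 308], dtype=float)

# Fit through the origin: the design matrix holds only the two predictors.
X_origin = np.column_stack([sqft, city])
coef_origin, *_ = np.linalg.lstsq(X_origin, y, rcond=None)

# Fit with an intercept: prepend a column of ones.
X_const = np.column_stack([np.ones_like(sqft), sqft, city])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

sse_origin = np.sum((y - X_origin @ coef_origin) ** 2)
sse_const = np.sum((y - X_const @ coef_const) ** 2)

print("through origin (Square Feet, City_Y):", np.round(coef_origin, 4))
print("with intercept (const, Square Feet, City_Y):", np.round(coef_const, 4))
print("SSE through origin vs with intercept:", round(sse_origin, 1), round(sse_const, 1))
```

The through-origin coefficients should agree with the summary table above. The fit with an intercept can only have an equal or smaller sum of squared errors, because it has one more free parameter.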
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
R-Squared
The definition of R-squared is fairly straightforward; it is the percentage of the response variable
variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
For a least-squares model that includes an intercept, R-squared is always between 0 and 100%:
• 0% indicates that the model explains none of the variability of the response data around its
mean.
• 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. Note that the model above
was fitted without an intercept. In that case statsmodels reports an uncentered R-squared (total
variation is measured around zero rather than around the mean), so the 0.979 shown is not
comparable with the R-squared of a model that has an intercept.
Problems with R-Squared
Problem 1: Every time you add a predictor to a model, the R-squared increases (or, at the very
least, stays the same), even if the predictor is related to the response by chance alone. It never
decreases. Consequently, a model with more terms may appear to have a
better fit simply because it has more terms.
Problem 2: If a model has too many predictors and higher order polynomials, it begins to model
the random noise in the data. This condition is known as overfitting the model and it produces
misleadingly high R-squared values and a lessened ability to make predictions.
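Problem 1 can be checked numerically. The sketch below fits the ten rows of this lesson with an intercept, then adds a column of pure random noise as an extra predictor. The helper r_squared and the seed are illustrative choices; the noise column is an assumption made only for the demonstration.

```python
import numpy as np

def r_squared(X, y):
    """Centered R-squared of a least-squares fit; X must already contain a column of ones."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

rng = np.random.default_rng(0)
noise = rng.normal(size=x.size)   # a predictor unrelated to the response

X_base = np.column_stack([np.ones_like(x), x])
X_more = np.column_stack([X_base, noise])

r2_base = r_squared(X_base, y)
r2_more = r_squared(X_more, y)
print(round(r2_base, 4), round(r2_more, 4))
```

The second value is never smaller than the first, even though the added column carries no information about House Price.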
Adjusted R-square
Both adjusted R-squared and predicted R-squared provide information that helps you assess the
number of predictors in your model:
• Use the adjusted R-square to compare models with different numbers of predictors
• Use the predicted R-square to determine how well the model predicts new observations and whether the
model is too complicated
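A small sketch of how adjusted R-squared penalizes extra predictors. The function name and its has_intercept switch are illustrative assumptions. The no-intercept branch follows the degrees-of-freedom convention that appears to match the summaries in this lesson, where the models are fitted without a constant.

```python
def adjusted_r2(r2, n_obs, n_predictors, has_intercept=True):
    """Adjusted R-squared; n_predictors excludes the intercept."""
    k_const = 1 if has_intercept else 0
    df_resid = n_obs - n_predictors - k_const
    return 1.0 - (1.0 - r2) * (n_obs - k_const) / df_resid

# Single-predictor model of this lesson: R-squared 0.978, 8 rows, no intercept.
print(round(adjusted_r2(0.978, 8, 1, has_intercept=False), 3))

# Same R-squared with more predictors gives a lower adjusted value.
print(adjusted_r2(0.60, 30, 2) > adjusted_r2(0.60, 30, 6))
```

Because the input R-squared is already rounded to three decimals, results computed this way can differ from a printed summary in the last digit.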
REMOVING VARIABLE WITH P >0.05

In [20]: data2=data1
data2=data2.drop(['City_Y'], axis=1)
data2
Out[20]:
House Price Square Feet
0 245 1400
1 312 1600
2 279 1700
3 308 1875
4 199 1100
5 219 1550
6 405 2350
7 324 2450
8 319 1425
9 255 1700
In [22]: dep2='House Price'

X2 = data2.drop(dep2, axis=1)
Y2 = data2[dep2]
In [23]: X2
Out[23]:
   Square Feet
0         1400
1         1600
2         1700
3         1875
4         1100
5         1550
6         2350
7         2450
8         1425
9         1700
In [24]: Y2
Out[24]: 0 245
1 312
2 279
3 308
4 199
5 219
6 405
7 324
8 319
9 255
In [25]: from sklearn.model_selection import train_test_split #Split into train and test
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y2, test_size=0.20, random_state=5)
In [29]: #Run model
#Ordinary Least-Squares (OLS) Regression Method
lm2=sm.OLS(Y2_train, X2_train).fit()
lm2.summary() #To view the OLS regression results, we can call the .summary() method
C:\Users\Shivani\Anaconda3\lib\site-packages\scipy\stats\stats.py:1390:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n
=8
"anyway, n=%i" % int(n))
Out[29]:
                         OLS Regression Results
Dep. Variable:        House Price   R-squared:                0.978
Model:                        OLS   Adj. R-squared:           0.975
Method:             Least Squares   F-statistic:              309.6
Date:            Thu, 06 Jun 2019   Prob (F-statistic):    4.72e-07
Time:                    15:44:41   Log-Likelihood:         -41.847
No. Observations:               8   AIC:                      85.69
Df Residuals:                   7   BIC:                      85.77
Df Model:                       1
Covariance Type:        nonrobust

                 coef   std err        t    P>|t|    [0.025    0.975]
Square Feet    0.1679     0.010   17.596    0.000     0.145     0.190

Omnibus:                    3.100   Durbin-Watson:            2.378
Prob(Omnibus):              0.212   Jarque-Bera (JB):         0.522
Skew:                      -0.587   Prob(JB):                 0.770
Kurtosis:                   3.433   Cond. No.                  1.00
In [30]: lm2.predict(X2_test)
Out[30]: 9 285.394541
5 260.212670
dtype: float64
In [31]: #to check residual behaviour

pred_train = lm2.predict(X2_train)
err_train = pred_train - Y2_train
print(pred_train)
print(err_train)
2 285.394541
4 184.667056
7 411.303897
1 268.606627
0 235.030798
8 239.227777
6 394.515983
3 314.773391
dtype: float64
2 6.394541
4 -14.332944
7 87.303897
1 -43.393373
0 -9.969202
8 -79.772223
6 -10.484017
3 6.773391
dtype: float64
In [32]: #Predict
pred_test = lm2.predict(X2_test)
err_test = pred_test - Y2_test
print(pred_test)
print(err_test)
9 285.394541
5 260.212670
dtype: float64
9 30.394541
5 41.212670
dtype: float64
In [33]: #Actual vs predicted plot

plt.scatter(Y2_train, pred_train)
plt.xlabel('Y')
plt.ylabel('Pred')
plt.title('Main')
Out[33]: Text(0.5,1,'Main')
In [34]: #Root Mean sq error
#Square each error first, then average, then take the square root
rmse = np.sqrt(np.mean(err_test**2))
print(round(rmse, 2))
36.21
In [36]: #MAPE - Mean absolute percentage error (computed here on the training data)
#It is expressed in percentage terms, so it does not depend on the units of the response
mape=np.mean(np.abs((Y2_train - pred_train) / Y2_train)) * 100
mape
Out[36]: 10.526505212851921
Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: R-
squared, the overall F-test, and the Root Mean Square Error (RMSE). All three are based on two
sums of squares: Sum of Squares Total (SST) and Sum of Squares Error (SSE). SST measures
how far the data are from the mean, and SSE measures how far the data are from the model’s
predicted values. Different combinations of these two values provide different information about
how the regression model compares to the mean model.
R-squared and Adjusted R-squared
The difference between SST and SSE is the improvement in prediction from the regression
model, compared to the mean model. Dividing that difference by SST gives R-squared. It is the
proportional improvement in prediction from the regression model, compared to the mean model.
It indicates the goodness of fit of the model.
R-squared has the useful property that its scale is intuitive: it ranges from zero to one, with zero
indicating that the proposed model does not improve prediction over the mean model, and one
indicating perfect prediction. Improvement in the regression model results in proportional
increases in R-squared.
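The SST and SSE definitions above can be checked against the single-predictor model of this lesson. The sketch below uses the training responses from Out[18] and the fitted values printed by In [31]. Because that model was fitted without an intercept, the R-squared of 0.978 in its summary corresponds to the uncentered total sum of squares (measured around zero), so both versions are computed. The variable names are illustrative.

```python
import numpy as np

# Training responses (Out[18]) and the fitted values printed by In [31].
y = np.array([279, 199, 324, 312, 245, 319, 405, 308], dtype=float)
pred = np.array([285.394541, 184.667056, 411.303897, 268.606627,
                 235.030798, 239.227777, 394.515983, 314.773391])

sse = np.sum((y - pred) ** 2)                # how far the data are from the model
sst_centered = np.sum((y - y.mean()) ** 2)   # how far the data are from the mean model
sst_uncentered = np.sum(y ** 2)              # how far the data are from zero

r2_centered = 1.0 - sse / sst_centered
r2_uncentered = 1.0 - sse / sst_uncentered
print(round(r2_centered, 3), round(r2_uncentered, 3))
```

The centered value is the one that compares the regression with the mean model, as described above. It comes out far lower than the uncentered value for this through-origin fit.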
RMSE
The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the
model to the data–how close the observed data points are to the model’s predicted values.
Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. As the
square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained
variance, and has the useful property of being in the same units as the response variable. Lower
values of RMSE indicate better fit. RMSE is a good measure of how accurately the model
predicts the response, and it is the most important criterion for fit if the main purpose of the
model is prediction.
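A minimal sketch of both error metrics as reusable functions, applied to the two test rows from Out[32]. The function names are illustrative. Note the order of operations in RMSE: the errors are squared before they are averaged.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: square each error, average, then take the root."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# Test rows: actual House Price values and the predictions from Out[32].
y_test = [255, 219]
pred_test = [285.394541, 260.212670]

print(round(rmse(y_test, pred_test), 2))
print(round(mape(y_test, pred_test), 2))
```

With only two test rows these numbers are very noisy, so they are best read as an illustration of the calculation rather than as a reliable estimate of model accuracy.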
The best measure of model fit depends on the researcher’s objectives, and more than one are
often useful. The statistics discussed above are applicable to regression models that use OLS
estimation.
In [37]: #Residual plots

plt.scatter(pred_train, err_train, c="b", s=40, alpha=0.5)
plt.scatter(pred_test,err_test, c="g", s=40)
plt.hlines(y=0, xmin=0, xmax=500)
plt.title('Residual plot - Train(blue), Test(green)')
plt.ylabel('Residuals')
Out[37]: Text(0,0.5,'Residuals')
In [38]: print(lm2.predict())
[285.39454094 184.6670559 411.30389724 268.60662677 235.03079842
 239.22777697 394.51598307 314.77339075]
In [40]: #Homoscedasticity
resid = lm2.resid
plt.scatter(lm2.predict(), resid)
Out[40]: <matplotlib.collections.PathCollection at 0x24ea391b390>
