
LINEAR REGRESSION

Linear regression is a predictive modeling technique for predicting a numeric response variable
based on one or more explanatory variables. The term "regression" in predictive modeling
generally refers to any modeling task that involves predicting a real number (as opposed to
classification, which involves predicting a category or class). The term "linear" in the name
refers to the fact that the method models data with a linear combination of the
explanatory variables. A linear combination is an expression where one or more variables are
scaled by a constant factor and added together. In the case of linear regression with a single
explanatory variable, the linear combination can be expressed as:

response = intercept + constant * explanatory

The right side of the equation defines a line with a certain y-intercept and a slope (the constant)
that multiplies the explanatory variable. In other words, linear regression in its most basic form
fits a straight line to the response variable. The model is designed to fit the line that minimizes
the sum of squared differences between the observed and predicted responses (these differences
are also called errors or residuals). We won't go into all the math behind how the model actually
minimizes the squared errors, but the end result is a line intended to give the "best fit" to the
data. Since linear regression fits data with a line, it is most effective in cases where the response
and explanatory variable have a linear relationship.
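
As a minimal sketch of what "minimizing the squared differences" means for a single explanatory variable, the closed-form least-squares estimates can be computed directly with NumPy. The function name fit_line is illustrative (it is not part of any library), and the numbers are the ten Square Feet / House Price rows of the data set used in this tutorial.

```python
import numpy as np

# Square Feet and House Price values of the tutorial's ten-row data set
square_feet = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
house_price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The least-squares line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_line(square_feet, house_price)
residuals = house_price - (intercept + slope * square_feet)
sse = np.sum(residuals ** 2)  # the quantity that least squares minimizes

print("intercept:", intercept)
print("slope:", slope)
print("sum of squared errors:", sse)
```

Any other choice of intercept and slope gives a larger sum of squared errors on these points, which is the sense in which the fitted line is the "best fit".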

First, let's load some libraries and the data, and then look at scatterplots to get a sense of the
shape of the data.

Load the following packages:

• NumPy – For numeric functions in Python

• Pandas – Package for data preparation

• Matplotlib.pyplot – For plots

• Sklearn – Scikit-learn has several functions for machine learning algorithms

• Seaborn – For plots

• Statsmodels – Provides the OLS linear regression function

Use Linear Regression to predict House Price based on Square Feet

In [1]: #importing packages
import numpy as np #For numeric functions in Python
import pandas as pd #Package to work with data preparation
import matplotlib.pyplot as plt #For plots
%matplotlib inline
import sklearn as sk #Scikit-learn has several functions for machine learning algorithms
import seaborn as sns #For plots
import statsmodels.api as sm #Provides the OLS class that is called below as sm.OLS

In [2]: #Load data


data = pd.read_csv("LR_11.csv", header=0)

OVERVIEW OF THE DATA


In [3]: data

Out[3]:

House Price Square Feet City

0 245 1400 N

1 312 1600 Y

2 279 1700 N

3 308 1875 Y

4 199 1100 N

5 219 1550 N

6 405 2350 Y

7 324 2450 Y

8 319 1425 Y

9 255 1700 N

In [4]: print("Data shape : ",data.shape)
print(data.head())
print(data.dtypes)

Data shape : (10, 3)


House Price Square Feet City
0 245 1400 N
1 312 1600 Y
2 279 1700 N
3 308 1875 Y
4 199 1100 N
House Price int64
Square Feet int64
City object
dtype: object

STEP 1: Get all categorical variables and create dummies

In [5]: obj = data.dtypes == object #True for columns that hold strings (the categorical variables)
print(obj)

House Price False


Square Feet False
City True
dtype: bool

In [6]: data.dtypes #"dtypes" gets the data type of each column; we use it to select the columns to pass
#to the "get_dummies" function, which converts the categories to dummy columns.

Out[6]: House Price int64


Square Feet int64
City object
dtype: object

DUMMY VARIABLE: If we need to use categorical variables in our modelling, we need to convert
them into DUMMY variables.

We look at the unique values in a categorical variable. When the data is huge, count the unique
values (for example with the pandas nunique() method) instead of inspecting them by eye.

If a categorical variable has n unique values, we introduce (n-1) dummy variables.

Note: CITY is a categorical variable with 2 unique values, and hence one DUMMY VARIABLE will
be created.
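
To illustrate the (n-1) rule on something larger than a two-level variable, here is a small sketch with a hypothetical three-level column named Zone (it is not part of the tutorial's data). It counts the unique values with nunique() and creates the dummies with get_dummies, passing dtype=int so that the columns hold 0/1 integers.

```python
import pandas as pd

# Hypothetical three-level categorical column, used only for illustration
df = pd.DataFrame({"Zone": ["North", "South", "East", "South", "North"]})

# Number of unique categories (n)
n_levels = df["Zone"].nunique()

# drop_first=True drops the first category (alphabetically "East"),
# leaving n - 1 dummy columns; a row of all zeros then means "East"
dummies = pd.get_dummies(df["Zone"], prefix="Zone", drop_first=True, dtype=int)

print("unique values:", n_levels)
print(dummies)
```

The dropped category acts as the baseline: each remaining dummy coefficient is then read as a difference relative to it.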

In [7]: data.columns[obj]

Out[7]: Index(['City'], dtype='object')

In [8]: dummydf = pd.DataFrame()

for i in data.columns[obj]:
    #"drop_first" drops the first category in order to avoid the multicollinearity problem
    #"prefix" adds a prefix to all the dummy variables created for a particular categorical variable
    #"dtype=int" keeps the dummies as 0/1 integers (newer pandas versions default to True/False)
    dummy = pd.get_dummies(data[i], prefix=i, drop_first=True, dtype=int)
    #"pd.concat" combines all the dummy columns for all the categorical variables
    dummydf = pd.concat([dummydf, dummy], axis=1) # Concatenating columns

print(dummydf)

City_Y
0 0
1 1
2 0
3 1
4 0
5 0
6 1
7 1
8 1
9 0

STEP 2: Merge the dummy data with the original data

In [10]: data1=data
data1=pd.concat([data1,dummydf], axis=1)
print("Head:\n", data1.head())

obj1=data1.dtypes==object
data1=data1.drop(data1.columns[obj1], axis=1) #the "drop" function is used to drop the original column of the categorical variable
print("Head After Removal:\n", data1.head())

Head:

House Price Square Feet City City_Y
0 245 1400 N 0
1 312 1600 Y 1
2 279 1700 N 0
3 308 1875 Y 1
4 199 1100 N 0
Head After Removal:
House Price Square Feet City_Y
0 245 1400 0
1 312 1600 1
2 279 1700 0
3 308 1875 1
4 199 1100 0

STEP 3: SETTING VARIABLES

Separate the dependent & independent variables into two objects.

Scikit-Learn expects X to be a feature matrix (Pandas DataFrame) and y to be a response vector
(Pandas Series). Let's begin by separating our variables as below.

In [11]: #Declare the dependent variable and create your independent and dependent datasets
#Separate the dependent & independent variables into two objects.
dep='House Price'
X = data1.drop(dep, axis=1)
Y = data1[dep]

In [12]: X

Out[12]:
Square Feet City_Y

0 1400 0


1 1600 1

2 1700 0

3 1875 1

4 1100 0

5 1550 0

6 2350 1

7 2450 1

8 1425 1

9 1700 0

In [13]: Y

Out[13]: 0 245
1 312
2 279
3 308
4 199
5 219
6 405
7 324
8 319
9 255
Name: House Price, dtype: int64

Scatter plots can be used to check the relation of the variables

Using visualisation, you should be able to judge which variables have a linear relationship with y.
Start by using Seaborn’s pairplot.

Additional parameters to use:

height= : Allows you to manipulate the size of each panel in the rendered pairplot (this
parameter was called size= in seaborn versions before 0.9)

kind='reg' : Adds a least-squares line of best fit, with a 95% confidence band, to each scatter
plot.

In [15]: #Scatter plots can be used to check the relation of the variables
sns.pairplot(data1, kind='reg')

Out[15]: <seaborn.axisgrid.PairGrid at 0x24ea275cf98>

STEP 4: SPLITTING OUR DATA INTO TRAIN & TEST DATA

Split the data into train & test for model building and validating the model

• We use the model_selection module in the sklearn package to split the data into train & test

Splitting X & y into training and testing sets:

By passing our X and y variables into the train_test_split method, we are able to capture the
splits in data by assigning 4 variables to the result.

In [17]: from sklearn.model_selection import train_test_split #Split into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=5)

In [18]: print('XTrain:\n', X_train)
print('XTest:\n', X_test)
print('YTrain:\n', Y_train)
print('YTest:\n', Y_test)

XTrain:
Square Feet City_Y
2 1700 0
4 1100 0
7 2450 1
1 1600 1
0 1400 0
8 1425 1
6 2350 1
3 1875 1
XTest:
Square Feet City_Y
9 1700 0
5 1550 0
YTrain:
2 279
4 199
7 324
1 312
0 245
8 319

6 405
3 308
Name: House Price, dtype: int64
YTest:
9 255
5 219
Name: House Price, dtype: int64

The "test_size" parameter specifies the size of the test data, in our case 20% (2 of the 10 rows).

The "random_state" parameter fixes the seed of the random number generator, so the same split is
produced on every run, which helps in replicating the results.
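
The effect of "random_state" can be checked with a small sketch: two calls with the same seed should return identical splits. The arrays X_demo and y_demo are made-up placeholders, not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: ten rows, one feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# Two independent calls with the same test_size and random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=5)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=5)

X_train_a, X_test_a, y_train_a, y_test_a = split_a

# 20% of ten rows goes to the test set, the rest to the training set
print("train rows:", len(X_train_a), "test rows:", len(X_test_a))

# Every returned piece is identical between the two calls
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

Leaving random_state unset would typically give a different split on each run, which makes results harder to reproduce.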

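STEP 5: MODEL BUILDING

Use the OLS class in statsmodels to build a linear regression model. Note that OLS does not add
an intercept term automatically, so the model below is fitted without one (the regression line is
forced through the origin).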

In [19]: #Run model
#Ordinary Least-Squares (OLS) Regression Method
lm = sm.OLS(Y_train, X_train).fit()
lm.summary() #To view the OLS regression results, we can call the .summary() method

C:\Users\Shivani\Anaconda3\lib\site-packages\scipy\stats\stats.py:1390:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n
=8
"anyway, n=%i" % int(n))

Out[19]:
OLS Regression Results

Dep. Variable: House Price R-squared: 0.979

Model: OLS Adj. R-squared: 0.973

Method: Least Squares F-statistic: 143.1

Date: Thu, 06 Jun 2019 Prob (F-statistic): 8.65e-06

Time: 15:41:42 Log-Likelihood: -41.550

No. Observations: 8 AIC: 87.10

Df Residuals: 6 BIC: 87.26

Df Model: 2

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Square Feet 0.1567 0.019 8.163 0.000 0.110 0.204

City_Y 29.5832 43.518 0.680 0.522 -76.900 136.067

Omnibus: 6.866 Durbin-Watson: 2.239

Prob(Omnibus): 0.032 Jarque-Bera (JB): 1.994

Skew: -1.151 Prob(JB): 0.369

Kurtosis: 3.826 Cond. No. 4.38e+03

From our results, we see that:

The slope of Square Feet = 0.1567
The slope of City_Y = 29.5832

The positive slopes imply that both variables have a positive effect on House Prices.

The p-value of 0.000 for Square Feet implies that the effect of Square Feet on House Prices is
statistically significant (using p < 0.05 as a rejection rule).

The p-value of 0.522 for City_Y implies that the effect of City on House Prices is not statistically
significant for the current data set, and hence the variable should be removed from the model
(covered below).
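
The coefficient table above has no intercept row because the design matrix passed to OLS contains only Square Feet and City_Y. The sketch below uses plain NumPy least squares (not statsmodels) on the eight training rows to show the difference between fitting without and with a column of ones; in statsmodels the usual way to add that column is sm.add_constant, which this tutorial does not use. Variable names are illustrative.

```python
import numpy as np

# Training rows from the split above: Square Feet, City_Y, House Price
sqft = np.array([1700, 1100, 2450, 1600, 1400, 1425, 2350, 1875], dtype=float)
city_y = np.array([0, 0, 1, 1, 0, 1, 1, 1], dtype=float)
price = np.array([279, 199, 324, 312, 245, 319, 405, 308], dtype=float)

# Design matrix without an intercept column (what the model above was given)
X_no_const = np.column_stack([sqft, city_y])
coef_no_const, *_ = np.linalg.lstsq(X_no_const, price, rcond=None)

# Design matrix with a leading column of ones, so an intercept is estimated
X_const = np.column_stack([np.ones_like(sqft), sqft, city_y])
coef_const, *_ = np.linalg.lstsq(X_const, price, rcond=None)

# Sum of squared errors of each fit on the training rows
sse_no_const = np.sum((price - X_no_const @ coef_no_const) ** 2)
sse_const = np.sum((price - X_const @ coef_const) ** 2)

print("without intercept [Square Feet, City_Y]:", coef_no_const)
print("with intercept [const, Square Feet, City_Y]:", coef_const)
print("SSE without / with intercept:", sse_no_const, sse_const)
```

The fit without an intercept should reproduce the two slopes reported in the summary. Whether an intercept belongs in the model is a modelling decision; with one included, the slopes and the R-squared reported by the summary would generally change.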

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.

R-Squared
The definition of R-squared is fairly straightforward; it is the percentage of the response variable
variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

For a model that includes an intercept, R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its
mean.

100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.

Problems with R-Squared


Problem 1: Every time you add a predictor to a model, the R-squared increases, even if due to
chance alone. It never decreases. Consequently, a model with more terms may appear to have a
better fit simply because it has more terms.

Problem 2: If a model has too many predictors and higher order polynomials, it begins to model
the random noise in the data. This condition is known as overfitting the model and it produces
misleadingly high R-squared values and a lessened ability to make predictions.

Adjusted R-square
Both adjusted R-squared and predicted R-squared provide information that helps you assess the
number of predictors in your model:

Use the adjusted R-square to compare models with different numbers of predictors.

Use the predicted R-square to determine how well the model predicts new observations and
whether the model is too complicated.
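
As a rough sketch of how the adjustment penalizes extra predictors, the standard formula for a model with an intercept is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the example numbers below are illustrative; because the models in this tutorial are fitted without an intercept, this formula may not reproduce the Adj. R-squared values in the summaries exactly.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors predictors."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Same R-squared and sample size, different numbers of predictors (made-up values)
adj_small_model = adjusted_r_squared(0.80, n_obs=20, n_predictors=2)
adj_large_model = adjusted_r_squared(0.80, n_obs=20, n_predictors=6)

print("2 predictors:", adj_small_model)
print("6 predictors:", adj_large_model)
```

With the same R-squared, the model that needs more predictors gets the lower adjusted value, which is why the adjusted statistic is the one to compare across models of different size.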

REMOVING VARIABLE WITH P >0.05


In [20]: data2=data1
data2=data2.drop(['City_Y'], axis=1)
data2

Out[20]:
House Price Square Feet

0 245 1400

1 312 1600

2 279 1700

3 308 1875

4 199 1100

5 219 1550

6 405 2350

7 324 2450

8 319 1425

9 255 1700

In [22]: dep2='House Price'
X2 = data2.drop(dep2, axis=1)
Y2 = data2[dep2]

In [23]: X2

Out[23]:
Square Feet


0 1400

1 1600

2 1700

3 1875

4 1100

5 1550

6 2350

7 2450

8 1425

9 1700

In [24]: Y2

Out[24]: 0 245
1 312
2 279
3 308
4 199
5 219
6 405
7 324
8 319
9 255
Name: House Price, dtype: int64

In [25]: from sklearn.model_selection import train_test_split #Split into train and test
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y2, test_size=0.20, random_state=5)

In [29]: #Run model
#Ordinary Least-Squares (OLS) Regression Method
lm2=sm.OLS(Y2_train, X2_train).fit()
lm2.summary() #To view the OLS regression results, we can call the .summary() method

C:\Users\Shivani\Anaconda3\lib\site-packages\scipy\stats\stats.py:1390:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n
=8
"anyway, n=%i" % int(n))
Out[29]:
OLS Regression Results

Dep. Variable: House Price R-squared: 0.978

Model: OLS Adj. R-squared: 0.975

Method: Least Squares F-statistic: 309.6

Date: Thu, 06 Jun 2019 Prob (F-statistic): 4.72e-07

Time: 15:44:41 Log-Likelihood: -41.847

No. Observations: 8 AIC: 85.69

Df Residuals: 7 BIC: 85.77

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Square Feet 0.1679 0.010 17.596 0.000 0.145 0.190

Omnibus: 3.100 Durbin-Watson: 2.378

Prob(Omnibus): 0.212 Jarque-Bera (JB): 0.522

Skew: -0.587 Prob(JB): 0.770

Kurtosis: 3.433 Cond. No. 1.00

In [30]: lm2.predict(X2_test)

Out[30]: 9 285.394541
5 260.212670
dtype: float64

In [31]: #to check residual behaviour (errors are computed here as predicted minus actual)
pred_train = lm2.predict(X2_train)
err_train = pred_train - Y2_train
print(pred_train)
print(err_train)

2 285.394541
4 184.667056
7 411.303897
1 268.606627
0 235.030798
8 239.227777
6 394.515983
3 314.773391
dtype: float64
2 6.394541
4 -14.332944
7 87.303897
1 -43.393373
0 -9.969202
8 -79.772223
6 -10.484017
3 6.773391
dtype: float64

In [32]: #Predict
pred_test = lm2.predict(X2_test)
err_test = pred_test - Y2_test
print(pred_test)
print(err_test)

9 285.394541

5 260.212670
dtype: float64
9 30.394541
5 41.212670
dtype: float64

In [33]: #Actual vs predicted plot


plt.scatter(Y2_train, pred_train)
plt.xlabel('Y')
plt.ylabel('Pred')
plt.title('Main')

Out[33]: Text(0.5,1,'Main')

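In [34]: #Root mean squared error: square each error, average the squares, then take the square root
#Note: np.sqrt(np.mean(err_test)**2) squares the mean instead and returns the absolute mean
#error, which is the 35.80 shown in Out[34]; the RMSE of these two test errors is about 36.21
rmse = np.sqrt(np.mean(err_test**2))
rmse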

Out[34]: 35.8036053130929

In [36]: #MAPE - Mean absolute percentage error
#this is in percentage terms and hence a better metric for model performance
mape=np.mean(np.abs((Y2_train - pred_train) / Y2_train)) * 100
mape

Out[36]: 10.526505212851921
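
The cell above measures MAPE on the training rows. If the goal is to judge predictions on unseen data, the same calculation can be applied to the test rows; the sketch below wraps it in a small helper (the name mape_percent is illustrative) and assumes that no actual value is zero, since the formula divides by the actual values.

```python
import numpy as np


def mape_percent(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100


# Actual test prices (rows 9 and 5) and the predictions printed in Out[32]
test_actual = [255, 219]
test_predicted = [285.394541, 260.212670]

print("test MAPE (%):", mape_percent(test_actual, test_predicted))
```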

Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: R-
squared, the overall F-test, and the Root Mean Square Error (RMSE). All three are based on two
sums of squares: Sum of Squares Total (SST) and Sum of Squares Error (SSE). SST measures
how far the data are from the mean, and SSE measures how far the data are from the model’s
predicted values. Different combinations of these two values provide different information about
how the regression model compares to the mean model.

R-squared and Adjusted R-squared

The difference between SST and SSE is the improvement in prediction from the regression
model, compared to the mean model. Dividing that difference by SST gives R-squared. It is the
proportional improvement in prediction from the regression model, compared to the mean model.
It indicates the goodness of fit of the model.

R-squared has the useful property that its scale is intuitive: it ranges from zero to one, with zero
indicating that the proposed model does not improve prediction over the mean model, and one
indicating perfect prediction. Improvement in the regression model results in proportional
increases in R-squared.
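
The relationship between SST, SSE and R-squared can be sketched numerically. The example below fits a straight line with an intercept to the tutorial's ten rows using np.polyfit (a different fitting route from the statsmodels cells above, chosen only to keep the sketch short) and then computes R-squared as (SST - SSE) / SST.

```python
import numpy as np

# Square Feet and House Price values of the tutorial's ten-row data set
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

# Least-squares line with an intercept
slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x

# SST: squared distance of the data from the mean model
sst = np.sum((y - y.mean()) ** 2)
# SSE: squared distance of the data from the regression model's predictions
sse = np.sum((y - predicted) ** 2)

# Proportional improvement of the regression model over the mean model
r_squared = (sst - sse) / sst

print("SST:", sst)
print("SSE:", sse)
print("R-squared:", r_squared)
```

For a single predictor with an intercept, this value equals the squared correlation between x and y, which gives a quick cross-check.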

RMSE

The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the
model to the data–how close the observed data points are to the model’s predicted values.
Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. As the
square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained
variance, and has the useful property of being in the same units as the response variable. Lower
values of RMSE indicate better fit. RMSE is a good measure of how accurately the model
predicts the response, and it is the most important criterion for fit if the main purpose of the
model is prediction.
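
A minimal sketch of the RMSE calculation follows, using the two test errors printed in Out[32]. The order of operations matters: the errors are squared first, then averaged, then the square root is taken. Squaring the mean instead would only return the absolute mean error. The helper name rmse_of is illustrative.

```python
import numpy as np


def rmse_of(errors):
    """Root mean squared error: sqrt of the mean of the squared errors."""
    errors = np.asarray(errors, dtype=float)
    return np.sqrt(np.mean(errors ** 2))


# Test errors (predicted minus actual) printed in Out[32]
test_errors = [30.394541, 41.212670]

rmse_value = rmse_of(test_errors)
abs_mean_error = abs(np.mean(test_errors))  # not the RMSE, shown for comparison

print("RMSE:", rmse_value)
print("absolute mean error:", abs_mean_error)
```

The RMSE is never smaller than the absolute mean error, and the gap grows as the individual errors become more unequal.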

The best measure of model fit depends on the researcher's objectives, and more than one is
often useful. The statistics discussed above are applicable to regression models that use OLS
estimation.

In [37]: #Residual plots
plt.scatter(pred_train, err_train, c="b", s=40, alpha=0.5)
plt.scatter(pred_test,err_test, c="g", s=40)
plt.hlines(y=0, xmin=0, xmax=500)
plt.title('Residual plot - Train(blue), Test(green)')
plt.ylabel('Residuals')

Out[37]: Text(0,0.5,'Residuals')

In [38]: print(lm2.predict())

[285.39454094 184.6670559 411.30389724 268.60662677 235.03079842
 239.22777697 394.51598307 314.77339075]

In [40]: #Homoscedasticity
resid = lm2.resid
plt.scatter(lm2.predict(), resid)

Out[40]: <matplotlib.collections.PathCollection at 0x24ea391b390>
