
linear-regression

March 24, 2024

0.1 Importing the required libraries and modules


[1]: import pandas as pd
from sklearn.model_selection import train_test_split #For splitting the data into train & test sets
from sklearn.linear_model import LinearRegression #Linear regression model
from sklearn.metrics import mean_squared_error #Metric for regression: MSE
import statsmodels.api as sm #Regression model summary
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
import scipy as sp
import statsmodels.tsa.api as smt

[22]: #loading the dataset
data=pd.read_csv("C:/Users/ramaleer/Desktop/Practical 2/Datasets/Boston.CSV")

#get the first 5 rows; for the last 5 rows use data.tail()
data.head()

[22]:       crim    zn  indus    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
      0  0.00632  18.0   2.31  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
      1  0.02731   0.0   7.07  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
      2  0.02729   0.0   7.07  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
      3  0.03237   0.0   2.18  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
      4  0.06905   0.0   2.18  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2

Variable Description
• CRIM: Per capita crime rate by town
• ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS: Proportion of non-retail business acres per town.
• NOX: Nitric oxides concentration (parts per 10 million)
• RM: Average number of rooms per dwelling
• AGE: Proportion of owner-occupied units built prior to 1940
• DIS: Weighted distances to five Boston employment centers
• RAD: Index of accessibility to radial highways
• TAX: Full-value property-tax rate per 10,000 dollars
• PTRATIO: Pupil-teacher ratio by town
• BLACK: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT: Percentage of lower-status population
• MEDV: Median value of owner-occupied homes in 1,000's of dollars
The target variable is medv, which is the house price.
[3]: data.shape

[3]: (506, 13)

[4]: #get the column and row count

print("Columns:",data.shape[1])
print("Rows:",data.shape[0])

Columns: 13
Rows: 506

1 Separating the independent data matrix & response vector


[35]: #define x, y
x = data.drop(columns = ['medv'], axis=1) #independent variables
y = data.medv #target variable

[ ]:

2 Splitting data into training & testing sets (validation set approach)
[6]: # the test set is 20% of the data; the remaining 80% is used for training
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

3 Creating a linear regression model object
[7]: model=LinearRegression()

4 Training the model with training data


[8]: model.fit(x_train,y_train)

[8]: LinearRegression()

5 Estimated model parameters for independent variables


[9]: #coefficients of the independent variables
model.coef_

[9]: array([-1.25156811e-01,  4.48575121e-02,  2.01265385e-02, -1.62895522e+01,
             3.77366409e+00, -2.13185120e-03, -1.40611846e+00,  2.66835482e-01,
            -1.21692284e-02, -1.07660161e+00,  8.72533635e-03, -4.94425348e-01])

6 Intercept of the model


[10]: # beta-0 value
model.intercept_

[10]: 38.42706211257645

7 R-Squared value for the trained model


[11]: model.score(x_train,y_train)*100

[11]: 76.90953567794605

8 Predicting the response for the unseen test data


[12]: y_pred=model.predict(x_test)

9 Mean Squared Error for the testing data
[13]: MSE=mean_squared_error(y_test,y_pred)
MSE

[13]: 34.21225254753325

10 Root Mean Squared Error


[14]: np.sqrt(MSE)

[14]: 5.84912408378667

11 Regression Model Summary using Statsmodels


[15]: #loading the dataset
data=pd.read_csv("C:/Users/ramaleer/Desktop/Practical 2/Datasets/advertising.CSV")

#get the first 5 rows; for the last 5 rows use data.tail()
data.head()

[15]:       TV  Radio  Newspaper  Sales
      0  230.1   37.8       69.2   22.1
      1   44.5   39.3       45.1   10.4
      2   17.2   45.9       69.3   12.0
      3  151.5   41.3       58.5   16.5
      4  180.8   10.8       58.4   17.9

[16]: #define x, y
x = data.drop(columns = ['Sales'], axis=1) #independent variables
y = data.Sales #target variable

[17]: #add constant to predictor variables


x = sm.add_constant(x)

#fit linear regression model


modelreg = sm.OLS(y, x).fit()

#view model summary


print(modelreg.summary())

data.shape

OLS Regression Results

==============================================================================
Dep. Variable: Sales R-squared: 0.903
Model: OLS Adj. R-squared: 0.901
Method: Least Squares F-statistic: 605.4
Date: Sun, 21 Jan 2024 Prob (F-statistic): 8.13e-99
Time: 15:17:22 Log-Likelihood: -383.34
No. Observations: 200 AIC: 774.7
Df Residuals: 196 BIC: 787.9
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.6251 0.308 15.041 0.000 4.019 5.232
TV 0.0544 0.001 39.592 0.000 0.052 0.057
Radio 0.1070 0.008 12.604 0.000 0.090 0.124
Newspaper 0.0003 0.006 0.058 0.954 -0.011 0.012
==============================================================================
Omnibus: 16.081 Durbin-Watson: 2.251
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.655
Skew: -0.431 Prob(JB): 9.88e-07
Kurtosis: 4.605 Cond. No. 454.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.

[17]: (200, 4)

12 Linear Regression - Assumption Checking


1. Linearity
2. Homoscedasticity
3. Multivariate Normality
4. Independence
5. Lack of Multicollinearity

12.1 1. Linearity
[23]: fig,(ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(5, 12))

ax1.scatter(data['nox'],data['medv'])
ax1.set_title('nox-Nitric oxides concentration - parts per 10 million')
ax2.scatter(data['rm'],data['medv'])
ax2.set_title('rm-Average number of rooms per dwelling')
ax3.scatter(data['age'],data['medv'])
ax3.set_title('age-Proportion of owner-occupied units built prior to 1940')

plt.show()

12.2 2. Homoscedasticity
[24]: # Calculating residuals on the test set
residuals = y_test - y_pred

# Calculating standardized residuals
standardized_residuals = residuals / np.std(residuals)

[25]: # Checking for equal variance (homoscedasticity)
# We can check this with a scatter plot, where the x-axis has the predictions
# and the y-axis has the standardized residuals

plt.scatter(y_pred, standardized_residuals)
plt.xlabel("Predicted values")
plt.ylabel("Standardized Residuals")
plt.title("Residuals vs Fitted Values")
plt.axhline(y=0, color='r', linestyle='--') # horizontal reference line at y=0

plt.show()

12.3 3. Multivariate Normality
Normality of Residuals

12.4 a. Using a Histogram


[26]: # Fit a normal distribution to the residuals: estimate the mean and standard deviation
mu, std = norm.fit(residuals)

# Plot the histogram of the residuals
plt.hist(residuals, bins=30, density=True, alpha=0.6, color='b')

# Overlay the fitted normal PDF
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)

plt.plot(x, p, 'k', linewidth=2)

plt.show()

12.5 b. QQ Plot - Quantile-Quantile Plot


[27]: fig, ax =plt.subplots(figsize=(6,4))
sp.stats.probplot(residuals,plot=ax,fit=True)

plt.show()

[28]: # Homework: apply the Kolmogorov-Smirnov and Shapiro-Wilk tests to check the normality of the residuals
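A minimal sketch of what this homework might look like, assuming the residuals series and the mu, std estimates from norm.fit above; both tests take normality of the residuals as the null hypothesis.

[ ]: from scipy import stats

# Shapiro-Wilk test (H0: the residuals are normally distributed)
sw_stat, sw_p = stats.shapiro(residuals)
print("Shapiro-Wilk:", sw_stat, sw_p)

# Kolmogorov-Smirnov test against a normal distribution with the mean and
# standard deviation estimated from the residuals (mu, std from norm.fit above).
# Strictly, estimating the parameters from the data calls for the Lilliefors variant.
ks_stat, ks_p = stats.kstest(residuals, 'norm', args=(mu, std))
print("Kolmogorov-Smirnov:", ks_stat, ks_p)

# A p-value below 0.05 suggests a departure from normality.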

12.6 4. Independence of observations - no autocorrelation of errors (residuals)


12.7 ACF plot and Durbin-Watson test
[29]: # graphical check: autocorrelation function (ACF) plot of the residuals
acf = smt.graphics.plot_acf(residuals, lags=40, alpha=0.05)
plt.show()


[30]: #perform Durbin-Watson test

durbin_watson(residuals)

[30]: 2.0260473760330022

Some notes on the Durbin-Watson test:
• the test statistic always has a value between 0 and 4
• a value of 2 means that there is no autocorrelation in the sample
• values < 2 indicate positive autocorrelation, values > 2 negative autocorrelation
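As a quick sanity check, the statistic can also be computed directly from its definition, DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); the minimal sketch below, using the residuals series from above, should reproduce the value returned by durbin_watson.

[ ]: # Durbin-Watson statistic from its definition:
# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
e = np.asarray(residuals)
dw_manual = np.sum(np.diff(e)**2) / np.sum(e**2)
dw_manual   # should match durbin_watson(residuals) above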

12.8 5. Lack of Multicollinearity


12.9 a. VIF - Variance Inflation Factor
[31]: vif=[]

for i in range(x_train.shape[1]):
    vif.append(variance_inflation_factor(x_train,i))

pd.DataFrame({'Vif':vif}, index=x_train.columns)

[31]: Vif
crim 2.095894
zn 2.928062
indus 13.829768
nox 80.580602
rm 80.295207
age 22.821527
dis 14.784871
rad 14.694806
tax 57.284635
ptratio 87.191073
black 21.647351
lstat 11.319795

The degree of correlation can be quantified with the Variance Inflation Factor (VIF). It can be interpreted as:
• VIF = 1: not correlated
• 1 < VIF ≤ 5: moderately correlated
• VIF > 5: highly correlated
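For a single predictor, the VIF follows from the auxiliary regression of that predictor on all the others: VIF_j = 1 / (1 - R_j^2). The sketch below is a minimal cross-check of this definition; 'nox' is only an illustrative choice, and no constant is added so that it matches how variance_inflation_factor was called above.

[ ]: # Cross-check for a single predictor: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on the remaining predictors.
# No constant is added, matching how variance_inflation_factor was called above.
aux = sm.OLS(x_train['nox'], x_train.drop(columns=['nox'])).fit()
print(1 / (1 - aux.rsquared))   # should agree with the 'nox' VIF reported above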

12.10 b. Heatmap Method - Correlation Matrix


[32]: fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(data.iloc[:,0:12].corr(),annot=True,linewidths=.5, ax=ax)

[32]: <AxesSubplot:>

[33]: plt.scatter(data["rad"],data["tax"])
plt.xlabel("rad")
plt.ylabel("tax")
plt.show()

[34]: # Homework: use other regression methods to come up with predictive models for medv (house price)
# (explore the scikit-learn library in Python)
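One possible starting point (not the only choice) is sketched below: a ridge regression from scikit-learn, fitted on the same Boston train/test split and evaluated with the same RMSE metric. The regularization strength alpha=1.0 is an untuned default.

[ ]: from sklearn.linear_model import Ridge

# Ridge regression on the same train/test split, evaluated with the test-set RMSE
ridge = Ridge(alpha=1.0)   # alpha=1.0 is an untuned default, not a recommendation
ridge.fit(x_train, y_train)
rmse_ridge = np.sqrt(mean_squared_error(y_test, ridge.predict(x_test)))
rmse_ridge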

[ ]:

