
linear-regression

March 24, 2024

0.1 Importing the required libraries and modules


[1]: import pandas as pd
from sklearn.model_selection import train_test_split #For splitting the data into train & test sets
from sklearn.linear_model import LinearRegression #Linear regression model
from sklearn.metrics import mean_squared_error #Metric for regression: MSE
import statsmodels.api as sm #Regression model summary
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
import scipy as sp
import statsmodels.tsa.api as smt

[22]: #loading the dataset
data=pd.read_csv("C:/Users/ramaleer/Desktop/Practical 2/Datasets/Boston.CSV")

#get the first 5 rows; for the last 5 rows use data.tail()
data.head()

[22]:       crim    zn  indus    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
      0  0.00632  18.0   2.31  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
      1  0.02731   0.0   7.07  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
      2  0.02729   0.0   7.07  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
      3  0.03237   0.0   2.18  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
      4  0.06905   0.0   2.18  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2

Variable Description
• CRIM: Per capita crime rate by town
• ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS: Proportion of non-retail business acres per town.
• NOX: Nitric oxides concentration (parts per 10 million)
• RM: Average number of rooms per dwelling
• AGE: Proportion of owner-occupied units built prior to 1940
• DIS: Weighted distances to five Boston employment centers
• RAD: Index of accessibility to radial highways
• TAX: Full-value property-tax rate per 10,000 dollars
• PTRATIO: Pupil-teacher ratio by town
• BLACK: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT: Percentage of lower-status population
• MEDV: Median value of owner-occupied homes in 1,000's of dollars
The target variable is medv, which is the house price.
[3]: data.shape

[3]: (506, 13)

[4]: #get the column and row count

print("Columns:",data.shape[1])
print("Rows:",data.shape[0])

Columns: 13
Rows: 506

1 Separating the independent data matrix & response vector


[35]: #define x, y
x = data.drop(columns = ['medv'], axis=1) #independent variables
y = data.medv #target variable

[ ]:

2 Splitting data into training & testing sets (validation set approach)
[6]: # the test set is 20% of the data; the remaining 80% is used for training
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

3 Creating a linear regression model object
[7]: model=LinearRegression()

4 Training the model with training data


[8]: model.fit(x_train,y_train)

[8]: LinearRegression()

5 Estimated model parameters for independent variables


[9]: #coefficients of the independent variables
model.coef_

[9]: array([-1.25156811e-01,  4.48575121e-02,  2.01265385e-02, -1.62895522e+01,
             3.77366409e+00, -2.13185120e-03, -1.40611846e+00,  2.66835482e-01,
            -1.21692284e-02, -1.07660161e+00,  8.72533635e-03, -4.94425348e-01])

6 Intercept of the model


[10]: # beta-0 value
model.intercept_

[10]: 38.42706211257645

7 R-Squared value for the trained model


[11]: model.score(x_train,y_train)*100

[11]: 76.90953567794605

8 Predicting the response for the unseen test data


[12]: y_pred=model.predict(x_test)

9 Mean Squared Error for the testing data
[13]: MSE=mean_squared_error(y_test,y_pred)
MSE

[13]: 34.21225254753325

10 Root Mean Squared Error


[14]: np.sqrt(MSE)

[14]: 5.84912408378667

11 Regression Model Summary using Statsmodels


[15]: #loading the dataset
data=pd.read_csv("C:/Users/ramaleer/Desktop/Practical 2/Datasets/advertising.CSV")

#get the first 5 rows; for the last 5 rows use data.tail()
data.head()

[15]:       TV  Radio  Newspaper  Sales
      0  230.1   37.8       69.2   22.1
      1   44.5   39.3       45.1   10.4
      2   17.2   45.9       69.3   12.0
      3  151.5   41.3       58.5   16.5
      4  180.8   10.8       58.4   17.9

[16]: #define x, y
x = data.drop(columns = ['Sales'], axis=1) #independent variables
y = data.Sales #target variable

[17]: #add constant to predictor variables


x = sm.add_constant(x)

#fit linear regression model


modelreg = sm.OLS(y, x).fit()

#view model summary


print(modelreg.summary())

data.shape

OLS Regression Results

==============================================================================
Dep. Variable: Sales R-squared: 0.903
Model: OLS Adj. R-squared: 0.901
Method: Least Squares F-statistic: 605.4
Date: Sun, 21 Jan 2024 Prob (F-statistic): 8.13e-99
Time: 15:17:22 Log-Likelihood: -383.34
No. Observations: 200 AIC: 774.7
Df Residuals: 196 BIC: 787.9
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.6251 0.308 15.041 0.000 4.019 5.232
TV 0.0544 0.001 39.592 0.000 0.052 0.057
Radio 0.1070 0.008 12.604 0.000 0.090 0.124
Newspaper 0.0003 0.006 0.058 0.954 -0.011 0.012
==============================================================================
Omnibus: 16.081 Durbin-Watson: 2.251
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.655
Skew: -0.431 Prob(JB): 9.88e-07
Kurtosis: 4.605 Cond. No. 454.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.

[17]: (200, 4)

12 Linear Regression - Assumption Checking


1. Linearity
2. Homoscedasticity
3. Multivariate Normality
4. Independence
5. Lack of Multicollinearity

12.1 1. Linearity
[23]: fig,(ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(5, 12))

ax1.scatter(data['nox'],data['medv'])
ax1.set_title('nox-Nitric oxides concentration - parts per 10 million')
ax2.scatter(data['rm'],data['medv'])
ax2.set_title('rm-Average number of rooms per dwelling')
ax3.scatter(data['age'],data['medv'])
ax3.set_title('age-Proportion of owner-occupied units built prior to 1940')

plt.show()

12.2 2. Homoscedasticity
[24]: # Calculating residuals on the test set
residuals = y_test - y_pred

# Calculating standardized residuals
standardized_residuals = residuals / np.std(residuals)

[25]: # Checking for equal variance (homoscedasticity)
# We can check this with a scatter plot, where the x-axis has the predictions
# and the y-axis has the standardized residuals

plt.scatter(y_pred, standardized_residuals)
plt.xlabel("Predicted values")
plt.ylabel("Standardized Residuals")
plt.title("Residuals vs Fitted Values")
plt.axhline(y=0, color='r', linestyle='--') # horizontal reference line at y=0

plt.show()

12.3 3. Multivariate Normality
Normality of Residuals

12.4 a. Using a Histogram


[26]: # Fit a normal distribution to the residuals: estimate the mean and standard deviation
mu, std = norm.fit(residuals)

# Plot the histogram of the residuals
plt.hist(residuals, bins=30, density=True, alpha=0.6, color='b')

# Overlay the fitted normal PDF
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)

plt.plot(x, p, 'k', linewidth=2)

plt.show()

12.5 b. QQ Plot - Quantile-Quantile Plot


[27]: fig, ax =plt.subplots(figsize=(6,4))
sp.stats.probplot(residuals,plot=ax,fit=True)

plt.show()

[28]: # Homework: apply the Kolmogorov-Smirnov and Shapiro-Wilk tests to check the normality of the residuals
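A minimal sketch of what this homework might look like, assuming the residuals series and the mu, std estimates from norm.fit above; both tests take normality of the residuals as the null hypothesis.

[ ]: from scipy import stats

# Shapiro-Wilk test (H0: the residuals are normally distributed)
sw_stat, sw_p = stats.shapiro(residuals)
print("Shapiro-Wilk:", sw_stat, sw_p)

# Kolmogorov-Smirnov test against a normal distribution with the mean and
# standard deviation estimated from the residuals (mu, std from norm.fit above).
# Strictly, estimating the parameters from the data calls for the Lilliefors variant.
ks_stat, ks_p = stats.kstest(residuals, 'norm', args=(mu, std))
print("Kolmogorov-Smirnov:", ks_stat, ks_p)

# A p-value below 0.05 suggests a departure from normality.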

12.6 4. Independence of observations - no autocorrelation of errors (residuals)


12.7 ACF plot and Durbin-Watson test
[29]: # graphical check: autocorrelation function (ACF) plot of the residuals
acf = smt.graphics.plot_acf(residuals, lags=40, alpha=0.05)
plt.show()


[30]: #perform Durbin-Watson test

durbin_watson(residuals)

[30]: 2.0260473760330022

Some notes on the Durbin-Watson test:
• the test statistic always has a value between 0 and 4
• a value of 2 means that there is no autocorrelation in the sample
• values < 2 indicate positive autocorrelation, values > 2 negative autocorrelation
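As a quick sanity check, the statistic can also be computed directly from its definition, DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); the minimal sketch below, using the residuals series from above, should reproduce the value returned by durbin_watson.

[ ]: # Durbin-Watson statistic from its definition:
# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
e = np.asarray(residuals)
dw_manual = np.sum(np.diff(e)**2) / np.sum(e**2)
dw_manual   # should match durbin_watson(residuals) above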

12.8 5. Lack of Multicollinearity


12.9 a. VIF - Variance Inflation Factor
[31]: vif=[]

for i in range(x_train.shape[1]):
    vif.append(variance_inflation_factor(x_train,i))

pd.DataFrame({'Vif':vif}, index=x_train.columns)

[31]: Vif
crim 2.095894
zn 2.928062
indus 13.829768
nox 80.580602
rm 80.295207
age 22.821527
dis 14.784871
rad 14.694806
tax 57.284635
ptratio 87.191073
black 21.647351
lstat 11.319795

The degree of correlation can be quantified with the Variance Inflation Factor (VIF). It can be interpreted as:
• VIF = 1: not correlated
• 1 < VIF ≤ 5: moderately correlated
• VIF > 5: highly correlated
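For a single predictor, the VIF follows from the auxiliary regression of that predictor on all the others: VIF_j = 1 / (1 - R_j^2). The sketch below is a minimal cross-check of this definition; 'nox' is only an illustrative choice, and no constant is added so that it matches how variance_inflation_factor was called above.

[ ]: # Cross-check for a single predictor: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on the remaining predictors.
# No constant is added, matching how variance_inflation_factor was called above.
aux = sm.OLS(x_train['nox'], x_train.drop(columns=['nox'])).fit()
print(1 / (1 - aux.rsquared))   # should agree with the 'nox' VIF reported above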

12.10 b. Heatmap Method - Correlation Matrix


[32]: fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(data.iloc[:,0:12].corr(),annot=True,linewidths=.5, ax=ax)

[32]: <AxesSubplot:>

[33]: plt.scatter(data["rad"],data["tax"])
plt.xlabel("rad")
plt.ylabel("tax")
plt.show()

[34]: # Homework: use other regression methods to come up with predictive models for medv (house price)
# (explore the scikit-learn library in Python)
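One possible starting point (not the only choice) is sketched below: a ridge regression from scikit-learn, fitted on the same Boston train/test split and evaluated with the same RMSE metric. The regularization strength alpha=1.0 is an untuned default.

[ ]: from sklearn.linear_model import Ridge

# Ridge regression on the same train/test split, evaluated with the test-set RMSE
ridge = Ridge(alpha=1.0)   # alpha=1.0 is an untuned default, not a recommendation
ridge.fit(x_train, y_train)
rmse_ridge = np.sqrt(mean_squared_error(y_test, ridge.predict(x_test)))
rmse_ridge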

[ ]:

