
Boston Housing Price Prediction

Overview
In this report I apply regression models to predict housing prices in Boston suburbs. The dataset for this experiment comes from the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Housing. The report walks through the entire process: getting and cleaning the data, exploratory analysis to understand the distribution of the features and their influence on the outcome, forming a hypothesis, training regression models, and evaluating them.

Introduction
The dataset consists of 506 observations of 14 attributes. The median value of house price in $1000s, denoted by MEDV, is the outcome (dependent variable) in our model. Below is a brief description of each feature and the outcome in our dataset:

CRIM – per capita crime rate by town
ZN – proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS – proportion of non-retail business acres per town
CHAS – Charles River dummy variable (1 if tract bounds river; else 0)
NOX – nitric oxides concentration (parts per 10 million)
RM – average number of rooms per dwelling
AGE – proportion of owner-occupied units built prior to 1940
DIS – weighted distances to five Boston employment centres
RAD – index of accessibility to radial highways
TAX – full-value property-tax rate per $10,000
PTRATIO – pupil-teacher ratio by town
B – 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT – % lower status of the population
MEDV – median value of owner-occupied homes in $1000s

Packages Required
library(ggplot2)
library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

library(tidyverse)
## -- Attaching packages ------------------------------------------------ tidyverse 1.2.1 --

## v tibble 2.1.3     v purrr 0.3.2
## v tidyr 0.8.3      v stringr 1.4.0
## v readr 1.3.1      v forcats 0.4.0

## -- Conflicts --------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

library(corrplot)

## corrplot 0.84 loaded

library(rpart)
library(caTools)
library(sp)
library(raster)

##
## Attaching package: 'raster'

## The following object is masked from 'package:tidyr':
##
##     extract

## The following object is masked from 'package:dplyr':
##
##     select

library(usdm)
library(lmtest)

## Loading required package: zoo

##
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric

library(sandwich)
library(broom)
options(scipen = 99999)

Getting and Cleaning the Data


myData <- read.csv("E:/imi/BA data/Boston.csv", header = TRUE)
df <- myData

Looking at the data


dim(df)

## [1] 506 15

head(df)

## X crim zn indus chas nox rm age dis rad tax ptratio black
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7

Removing the first column


df<- df[,-1]

Structure of the data


str(df)

## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(df)

## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00

IDENTIFY MISSING VALUES


sum(is.na(df))

## [1] 0

IDENTIFY OUTLIERS & TREAT


par(mfrow = c(1, 2))
boxplot(df)
abline(h = min(df), col = "blue")
abline(h = max(df), col = "yellow")
abline(h = quantile(df$medv, c(0.25, 0.75)), col = "red")
abline(h = quantile(df$black, c(0.25, 0.75)), col = "red")

#### OUTLIERS TREATMENT, CAPPING OUTLIERS

caps <- quantile(df$medv, probs = c(.05, .95), na.rm = TRUE)
df$medv <- ifelse(df$medv < caps[1], caps[1], df$medv)
df$medv <- ifelse(df$medv > caps[2], caps[2], df$medv)

caps <- quantile(df$crim, probs = c(.05, .95), na.rm = TRUE)
df$crim <- ifelse(df$crim < caps[1], caps[1], df$crim)
df$crim <- ifelse(df$crim > caps[2], caps[2], df$crim)

caps <- quantile(df$indus, probs = c(.05, .95), na.rm = TRUE)
df$indus <- ifelse(df$indus < caps[1], caps[1], df$indus)
df$indus <- ifelse(df$indus > caps[2], caps[2], df$indus)

caps <- quantile(df$dis, probs = c(.05, .95), na.rm = TRUE)
df$dis <- ifelse(df$dis < caps[1], caps[1], df$dis)
df$dis <- ifelse(df$dis > caps[2], caps[2], df$dis)

caps <- quantile(df$rad, probs = c(.05, .95), na.rm = TRUE)
df$rad <- ifelse(df$rad < caps[1], caps[1], df$rad)
df$rad <- ifelse(df$rad > caps[2], caps[2], df$rad)

caps <- quantile(df$tax, probs = c(.05, .95), na.rm = TRUE)
df$tax <- ifelse(df$tax < caps[1], caps[1], df$tax)
df$tax <- ifelse(df$tax > caps[2], caps[2], df$tax)

caps <- quantile(df$zn, probs = c(.05, .95), na.rm = TRUE)
df$zn <- ifelse(df$zn < caps[1], caps[1], df$zn)
df$zn <- ifelse(df$zn > caps[2], caps[2], df$zn)

caps <- quantile(df$black, probs = c(.05, .95), na.rm = TRUE)
df$black <- ifelse(df$black < caps[1], caps[1], df$black)
df$black <- ifelse(df$black > caps[2], caps[2], df$black)

boxplot(df)

Some outliers were present in medv, crim, indus, zn, dis, rad, tax and black. They were treated by capping the extreme values at the 5th and 95th percentile values.
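The repeated capping blocks above could be condensed into a small helper; a minimal sketch (the helper name cap_at is ours, and the column list mirrors the variables treated above):

#Sketch: cap a numeric vector at its 5th and 95th percentiles
cap_at <- function(x, probs = c(.05, .95)) {
  caps <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, caps[1]), caps[2])
}
cols <- c("medv", "crim", "indus", "zn", "dis", "rad", "tax", "black")
df[cols] <- lapply(df[cols], cap_at)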

Data Exploration
#checking correlation between variables
par(mfrow=c(1,1))
corrplot(cor(df), method = "number", type = "upper", diag = FALSE)

From the correlation matrix, some of the observations made are as follows:
1. Median value of owner-occupied homes (in $1000s) increases as the average number of rooms per dwelling increases, and decreases as the percentage of lower-status population in the area increases.
2. nox, the nitric oxides concentration (parts per 10 million), increases with the proportion of non-retail business acres per town and with the proportion of owner-occupied units built prior to 1940.
3. rad and tax have a strong positive correlation of 0.92, which implies that as accessibility to radial highways increases, the full-value property-tax rate per $10,000 also increases.
4. crim is strongly associated with rad and tax, which implies that as accessibility to radial highways increases, the per capita crime rate increases.
5. indus has a strong positive correlation with nox, which supports the notion that nitric oxides concentration is high in industrial areas.
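The pairwise claims above can be checked directly; for example (note that df has already been capped at this point, so values may differ slightly from the raw-data correlations):

#Correlations behind observations 1 and 3
cor(df$medv, df$rm)    # positive
cor(df$medv, df$lstat) # negative
cor(df$rad, df$tax)    # strongly positive, about 0.92 in the raw data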

Data Visualization
df %>%
gather(key, val, -medv) %>%
ggplot(aes(x = val, y = medv)) +
geom_point() +
stat_smooth(method = "lm", se = TRUE, col = "blue") +
facet_wrap(~key, scales = "free") +
theme_gray() +
ggtitle("Scatter plot of dependent variables vs Median Value (medv)")

table(df$chas)

##
## 0 1
## 471 35

Observations
1. The proportion of owner-occupied units built prior to 1940 (age) and the proportion of blacks by town (black) are heavily skewed to the left, while the per capita crime rate in town (crim) and the weighted mean of distances to five Boston employment centres (dis) are heavily skewed to the right.
2. rm is approximately normally distributed with a mean of about 6.
3. Most of the properties are situated close to the five Boston employment centres (dis is right-skewed).
4. There is a high proportion of owner-occupied units built prior to 1940 and of blacks in town (age and black are left-skewed).
5. From the scatter plots, lstat and rm show strong correlation with medv.
6. 93% of the properties are away from the Charles River. The properties bordering the river seem to have higher median prices.
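The 93% figure in observation 6 can be verified directly from the frequency table above:

#Share of tracts away from (0) vs bounding (1) the Charles River
prop.table(table(df$chas)) # 471/506 is roughly 0.93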

Building Linear Regression Model


model1 <- lm(medv ~ ., data = df)
model1.sum <- summary(model1)
model1.sum
##
## Call:
## lm(formula = medv ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2319 -2.5572 -0.4885 1.5944 20.5989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.799995 4.600436 8.434 0.000000000000000374 ***
## crim -0.190658 0.089359 -2.134 0.033367 *
## zn 0.033755 0.012335 2.736 0.006434 **
## indus -0.023110 0.058157 -0.397 0.691259
## chas 2.039352 0.752394 2.710 0.006953 **
## nox -17.873037 3.416867 -5.231 0.000000250130127140 ***
## rm 3.453089 0.362500 9.526 < 0.0000000000000002 ***
## age -0.012814 0.011593 -1.105 0.269575
## dis -1.591317 0.195170 -8.154 0.000000000000002960 ***
## rad 0.304452 0.068996 4.413 0.000012560653430629 ***
## tax -0.011082 0.003523 -3.146 0.001756 **
## ptratio -0.945110 0.113561 -8.323 0.000000000000000856 ***
## black 0.009780 0.002648 3.694 0.000246 ***
## lstat -0.438845 0.045729 -9.597 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.122 on 492 degrees of freedom
## Multiple R-squared: 0.7597, Adjusted R-squared: 0.7534
## F-statistic: 119.6 on 13 and 492 DF, p-value: < 0.00000000000000022

#Looking at the model summary, we see that variables indus and age are insignificant
#Building a model without indus and age, and converting medv to log
medv_n <- log(df$medv)
df <- cbind(df, medv_n)
model2 <- lm(medv_n ~ . -indus -age -medv, data = df)
model2.sum <- summary(model2)
model2.sum

##
## Call:
## lm(formula = medv_n ~ . - indus - age - medv, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60839 -0.10172 -0.00987 0.07779 0.76885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1016772 0.1844688 22.235 < 0.0000000000000002 ***
## crim -0.0198838 0.0035893 -5.540 0.0000000493272615 ***
## zn 0.0010235 0.0004865 2.104 0.03590 *
## chas 0.0816225 0.0299808 2.722 0.00671 **
## nox -0.8390664 0.1282158 -6.544 0.0000000001499610 ***
## rm 0.0908910 0.0142014 6.400 0.0000000003611354 ***
## dis -0.0572489 0.0073013 -7.841 0.0000000000000278 ***
## rad 0.0162700 0.0027002 6.026 0.0000000032978212 ***
## tax -0.0005146 0.0001269 -4.054 0.0000584828599646 ***
## ptratio -0.0396055 0.0045315 -8.740 < 0.0000000000000002 ***
## black 0.0004556 0.0001061 4.294 0.0000211708505073 ***
## lstat -0.0229497 0.0017263 -13.294 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1657 on 494 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.7934
## F-statistic: 177.3 on 11 and 494 DF, p-value: < 0.00000000000000022

Multicollinearity Test
#Checking VIF
vif(df[,-c(3,7,14,15)])

## Variables VIF
## 1 crim 5.165720
## 2 zn 2.180946
## 3 chas 1.067136
## 4 nox 4.062275
## 5 rm 1.832256
## 6 dis 3.713727
## 7 rad 10.086903
## 8 tax 8.217313
## 9 ptratio 1.771196
## 10 black 1.368923
## 11 lstat 2.796591

rad has VIF > 10, indicating multicollinearity between rad and tax; we remove tax because rad is the more significant of the two.
#Building model without variables indus, age and tax
model3 <- lm(medv_n ~ . -indus -age -tax -medv, data = df)
model3.sum <- summary(model3)
model3.sum

##
## Call:
## lm(formula = medv_n ~ . - indus - age - medv, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60839 -0.10172 -0.00987 0.07779 0.76885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1016772 0.1844688 22.235 < 0.0000000000000002 ***
## crim -0.0198838 0.0035893 -5.540 0.0000000493272615 ***
## zn 0.0010235 0.0004865 2.104 0.03590 *
## chas 0.0816225 0.0299808 2.722 0.00671 **
## nox -0.8390664 0.1282158 -6.544 0.0000000001499610 ***
## rm 0.0908910 0.0142014 6.400 0.0000000003611354 ***
## dis -0.0572489 0.0073013 -7.841 0.0000000000000278 ***
## rad 0.0162700 0.0027002 6.026 0.0000000032978212 ***
## tax -0.0005146 0.0001269 -4.054 0.0000584828599646 ***
## ptratio -0.0396055 0.0045315 -8.740 < 0.0000000000000002 ***
## black 0.0004556 0.0001061 4.294 0.0000211708505073 ***
## lstat -0.0229497 0.0017263 -13.294 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1657 on 494 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.7934
## F-statistic: 177.3 on 11 and 494 DF, p-value: < 0.00000000000000022

#Again checking the VIF
vif(df[,-c(3,7,10,14,15)])

## Variables VIF
## 1 crim 5.151328
## 2 zn 2.085299
## 3 chas 1.057882
## 4 nox 3.870185
## 5 rm 1.802072
## 6 dis 3.648244
## 7 rad 5.053853
## 8 ptratio 1.734311
## 9 black 1.365118
## 10 lstat 2.790153

No multicollinearity remains, as all VIF < 10.

Heteroscedasticity Test
par(mfrow=c(2,2))
plot(model3)
The top-left panel plots residuals vs fitted values, while the bottom-left panel plots standardised residuals on the Y axis. If there is no heteroscedasticity at all, you should see a completely random, even spread of points across the range of the X axis and a flat red line.

Breusch-Pagan Test
bptest(df$medv_n ~ . -indus -age -tax -medv, data = df)

##
## studentized Breusch-Pagan test
##
## data: df$medv_n ~ . - indus - age - tax - medv
## BP = 65.679, df = 10, p-value = 0.0000000003004

This test has a p-value below the 0.05 significance level, so we reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present.

Newey West Covariance Matrix Estimation


#Newey & West (1994) compute this type of estimator
NeweyWest(model2)

## (Intercept) crim zn
## (Intercept) 0.24722073572 -0.00020147280319 0.000121839494212
## crim -0.00020147280 0.00002717475901 -0.000000495284715
## zn 0.00012183949 -0.00000049528472 0.000000310655085
## chas 0.00594573689 0.00002746371119 0.000003483520746
## nox -0.07593735688 -0.00024907019539 -0.000020269791427
## rm -0.02021116925 0.00003343811922 -0.000013785034454
## dis -0.00539611192 -0.00000369181679 -0.000004505235715
## rad 0.00110797481 -0.00000730530231 0.000000729695327
## tax -0.00001304137 0.00000009798847 -0.000000005550768
## ptratio -0.00138405476 0.00000227806362 -0.000000041169849
## black -0.00005033149 0.00000024972438 -0.000000003947017
## lstat -0.00141357677 -0.00000070768295 -0.000000857040920
## chas nox rm
## (Intercept) 0.0059457368916 -0.075937356885 -0.020211169250
## crim 0.0000274637112 -0.000249070195 0.000033438119
## zn 0.0000034835207 -0.000020269791 -0.000013785034
## chas 0.0017575470459 -0.002992160421 -0.000577625022
## nox -0.0029921604209 0.046421861487 0.004074314837
## rm -0.0005776250220 0.004074314837 0.001992745599
## dis -0.0001349409404 0.001951391237 0.000439792236
## rad 0.0000256620395 -0.000211005671 -0.000106347900
## tax 0.0000006036050 -0.000001105169 0.000001308947
## ptratio -0.0000081232422 0.000531050499 0.000078676997
## black -0.0000000844469 0.000021361623 0.000002039138
## lstat -0.0000497269860 0.000182604917 0.000154019325
## dis rad tax
## (Intercept) -0.0053961119212 0.0011079748064 -0.000013041369444
## crim -0.0000036918168 -0.0000073053023 0.000000097988465
## zn -0.0000045052357 0.0000007296953 -0.000000005550768
## chas -0.0001349409404 0.0000256620395 0.000000603604982
## nox 0.0019513912371 -0.0002110056711 -0.000001105169385
## rm 0.0004397922362 -0.0001063479003 0.000001308947370
## dis 0.0001837258430 -0.0000234413321 0.000000379965888
## rad -0.0000234413321 0.0000119242973 -0.000000222886748
## tax 0.0000003799659 -0.0000002228867 0.000000014814894
## ptratio 0.0000128089027 -0.0000053109613 -0.000000107799449
## black 0.0000006260831 -0.0000001140361 0.000000002629797
## lstat 0.0000353562840 -0.0000073588948 0.000000056134491
## ptratio black lstat
## (Intercept) -0.00138405476464 -0.000050331488083 -0.00141357676841
## crim 0.00000227806362 0.000000249724375 -0.00000070768295
## zn -0.00000004116985 -0.000000003947017 -0.00000085704092
## chas -0.00000812324218 -0.000000084446898 -0.00004972698599
## nox 0.00053105049899 0.000021361623286 0.00018260491654
## rm 0.00007867699678 0.000002039138022 0.00015401932509
## dis 0.00001280890273 0.000000626083093 0.00003535628405
## rad -0.00000531096128 -0.000000114036130 -0.00000735889477
## tax -0.00000010779945 0.000000002629797 0.00000005613449
## ptratio 0.00002988973480 0.000000227264892 -0.00000014200951
## black 0.00000022726489 0.000000049161470 0.00000005832321
## lstat -0.00000014200951 0.000000058323214 0.00001840695144
#The Newey & West (1987) estimator requires specification of the lag and
#suppression of prewhitening
NeweyWest(model2, lag = 4, prewhite = FALSE)

## (Intercept) crim zn
## (Intercept) 0.18394330522 -0.00013734230539 0.000082656882999
## crim -0.00013734231 0.00002618724369 -0.000000407121320
## zn 0.00008265688 -0.00000040712132 0.000000249649886
## chas 0.00241758560 0.00001910099026 0.000002166000678
## nox -0.05432961206 -0.00019257733054 -0.000013316979201
## rm -0.01477209213 0.00002420552246 -0.000009557232357
## dis -0.00381377071 -0.00000069276287 -0.000003300631247
## rad 0.00074927438 -0.00000778427464 0.000000493657013
## tax -0.00001094528 0.00000006230308 -0.000000005021864
## ptratio -0.00110386996 0.00000178318818 0.000000078472813
## black -0.00004220202 0.00000016982678 -0.000000003201793
## lstat -0.00101635266 -0.00000026924198 -0.000000552367980
## chas nox rm
## (Intercept) 0.0024175855959 -0.0543296120606 -0.0147720921263
## crim 0.0000191009903 -0.0001925773305 0.0000242055225
## zn 0.0000021660007 -0.0000133169792 -0.0000095572324
## chas 0.0019347233018 -0.0021779825017 -0.0002610757484
## nox -0.0021779825017 0.0367629360939 0.0026346547780
## rm -0.0002610757484 0.0026346547780 0.0014500834959
## dis -0.0000963320145 0.0014656878606 0.0002988117187
## rad 0.0000166856913 -0.0001370064198 -0.0000708539260
## tax 0.0000010847932 -0.0000009781616 0.0000009868642
## ptratio 0.0000179410785 0.0003790017552 0.0000638659601
## black 0.0000004989522 0.0000160304765 0.0000018777980
## lstat -0.0000342184642 0.0000715857446 0.0001129414758
## dis rad tax
## (Intercept) -0.0038137707101 0.0007492743830 -0.000010945278520
## crim -0.0000006927629 -0.0000077842746 0.000000062303084
## zn -0.0000033006312 0.0000004936570 -0.000000005021864
## chas -0.0000963320145 0.0000166856913 0.000001084793232
## nox 0.0014656878606 -0.0001370064198 -0.000000978161632
## rm 0.0002988117187 -0.0000708539260 0.000000986864245
## dis 0.0001311580386 -0.0000158571929 0.000000207647457
## rad -0.0000158571929 0.0000099532191 -0.000000200287214
## tax 0.0000002076475 -0.0000002002872 0.000000013859207
## ptratio 0.0000114666664 -0.0000038243527 -0.000000019921459
## black 0.0000005173355 -0.0000000811543 0.000000002735151
## lstat 0.0000230525902 -0.0000046752323 -0.000000001891525
## ptratio black lstat
## (Intercept) -0.00110386995875 -0.000042202024094 -0.001016352657991
## crim 0.00000178318818 0.000000169826780 -0.000000269241977
## zn 0.00000007847281 -0.000000003201793 -0.000000552367980
## chas 0.00001794107851 0.000000498952187 -0.000034218464196
## nox 0.00037900175518 0.000016030476545 0.000071585744554
## rm 0.00006386596012 0.000001877798024 0.000112941475791
## dis 0.00001146666641 0.000000517335489 0.000023052590245
## rad -0.00000382435272 -0.000000081154296 -0.000004675232328
## tax -0.00000001992146 0.000000002735151 -0.000000001891525
## ptratio 0.00002253698112 0.000000183953549 -0.000000060676354
## black 0.00000018395355 0.000000039573141 0.000000086822108
## lstat -0.00000006067635 0.000000086822108 0.000015254638658

#bwNeweyWest() can also be passed to kernHAC(), e.g. for the quadratic
#spectral kernel
kernHAC(model2, bw = bwNeweyWest)

## (Intercept) crim zn
## (Intercept) 0.25344611596 -0.0001977417080 0.000121300204680
## crim -0.00019774171 0.0000289402385 -0.000000500942137
## zn 0.00012130020 -0.0000005009421 0.000000325974127
## chas 0.00568860484 0.0000372986992 0.000003266568628
## nox -0.07829546882 -0.0002941205017 -0.000020048828543
## rm -0.02054541170 0.0000360646698 -0.000013844994593
## dis -0.00555306675 -0.0000052033503 -0.000004638882583
## rad 0.00110968572 -0.0000074767052 0.000000719979977
## tax -0.00001289919 0.0000001122660 -0.000000004672457
## ptratio -0.00140017744 0.0000021573574 0.000000014255484
## black -0.00005386745 0.0000002596249 -0.000000003875833
## lstat -0.00146639634 -0.0000004760912 -0.000000873842837
## chas nox rm
## (Intercept) 0.0056886048387 -0.078295468823 -0.020545411701
## crim 0.0000372986992 -0.000294120502 0.000036064670
## zn 0.0000032665686 -0.000020048829 -0.000013844995
## chas 0.0018985948663 -0.003092251196 -0.000552624756
## nox -0.0030922511958 0.048918178229 0.004126444960
## rm -0.0005526247557 0.004126444960 0.002016382154
## dis -0.0001337991719 0.002040090488 0.000449974214
## rad 0.0000231600848 -0.000210959074 -0.000106490415
## tax 0.0000005991687 -0.000001345769 0.000001294066
## ptratio -0.0000092700183 0.000530477276 0.000078388740
## black 0.0000002403154 0.000022341197 0.000002223894
## lstat -0.0000454303913 0.000192999181 0.000158534206
## dis rad tax
## (Intercept) -0.005553066746 0.0011096857228 -0.000012899187270
## crim -0.000005203350 -0.0000074767052 0.000000112266048
## zn -0.000004638883 0.0000007199800 -0.000000004672457
## chas -0.000133799172 0.0000231600848 0.000000599168709
## nox 0.002040090488 -0.0002109590739 -0.000001345768997
## rm 0.000449974214 -0.0001064904150 0.000001294066084
## dis 0.000192012364 -0.0000233664310 0.000000372908016
## rad -0.000023366431 0.0000120381584 -0.000000225200882
## tax 0.000000372908 -0.0000002252009 0.000000015117500
## ptratio 0.000011989274 -0.0000051912936 -0.000000121021351
## black 0.000000652395 -0.0000001175325 0.000000003172883
## lstat 0.000037027942 -0.0000075220851 0.000000057815038
## ptratio black lstat
## (Intercept) -0.00140017743831 -0.000053867453755 -0.00146639634131
## crim 0.00000215735744 0.000000259624923 -0.00000047609121
## zn 0.00000001425548 -0.000000003875833 -0.00000087384284
## chas -0.00000927001834 0.000000240315393 -0.00004543039134
## nox 0.00053047727608 0.000022341197084 0.00019299918110
## rm 0.00007838873985 0.000002223894390 0.00015853420585
## dis 0.00001198927429 0.000000652394978 0.00003702794171
## rad -0.00000519129364 -0.000000117532462 -0.00000752208512
## tax -0.00000012102135 0.000000003172883 0.00000005781504
## ptratio 0.00003057456294 0.000000254951918 0.00000008819678
## black 0.00000025495192 0.000000051861562 0.00000006053389
## lstat 0.00000008819678 0.000000060533885 0.00001895824589

# Newey West Adjusted Regression Estimation
ht <- coeftest(model3, df = Inf, vcov = NeweyWest(model3, lag = 4, prewhite = FALSE))
ht

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.01360943 0.43549104 9.2163 < 0.00000000000000022 ***
## crim -0.01911578 0.00522037 -3.6618 0.0002505 ***
## zn 0.00061046 0.00051527 1.1847 0.2361233
## chas 0.09294084 0.04268523 2.1774 0.0294542 *
## nox -0.95209639 0.19764981 -4.8171 0.00000145669 ***
## rm 0.09828034 0.03886071 2.5290 0.0114375 *
## dis -0.05331846 0.01171117 -4.5528 0.00000529399 ***
## rad 0.00853763 0.00265633 3.2141 0.0013087 **
## ptratio -0.04225658 0.00485076 -8.7113 < 0.00000000000000022 ***
## black 0.00047828 0.00019916 2.4015 0.0163264 *
## lstat -0.02328544 0.00397583 -5.8567 0.00000000472 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, zn is insignificant.
#Building model without variables indus, age, tax and zn
model4 <- lm(medv_n ~ . -indus -age -zn -tax -medv, data = df)
model4.sum <- summary(model4)
model4.sum

##
## Call:
## lm(formula = medv_n ~ . - indus - age - zn - tax - medv, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58062 -0.10397 -0.00716 0.09115 0.77994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0279688 0.1857837 21.681 < 0.0000000000000002 ***
## crim -0.0184431 0.0036028 -5.119 0.000000439941172 ***
## chas 0.0923608 0.0303271 3.045 0.00245 **
## nox -0.9619543 0.1269205 -7.579 0.000000000000172 ***
## rm 0.1002435 0.0142258 7.047 0.000000000006165 ***
## dis -0.0490981 0.0065499 -7.496 0.000000000000305 ***
## rad 0.0085632 0.0019419 4.410 0.000012703020230 ***
## ptratio -0.0439714 0.0043492 -10.110 < 0.0000000000000002 ***
## black 0.0004792 0.0001077 4.450 0.000010589912577 ***
## lstat -0.0233657 0.0017509 -13.345 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1683 on 496 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.7867
## F-statistic: 207.9 on 9 and 496 DF, p-value: < 0.00000000000000022

All variables are significant with p < 0.05.


#Again run the Newey West Adjusted Regression Estimation
ht <- coeftest(model4, df = Inf, vcov = NeweyWest(model4, lag = 4, prewhite = FALSE))
ht

##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.02796876 0.44209318 9.1111 < 0.00000000000000022 ***
## crim -0.01844308 0.00517040 -3.5671 0.000361 ***
## chas 0.09236082 0.04287646 2.1541 0.031231 *
## nox -0.96195431 0.19956954 -4.8201 0.000001434532 ***
## rm 0.10024354 0.03844047 2.6078 0.009114 **
## dis -0.04909814 0.01007714 -4.8722 0.000001103473 ***
## rad 0.00856322 0.00268343 3.1911 0.001417 **
## ptratio -0.04397140 0.00501692 -8.7646 < 0.00000000000000022 ***
## black 0.00047917 0.00019930 2.4042 0.016206 *
## lstat -0.02336565 0.00402656 -5.8029 0.000000006519 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#List of residuals (resid() ignores vcov arguments, so a plain call suffices)
res <- resid(model4)
df1 <- cbind(df, res)
bptest(df1$medv_n ~ . -indus -age -zn -tax -medv, data = df1, studentize = TRUE)
##
## studentized Breusch-Pagan test
##
## data: df1$medv_n ~ . - indus - age - zn - tax - medv
## BP = 13.13, df = 10, p-value = 0.2165

#p > 0.05, so we do not reject the null hypothesis: there is no heteroscedasticity in the model

Autocorrelation Check
#Durbin Watson Test
dwtest(df1$medv_n ~ . -indus -age -zn -tax -medv, data = df1)

##
## Durbin-Watson test
##
## data: df1$medv_n ~ . - indus - age - zn - tax - medv
## DW = 2.2715, p-value = 0.9971
## alternative hypothesis: true autocorrelation is greater than 0

This test gives a p-value greater than 0.05, so we do not reject the null hypothesis: there is no autocorrelation present in the data.

Multicolinearity check
vif(df[,-c(2,3,7,10,14,15)])

## Variables VIF
## 1 crim 5.041150
## 2 chas 1.057640
## 3 nox 3.855603
## 4 rm 1.780811
## 5 dis 2.894817
## 6 rad 5.053303
## 7 ptratio 1.580296
## 8 black 1.365060
## 9 lstat 2.786487

Cross-Section Data: Random Split: TRAIN AND TEST (70:30)


set.seed(1234)
split1=sample.split(df$medv,SplitRatio=0.70)
#train
train=subset(df,split1==TRUE)
#test
test=subset(df,split1==FALSE)
Re-run Regression
m1=lm(train$medv_n ~ .-indus -age -zn -tax -medv, data=train)

#Extracting the robust standard errors from coeftest


k <- coeftest(m1, df = Inf, vcov = NeweyWest(m1, lag = 4, prewhite = FALSE)) %>% tidy()
k

## # A tibble: 10 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.55 0.433 8.21 2.22e-16
## 2 crim -0.00703 0.00509 -1.38 1.67e- 1
## 3 chas 0.101 0.0442 2.29 2.20e- 2
## 4 nox -0.963 0.220 -4.37 1.23e- 5
## 5 rm 0.147 0.0355 4.14 3.54e- 5
## 6 dis -0.0452 0.00993 -4.56 5.20e- 6
## 7 rad 0.00377 0.00278 1.36 1.75e- 1
## 8 ptratio -0.0368 0.00495 -7.44 9.88e-14
## 9 black 0.000562 0.000228 2.46 1.38e- 2
## 10 lstat -0.0220 0.00386 -5.70 1.19e- 8

#Replacing lm function coefficient with coeftest coefficient


m1$coefficients<- k$estimate
summary(m1)

##
## Call:
## lm(formula = train$medv_n ~ . - indus - age - zn - tax - medv,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40813 -0.10301 -0.01043 0.08599 0.77045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## [1,] 3.5530220 0.2114260 16.805 < 0.0000000000000002 ***
## [2,] -0.0070322 0.0041823 -1.681 0.09356 .
## [3,] 0.1012310 0.0316543 3.198 0.00151 **
## [4,] -0.9633140 0.1472050 -6.544 0.000000000210234 ***
## [5,] 0.1467302 0.0160457 9.145 < 0.0000000000000002 ***
## [6,] -0.0452316 0.0074374 -6.082 0.000000003084784 ***
## [7,] 0.0037667 0.0021876 1.722 0.08597 .
## [8,] -0.0368152 0.0049081 -7.501 0.000000000000516 ***
## [9,] 0.0005618 0.0001250 4.495 0.000009419998084 ***
## [10,] -0.0219837 0.0020088 -10.944 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 354 degrees of freedom
## Multiple R-squared: 0.8198, Adjusted R-squared: 0.8152
## F-statistic: 178.9 on 9 and 354 DF, p-value: < 0.00000000000000022

#List of residuals
res1<-resid(m1)
par(mfrow=c(2,2))
plot(m1)

The residuals-vs-fitted plot shows no systematic pattern, so there is no strong evidence of heteroscedasticity in the training fit.

Train data performance


#MSE
model.sum <- summary(m1)
(model.sum$sigma) ^ 2

## [1] 0.02504147

The mean squared error on the training data is about 0.025 (on the log scale).


Out-of-sample Prediction or test error (MSPE)
model4.pred.test <- predict(m1, newdata = test)
model4.mspe <- mean((model4.pred.test - test$medv_n) ^ 2)
c(MSPE = model4.mspe, R2 = summary(m1)$r.squared)

## MSPE R2
## 0.04096198 0.81976799

The mean squared prediction error on the test set is about 0.041 (log scale), and the training R-squared is about 0.82.
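Since the model predicts log(medv), the test error can also be expressed as an RMSE and back-transformed to the original price scale; a sketch (the variable names rmse_log, pred_medv and rmse_price are ours):

#RMSE on the log scale is the square root of the MSPE
rmse_log <- sqrt(model4.mspe)
#Back-transform predictions to $1000s for an error in price units
pred_medv <- exp(model4.pred.test)
rmse_price <- sqrt(mean((pred_medv - test$medv)^2))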

Conclusion
We applied linear regression models with different sets of variables; the best model excluded indus, age, tax and zn, which were insignificant. Heteroscedasticity was present in the initial model; it was handled with Newey-West robust standard errors, and the Breusch-Pagan test no longer rejected homoscedasticity after the residual variable was added to the dataset.
