FinalProject STAT4444

Jay Kapoor [Individual project]
STAT 4444
Prof. Hamdy Mahmood
10th April, 2022
Final Project - Analysis of Real Estate Data using simple linear regression and
Bayesian simple linear regression.
- Introduction
We have a Real Estate dataset with 5 columns: Sale Price, Age of the house, Percent College,
Bedrooms and Lot Size. The original dataset has 350 observations, but per the project
requirement we are only supposed to use 100 observations, so we will use the sample_n()
function in R to randomly select 100 observations.
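The sampling step can be sketched in base R as follows. This is only an illustration: the data frame real_estate is a simulated stand-in for the 350-row dataset, and the seed is an assumption added so that the draw can be repeated. With dplyr loaded, sample_n(real_estate, 100) plays the same role as the two indexing lines.

```r
# Hypothetical stand-in for the 350-row dataset (the real file is not reproduced here).
set.seed(4444)  # assumed seed, only so the random draw is reproducible
real_estate <- data.frame(
  Sale.Price = rnorm(350, mean = 250000, sd = 60000),
  Age        = sample(1:80, 350, replace = TRUE)
)

# Base-R counterpart of dplyr::sample_n(real_estate, 100):
# draw 100 distinct row indices, then subset the data frame.
idx   <- sample(nrow(real_estate), size = 100)
re100 <- real_estate[idx, ]

nrow(re100)
```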
Our goal for this analysis is to build a model using the frequentist approach and the Bayesian
approach to predict the sale price of a house. The sale price will be our response variable, while
the other variables will be explanatory variables. After establishing the predictive models, we will
assess their fit and robustness using several diagnostic measures. Finally, we will compare and
contrast the frequentist model and the Bayesian model to reach a
conclusion. The software tools used in this project are RStudio and Minitab.
- Summarizing the Data
We compute the mean, median and inter-quartile range of each variable; the inter-quartile
range depicts the dispersion of the data.
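A minimal sketch of these summary statistics for one variable is given here. The vector sale_price holds simulated values, not the project data; summary() on the whole data frame reports the mean, median and quartiles for every column at once.

```r
set.seed(1)
sale_price <- rnorm(100, mean = 250000, sd = 60000)  # hypothetical values

stats <- c(
  mean   = mean(sale_price),
  median = median(sale_price),
  IQR    = IQR(sale_price)   # third quartile minus first quartile
)
print(round(stats))
```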
We graph histograms to examine the distribution of each variable in the dataset. It should be
noted that the distribution of Sale Price is shaped like a bell curve, which suggests that it is
approximately normally distributed. This aligns with our goal of predicting the sale price, and
we treat it as support for the normality assumption of the regression model.
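The visual check can be paired with a numerical one. The sketch below uses simulated values in place of Sale Price and adds a Shapiro-Wilk test, which is our own addition and not part of the original analysis; a large p-value is consistent with normality but does not prove it.

```r
set.seed(2)
sale_price <- rnorm(100, mean = 250000, sd = 60000)  # hypothetical values

# Histogram bin counts, computed without drawing the plot
h <- hist(sale_price, plot = FALSE)
h$counts

# Shapiro-Wilk test of normality (assumed extra check)
sw <- shapiro.test(sale_price)
sw$p.value
```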
We check for outliers in the dataset using scatter plots. We find outlying points in all of the
graphs, and upon further investigation we find that one of the houses listed for sale has 7
bedrooms, the largest lot size, and is one of the oldest houses. So we have one outlier in our dataset.
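Besides reading the scatter plots, an outlier can be flagged numerically. The sketch below applies the common 1.5 x IQR rule to a made-up vector of bedroom counts that includes one 7-bedroom house; both the data and the choice of rule are assumptions for illustration.

```r
# Hypothetical bedroom counts; the last house has 7 bedrooms
bedrooms <- c(2, 3, 3, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 7)

q     <- unname(quantile(bedrooms, c(0.25, 0.75)))  # first and third quartiles
fence <- 1.5 * (q[2] - q[1])                        # 1.5 times the IQR

# A point is flagged if it falls outside [Q1 - fence, Q3 + fence]
is_out <- bedrooms < q[1] - fence | bedrooms > q[2] + fence
which(is_out)
```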
Fitting the Regression Model
Regression equation:
Sale Price = -29298 - 804 Age + 1706 Pct College + 52706 Bedrooms + 2.265 Lot Size
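The fit above was produced in Minitab; an equivalent fit in R could look like the sketch below. The data frame homes is simulated: the coefficients of the reported equation are reused only to generate a plausible response, and the predictor ranges and the noise standard deviation are assumptions. The estimates it prints will therefore not match the reported ones exactly.

```r
set.seed(3)
n <- 100
homes <- data.frame(
  Age         = runif(n, 0, 60),
  Pct.College = runif(n, 10, 60),
  Bedrooms    = sample(2:5, n, replace = TRUE),
  Lot.Size    = runif(n, 5000, 40000)
)

# Simulated response: reported coefficients plus assumed normal noise
homes$Sale.Price <- with(
  homes,
  -29298 - 804 * Age + 1706 * Pct.College + 52706 * Bedrooms + 2.265 * Lot.Size
) + rnorm(n, sd = 80000)

# Ordinary least squares fit with all four predictors
fit <- lm(Sale.Price ~ Age + Pct.College + Bedrooms + Lot.Size, data = homes)
coef(fit)
```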
The model summary gives us diagnostic measures to check the model fit and its predictive
capability. The R-sq is 27.55%, which suggests that the model fit and the predictive capability
of this model are not good. We will use other measures such as AIC/BIC and PRESS to compare this
model with another one. The p-values for the regressors suggest that all the variables are
significant in the model, and the small VIF values indicate that there is no collinearity.
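The measures named above can all be computed from a fitted lm object in base R. The sketch below uses a small simulated dataset with hypothetical predictors x1 and x2; PRESS is obtained from the ordinary residuals and the leverages, and the VIF of x1 is computed from its definition rather than from an add-on package.

```r
set.seed(4)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))   # hypothetical predictors
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

r2 <- summary(fit)$r.squared

# PRESS: sum of squared leave-one-out prediction errors, e_i / (1 - h_ii)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# VIF for x1: 1 / (1 - R^2) from regressing x1 on the remaining predictors
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2, data = d))$r.squared)

c(R2 = r2, AIC = AIC(fit), BIC = BIC(fit), PRESS = press, VIF_x1 = vif_x1)
```

A PRESS value much larger than the residual sum of squares, or a VIF well above the usual rule-of-thumb cutoffs of 5 or 10, would be a warning sign.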
The analysis of variance indicates that the number of bedrooms has the most impact on the
regression model, while Pct College and the age of the house have similar effects on the
regression.
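One way to read such a table in R is to look at each term's share of the total sum of squares. The sketch below uses simulated data with assumed coefficients. Note that anova() on an lm object reports sequential sums of squares, so the shares depend on the order in which the terms enter the formula.

```r
set.seed(9)
n <- 100
d <- data.frame(
  Age      = runif(n, 0, 60),
  Bedrooms = sample(2:5, n, replace = TRUE)
)
# Hypothetical response with assumed coefficients and noise
d$Sale.Price <- 100000 - 800 * d$Age + 50000 * d$Bedrooms + rnorm(n, sd = 60000)

fit <- lm(Sale.Price ~ Age + Bedrooms, data = d)
tab <- anova(fit)

# Share of the total sum of squares attributed to each term and to the residuals
ss <- setNames(tab[["Sum Sq"]], rownames(tab))
round(ss / sum(ss), 3)
```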
The histogram of the residuals looks approximately normal, and normally distributed errors are a
fundamental assumption of inference in a linear model. However, the normal probability plot is
S-shaped, which suggests that the residuals depart from normality in the tails and that the model
fit is not accurate. The deleted residuals indicate that the variance is constant to some extent,
but there are some outliers, which we can spot in the versus order and versus fits plots.
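The same residual checks can be sketched in R. The example below uses simulated data; rstudent() gives the externally studentized (deleted) residuals, the cutoff of 3 is a common rule of thumb rather than something taken from the Minitab output, and the correlation of the Q-Q points is used as a rough numerical stand-in for how straight the normal probability plot is.

```r
set.seed(5)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100)   # hypothetical data with well-behaved errors
fit <- lm(y ~ x)

del_res <- rstudent(fit)            # deleted (externally studentized) residuals
flagged <- which(abs(del_res) > 3)  # candidate outliers under the assumed cutoff

# Q-Q coordinates without drawing the plot; a correlation near 1
# means the points lie close to a straight line
qq <- qqnorm(del_res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
qq_cor
```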
Bayesian Simple Linear Regression
We implement a Bayesian approach using R. Using the Bayesian framework, we can now
interpret a credible interval as an interval that contains the coefficient with the stated
posterior probability. Appropriate libraries have been loaded.
The median estimate and MAD_SD (the scaled median absolute deviation of the posterior draws)
are computed by the Bayesian model, which uses MCMC (Markov chain Monte Carlo) as its sampling
tool. We plot the posterior density of each predictor's coefficient, with the median estimate
marked on each plot.
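To show what these two numbers summarize, the sketch below computes them from a vector of draws. The draws are simulated around the Age estimate reported in the appendix, with an assumed spread; they are not the actual MCMC output.

```r
set.seed(6)
# Hypothetical posterior draws for one coefficient (spread is an assumption)
draws <- rnorm(4000, mean = -805, sd = 300)

post_median <- median(draws)

# mad() scales the median absolute deviation by 1.4826 so that it is
# comparable to a standard deviation when the draws are roughly normal
mad_sd <- mad(draws)

c(median = post_median, MAD_SD = mad_sd)
```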
Evaluating the model parameters (Posterior)
We evaluate the model parameters by analyzing the posteriors using some specific
statistical measures. We use the describe_posterior() function from the bayestestR library for the
following output.
We have the 95% credible set for the model, which gives information about the uncertainty of the
regression coefficients. We will also use an equal-tailed interval (ETI) and the highest
density interval (HDI).
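The difference between the two interval types shows up when the posterior is skewed. The sketch below computes both from simulated draws; hdi_interval() is a hypothetical helper written for this illustration and is not the bayestestR implementation.

```r
set.seed(7)
draws <- rgamma(4000, shape = 2, rate = 1)  # hypothetical right-skewed posterior

# Equal-tailed interval: cut 2.5% of the probability from each tail
eti95 <- quantile(draws, c(0.025, 0.975), names = FALSE)

# Highest density interval: the narrowest window holding 95% of the sorted draws
hdi_interval <- function(x, mass = 0.95) {
  x <- sort(x)
  n <- length(x)
  k <- ceiling(mass * n)                 # number of draws inside the interval
  starts <- seq_len(n - k + 1)           # every possible left end point
  widths <- x[starts + k - 1] - x[starts]
  i <- which.min(widths)
  c(x[i], x[i + k - 1])
}
hdi95 <- hdi_interval(draws)

rbind(ETI = eti95, HDI = hdi95)
```

For a symmetric posterior the two intervals nearly coincide; for a skewed one the HDI is shorter and shifts toward the mode.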
The interpretation of any such credible interval in the Bayesian approach is that, with 95%
probability (given the data), a coefficient lies above the low value and below the high value.
Another measure we analyze is the pd value, or Probability of Direction, which plays a role
similar to a p-value in the Bayesian framework. It is the posterior probability that the effect
is positive or negative, whichever is more probable. The Rhat value checks for convergence; we
have values close to 1 for every parameter, so we do not have any convergence problem with MCMC.
The ESS is the ‘Effective sample size’ generated by the MCMC for each parameter; generally, the
higher the ESS the better.
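Two of these diagnostics are simple enough to compute by hand. The sketch below uses simulated chains, not the fitted model, and the Rhat shown is the classic Gelman-Rubin formula; the packages used in this project may report a more refined version of it.

```r
set.seed(8)
# Hypothetical MCMC output: 1000 draws (rows) from each of 4 chains (columns)
chains <- matrix(rnorm(4 * 1000, mean = 1700, sd = 900), ncol = 4)

# Probability of direction: share of draws on the more probable side of zero
draws <- as.vector(chains)
pd <- max(mean(draws > 0), mean(draws < 0))

# Classic Rhat: compare within-chain and between-chain variance
n <- nrow(chains)
W <- mean(apply(chains, 2, var))   # average within-chain variance
B <- n * var(colMeans(chains))     # between-chain variance
rhat <- sqrt(((n - 1) / n * W + B / n) / W)

c(pd = pd, Rhat = rhat)
```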
We also get the coefficient estimates, using the package insight.
Sale Price = -29298 - 804 Age + 1706 Pct College + 52706 Bedrooms + 2.265 Lot Size
The coefficient estimates for our regression model and Bayesian model are all nearly the same, with
the Bayesian coefficients being on the higher side. The 95% confidence intervals of the linear
model and the 95% credible intervals of the Bayesian model are compared below. The frequentist
method gives us narrower intervals, and on that basis we can say that the linear model is better
than the Bayesian model.
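For the frequentist side of this comparison, the interval widths can be obtained from confint(). The sketch below uses simulated data and only illustrates how the width relates to the standard error; the Bayesian widths would come from the hdi() or eti() output in the same way.

```r
set.seed(10)
x <- runif(100, 0, 10)
y <- 5 + 3 * x + rnorm(100, sd = 4)   # hypothetical data
fit <- lm(y ~ x)

ci    <- confint(fit, level = 0.95)   # 95% confidence intervals
width <- ci[, 2] - ci[, 1]            # upper limit minus lower limit

# Each width equals 2 * t-quantile * standard error of the coefficient
se <- coef(summary(fit))[, "Std. Error"]
cbind(ci, width = width, se = se)
```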
Appendix
The R code used for this project.
library(dplyr)
library(readxl)  # provides read_excel()
RealEstate <- read_excel("D:/STAT 4444/RealEstate.xls")
RealEstate
#Taking 100 samples from the main dataset
RE100 <- sample_n(RealEstate, 100)
names(RE100) <- make.names(names(RE100), unique=TRUE)
RE100
#Histograms of each variable
hist(RE100$Age)
hist(RE100$Pct.College)
hist(RE100$Bedrooms)
hist(RE100$Lot.Size)
hist(RE100$Sale.Price)
qqnorm(RE100$Age, main = "Normal Q-Q plot for Age")

qqnorm(RE100$Pct.College, main = "Normal Q-Q plot for Pct College")
qqnorm(RE100$Bedrooms, main = "Normal Q-Q plot for Bedrooms")
qqnorm(RE100$Lot.Size, main = "Normal Q-Q plot for Lot size")
qqnorm(RE100$Sale.Price, main = "Normal Q-Q plot for Sale Price")
summary(RE100)
install.packages("mlbench")
library(mlbench)
install.packages("rstanarm")
library(rstanarm)
install.packages("bayestestR")
library(bayestestR)
library(bayesplot)
library(insight)
library(broom)
RealEstate <- read_excel("D:/STAT 4444/RealEstate.xls")

RealEstate
model_bayes<- stan_glm(SalePrice~., data=RealEstate, seed=111)
print(model_bayes, digits = 3)
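#Posterior density of each coefficient, with the median estimate marked in red
mcmc_dens(model_bayes, pars = c("Age"))+
vline_at(-805.089, col="red")
mcmc_dens(model_bayes, pars=c("PctCollege"))+
vline_at(1688.493, col="red")
#Bedrooms and LotSize are plotted without a reference line
mcmc_dens(model_bayes, pars=c("Bedrooms"))
mcmc_dens(model_bayes, pars=c("LotSize"))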
BIC(model_bayes)
post <- get_parameters(model_bayes)

print(purrr::map_dbl(post,median),digits = 3)
print(purrr::map_dbl(post, map_estimate),digits = 3)
hdi(model_bayes)
eti(model_bayes)