
Jay Kapoor [Individual project]

STAT 4444
Prof. Hamdy Mahmood
10th April, 2022

Final Project - Analysis of Real Estate Data using simple linear regression and
Bayesian simple linear regression.

- Introduction

We have a Real Estate dataset with 5 columns: Sale Price, Age of the house, Percent College,
Bedrooms and Lot Size. The original dataset has 350 observations, but the project requirements
allow only 100 observations, so we use the sample_n() function in R to randomly select 100
observations.
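The sampling step can be sketched in base R as follows. This is only an illustration: the data frame `real_estate` is a simulated stand-in with made-up values, its column names are assumed, and the seed is an assumption added so that the draw can be reproduced.

```r
# Simulated stand-in for the 350-row dataset (values are made up for illustration)
set.seed(4444)  # assumed seed, only so the sample can be reproduced
real_estate <- data.frame(
  Sale.Price  = rnorm(350, mean = 250000, sd = 60000),
  Age         = runif(350, min = 0, max = 60),
  Pct.College = runif(350, min = 10, max = 70),
  Bedrooms    = sample(2:6, 350, replace = TRUE),
  Lot.Size    = runif(350, min = 5000, max = 40000)
)

# Draw 100 distinct row indices and keep those rows
idx   <- sample(nrow(real_estate), size = 100)
re100 <- real_estate[idx, ]

dim(re100)
```

Setting a seed before the draw means the same 100 rows are selected each time the script is run, which keeps later results comparable.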

Our goal for this analysis is to build models using the frequentist approach and the Bayesian
approach to predict the sale price of a house. The sale price is our response variable, while
the other variables are explanatory variables. After establishing the predictive models, we
test their efficiency and robustness using different measures. Finally, we compare and discuss
the contrast between the frequentist model and the Bayesian model to reach a conclusion. The
software tools used in this project are RStudio and Minitab.

- Summarizing the Data

We compute the mean, median and inter-quartile range of each variable; the inter-quartile
range depicts the dispersion of the data.
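A minimal sketch of these summaries in base R, assuming a numeric vector `price` of simulated (hypothetical) sale prices rather than the project data:

```r
set.seed(1)
price <- rnorm(100, mean = 250000, sd = 60000)  # hypothetical sale prices

# Center (mean, median) and spread (inter-quartile range) in one named vector
center_spread <- c(
  mean   = mean(price),
  median = median(price),
  IQR    = IQR(price)
)
center_spread

# The IQR is the distance between the 75th and 25th percentiles
q <- quantile(price, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand
```

Calling `summary()` on a whole data frame, as the appendix does, reports the quartiles for every column at once.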

We graph histograms to examine the distribution of each variable in the dataset. It should be
noted that the distribution of Sale Price is shaped like a bell curve, which suggests that it
is approximately normally distributed. This aligns with our goal of predicting the sale price,
and we can assume normality for the regression model.

We check for outliers in the dataset using scatter plots. We find outlying points in all of
the graphs, and upon further investigation we find that one of the houses listed for sale had
7 bedrooms, had the largest lot size and was one of the oldest houses. So we have one outlier
in our dataset.

Fitting the Regression Model

Regression equation:
Sale Price = -29298 - 804 Age + 1706 Pct College + 52706 Bedrooms + 2.265 Lot Size

The model summary gives us the diagnostic measures to check model fit and predictive
capability. The R-sq is 27.55%, which suggests that neither the fit nor the predictive
capability of this model is good. We will use other measures such as AIC/BIC and PRESS to
compare this model with another one. The p-values for the regressors suggest that all the
variables are significant to the model, and the small VIF values indicate that there is no
multicollinearity.
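The fit measures named above can be sketched in base R. The data frame `dat` is simulated and its coefficients are made up, so the numbers will not match the report; the project itself took these measures from the Minitab output.

```r
set.seed(2)
n <- 100
# Simulated predictors and response (all values hypothetical)
dat <- data.frame(
  Age         = runif(n, 0, 60),
  Pct.College = runif(n, 10, 70),
  Bedrooms    = sample(2:6, n, replace = TRUE),
  Lot.Size    = runif(n, 5000, 40000)
)
dat$Sale.Price <- 50000 - 800 * dat$Age + 1700 * dat$Pct.College +
  50000 * dat$Bedrooms + 2 * dat$Lot.Size + rnorm(n, sd = 80000)

fit <- lm(Sale.Price ~ Age + Pct.College + Bedrooms + Lot.Size, data = dat)

r_squared <- summary(fit)$r.squared
aic <- AIC(fit)
bic <- BIC(fit)

# PRESS: sum of squared leave-one-out prediction errors, e_i / (1 - h_ii)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors
predictors <- c("Age", "Pct.College", "Bedrooms", "Lot.Size")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = dat))$r.squared
  1 / (1 - r2)
})

round(c(R2 = r_squared, AIC = aic, BIC = bic, PRESS = press), 3)
round(vif, 3)
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others. PRESS is always at least as large as the residual sum of squares, so a large gap between the two points to weak out-of-sample prediction.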

The analysis of variance indicates that the number of bedrooms has the most impact on the
regression model, while Pct College and the age of the house have a similar effect on the
regression.

The histogram of the residuals looks roughly normal, and normality of the residuals is a
fundamental assumption when estimating a linear model. However, the normal probability plot is
S-shaped, which suggests that the model fit is not accurate. The deleted residuals indicate
that the variance is constant to some extent, but there are some outliers which we can spot in
the versus order and versus fits plots.
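A rough base R sketch of these residual checks, using simulated data and a single hypothetical predictor. It assumes that the deleted residuals correspond to what R returns from `rstudent()`, and the cutoff of 3 is a common rule of thumb rather than a value taken from the project.

```r
set.seed(3)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 2)   # hypothetical data
fit <- lm(y ~ x)

# Deleted (externally studentized) residuals: each observation is left out
# when estimating the error variance used to scale its own residual
del_res <- rstudent(fit)

# Flag observations whose deleted residual is large in absolute value
flagged <- which(abs(del_res) > 3)

# Coordinates of the normal probability plot, computed without drawing it;
# an S-shaped pattern in these points signals tails that differ from normal
qq <- qqnorm(del_res, plot.it = FALSE)

length(flagged)
head(cbind(theoretical = qq$x, observed = qq$y))
```

Plotting `del_res` against `fitted(fit)` or against the observation index gives the versus fits and versus order views.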

Bayesian Simple Linear Regression

We implement a Bayesian approach using R. In the Bayesian framework, we can interpret a
credible interval directly as the probability that a coefficient lies within that interval.
The appropriate libraries have been loaded.

The Bayesian model, which uses MCMC (Markov chain Monte Carlo) as its sampling tool, reports
the median estimate and the MAD_SD (a scaled median absolute deviation of the posterior draws)
for each coefficient. We plot the posterior density for each predictor and mark its median
estimate on the graph.

Evaluating the model parameters (Posterior)

We evaluate the model parameters by analyzing the posteriors using some specific statistical
measures. We use the describe_posterior() function from the bayestestR package for the
following output.

We have the 95% credible set for the model, which gives information about the uncertainty of
the regression coefficients. We will also use an equal-tailed interval and the highest-density
interval.
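To show how the two kinds of interval differ, here is a minimal base R sketch that computes both from a vector of posterior draws. The draws are a simulated, deliberately skewed stand-in, not output from the project's model, and the narrowest-window search is one simple way to approximate a highest-density interval.

```r
set.seed(5)
# Stand-in posterior draws for one coefficient (right-skewed on purpose);
# in practice these would come from the fitted model's MCMC output
draws <- rgamma(4000, shape = 3, rate = 0.002)

# Equal-tailed interval: cut 2.5% from each tail
eti_95 <- quantile(draws, probs = c(0.025, 0.975))

# Highest-density interval: the narrowest window covering 95% of the sorted draws
sorted <- sort(draws)
n_in   <- ceiling(0.95 * length(sorted))
starts <- seq_len(length(sorted) - n_in + 1)
widths <- sorted[starts + n_in - 1] - sorted[starts]
best   <- which.min(widths)
hdi_95 <- c(lower = sorted[best], upper = sorted[best + n_in - 1])

eti_95
hdi_95
```

For a symmetric posterior the two intervals nearly coincide. For a skewed one the highest-density interval is shorter and shifts toward the mode.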

The interpretation of any such credible interval in the Bayesian approach is that, given the
data, a coefficient lies between the low value and the high value with 95% probability.
Another measure we analyze is the pd value, or Probability of Direction, which plays a role
similar to a p-value in the Bayesian framework. It is the probability that the effect goes in
the positive or the negative direction. The Rhat value checks for convergence; we have a value
close to 1 for every variable, so we do not have any convergence problem with MCMC. The ESS is
the ‘effective sample size’ generated by the MCMC for each parameter; generally, the higher
the ESS the better.
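The pd and Rhat measures can be illustrated in base R on simulated chains. The chains below are independent normal draws standing in for MCMC output, and the Rhat shown is the classic Gelman-Rubin ratio, which is an assumption made for simplicity and may differ from the variant reported by the fitting software.

```r
set.seed(6)
# Four stand-in chains of 1000 draws each for one coefficient
# (independent normal draws, so the chains agree by construction)
chains <- matrix(rnorm(4000, mean = -800, sd = 300), nrow = 1000, ncol = 4)
draws  <- as.vector(chains)

# Probability of direction: share of draws on the dominant side of zero
pd <- max(mean(draws > 0), mean(draws < 0))

# Classic Gelman-Rubin R-hat: compares between-chain and within-chain variance
n_iter  <- nrow(chains)
W       <- mean(apply(chains, 2, var))        # within-chain variance
B       <- n_iter * var(colMeans(chains))     # between-chain variance
var_hat <- (n_iter - 1) / n_iter * W + B / n_iter
rhat    <- sqrt(var_hat / W)

c(pd = pd, rhat = rhat)
```

A pd near 1 says that almost all of the posterior mass lies on one side of zero. An Rhat well above 1 would indicate that the chains disagree and the sampler should be run longer.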

We also get the coefficient estimates, using the package insight.

Sale Price = -29298 - 804 Age + 1706 Pct College + 52706 Bedrooms + 2.265 Lot Size

The coefficient estimates from our regression model and our Bayesian model are nearly the
same, with the Bayesian coefficients being on the higher side. The 95% intervals for both
models are compared below. The frequentist method gives us a narrower interval, so by this
measure we can say that the linear model is better than the Bayesian model.

Appendix

The R code used for this project.

library(readxl)  # provides read_excel()
library(dplyr)   # provides sample_n()
RealEstate <- read_excel("D:/STAT 4444/RealEstate.xls")
RealEstate
#Taking 100 samples from the main dataset
RE100 <- sample_n(RealEstate, 100)

names(RE100) <- make.names(names(RE100), unique=TRUE)

RE100
hist(RE100$Age)
hist(RE100$Pct.College)
hist(RE100$Bedrooms)
hist(RE100$Lot.Size)
hist(RE100$Sale.Price)

qqnorm(RE100$Age, main = "Normal Q-Q plot for Age")


qqnorm(RE100$Pct.College, main = "Normal Q-Q plot for Pct College")
qqnorm(RE100$Bedrooms, main = "Normal Q-Q plot for Bedrooms")
qqnorm(RE100$Lot.Size, main = "Normal Q-Q plot for Lot size")
qqnorm(RE100$Sale.Price, main = "Normal Q-Q plot for Sale Price")

summary(RE100)

install.packages("mlbench")
library(mlbench)
install.packages("rstanarm")
library(rstanarm)
install.packages("bayestestR")
library(bayestestR)
library(bayesplot)
library(insight)
library(broom)

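RealEstate <- read_excel("D:/STAT 4444/RealEstate.xls")
# Strip spaces from the column names so that the formula and the `pars`
# arguments below can refer to SalePrice, PctCollege and LotSize
names(RealEstate) <- gsub(" ", "", names(RealEstate))
RealEstate

model_bayes <- stan_glm(SalePrice ~ ., data = RealEstate, seed = 111)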

print(model_bayes, digits = 3)

mcmc_dens(model_bayes, pars = c("Age"))+
vline_at(-805.089, col="red")

mcmc_dens(model_bayes, pars=c("PctCollege"))+
vline_at(1688.493, col="red")

mcmc_dens(model_bayes, pars=c("Bedrooms"))+
vline_at(52794.568, col="red")

mcmc_dens(model_bayes, pars=c("LotSize"))+
vline_at(2.260, col="red")

BIC(model_bayes)

post <- get_parameters(model_bayes)


print(purrr::map_dbl(post,median),digits = 3)

print(purrr::map_dbl(post, map_estimate),digits = 3)

hdi(model_bayes)
eti(model_bayes)
