Housepriceprediction StatisticalAnalysis

A Project Report on
A Statistical Analysis on
House Price Prediction and
Understanding its
Determinants using Machine
Learning Techniques
By:
1. Subhodip Pal
2. Debpratik Ghosh
3. Sabyasachi Bhar
- 2021
TABLE OF CONTENTS
ABSTRACT 2
1. INTRODUCTION 3
1.1.Problem Statement 3
1.2.Focus of the Study 3
1.3.Data Source 3
1.4. Industry Overview 3
2. LITERATURE REVIEW 5
3. MAIN TEXT 14
3.1. Background of the Study 9
3.2.Details of the Study 9
4. DATA ANALYSIS, VISUALIZATION AND INTERPRETATION 12

4.1. Missing Value Analysis 12
4.2. Outlier Detection 13
4.3. Factor Analysis 17
4.4. Multiple Linear Regression 24
4.5. Data Visualization 37
5. RESULT AND DISCUSSION 40
6. CONCLUSION 42
7. LIMITATIONS OF THE STUDY 44
8. KEY LEARNING FROM THE PROJECT 45
9. REFERENCES 46
10. APPENDICES 48
A Statistical Analysis on House Price Prediction and Understanding its Determinants using Machine Learning Techniques Page 1
ABSTRACT
Real estate industry is one of the least transparent and most competitive industry in the world.
In this world of rapid development, each and every country is the witness of urbanization and
commercialization. House prices change day by day as there is no amortization of fixed
assets like land or houses and they keep increasing. As there is lack of transparency,
developers often hoard up the prices and hype it more than their actual prices without doing
proper valuation. So, buyers should be vigilant enough during they buy properties as they are
investing their hard earned money or savings into it. Therefore the motive of this paper is to
forecast the coherent house prices for non-house holders based on their financial provisions
and their aspirations. The aim of the paper is also to help the developers to estimate the
selling cost of a house perfectly. This paper provides the opportunity to predict the price of
the house depending upon the features and specifications of the properties like square feet of
the house, number of bedrooms, ambiance and other extra amenities, using some data
analysis techniques like regression, factor analysis and other machine learning approaches.
This mechanism is often called feature engineering. This approach will help to maintain
uniformity in pricing in real estate industry and increase reliability among the customers. As
a part of the data analysis, multiple linear regression has been used to predict and develop a
model to find out relationship between the ‘house price’ of King County (USA), which is the
dependent variable and other eight independent variables, bathrooms, bedrooms, sqft_living,
sqft_lot, view, conditions, grade and sqft_above by fitting a linear equation. The report goes
forward by understanding the most significant factors among the eight variables, which we
used as independent variables for multiple regression analysis. As factor analysis is a data
reduction technique, it helps to discover the noteworthy determinants of house sales. The
paper concludes the research with visualizing how the house price has had variation with
different amenities and how the price got an upward trend, year on year with visual context
through maps and graphs. Therefore it makes it easier to identify trends, patterns, and outliers
within large data set.
CHAPTER 1
INTRODUCTION
1.1 Problem Statement

The aim of this project is to predict the house sales in King County, Washington State, USA
using Multiple Linear Regression (MLR) and to search influential underlying factors from a
set of observed variables to understand the independence of the variables.
1.2. Focus of the Study

For the purpose of this project we have collected secondary data collected from Kaggle (a
repository of published data and code) to find out the determinants that help to fix the price of
the properties and play a crucial role during the purchase process by the customers.
Sl Objective Statistical Analysis Tools to be used

No.
1. R Programming
Descriptive Statistics (R- Studio) /
MS Excel
2. To understand the Principal Component analysis IBM SPSS
determinants of house sales (Unsupervised Machine
Learning)
3. To predict the price of House Multiple Linear Regression Jupyter Notebook
Analysis (Supervised Machine (Python 3)
Learning)
3. Data visualization of the output Power BI
1.3. Data Source

The dataset consists of house prices from King County, an area in the US State of
Washington, this data also covers Seattle. The dataset was obtained from Kaggle. This data
was published/released under CC0: Public Domain. Unfortunately, the user has not indicated
the source of the data. The dataset consisted of 21 variables and 21613 observations.
1.4. Industry Overview

Real estate industry is one of the biggest and globally recognized industry in the world which
is supposed to be an industry of Rs.65, 000 crore by 2040 (US$ 9.30 billion), from Rs.12, 000
crore (US$ 1.72 billion) in 2019. United States, United Kingdom, Japan, Germany and China
are having the lion’s share in the real estate market globally. Real estate industry comprises
of mainly four categories, a) Housing, b) Retail, c) Hospitality and d) Commercial. The
industry is not restricted within buying or selling of property, but also includes development,
appraisal, marketing, selling, leasing, and management of commercial, industrial, residential,
and agricultural properties. India saw an investment of Rs.43780 crore in 2019. Major drivers
of the rapid growth of real estate globally are emergence of nuclear families, rapid
urbanization and rising household income.
Real Estate industry has become the part and parcel human being and it is the basic need of
all walks of lives of every economic condition. Investment in real estate generally seems to
be profitable because their property values do not decline rapidly. There are several
stakeholders who are associated with the changes in the real estate price like household
investors, bankers, policy makers and many more. Investment in real estate sector seems to
be an attractive choice for the investments. That’s why, predicting the real estate value is an
important economic index.
Property prices depend on various parameters in the economy and society. House prices are
strongly dependent on the size of the house and its geographical location. We have also
considered various intrinsic parameters (such as number of bedrooms, living area, no. of
bathrooms, loft space, presence of water body, no. of floor, internal condition) and also
external parameters (such as location, proximity, how old the property is.).
CHAPTER 2
LITERATURE REVIEW
1. “Influence Factors and Regression Model of Urban Housing Prices Based on

Internet Open Access Data” by Hao Wu, Hongzan Jiao, Yang Yu, Zhigang Li,
Lingbo Liu and Zheng Zeng, 22ndMay;2018.
This research paper depicts how the commercialization of housing and the
deepening of urbanization in China, housing prices have had increasing influence on
urban development. The study is based on Wuhan City, China. This research has
followed mainly hedonic regression model, where one prior assumption has been
that, all influence factors have a constant influence regardless of their different
geospatial locations, while in reality, factors such as neighborhood feature and
accessibility usually have a strong autocorrelation. The researchers have collected
the data from various sources like, housing price related data available on
developers or the aggregator’s website containing name, average unit price, location
and other information of commercial housing properties such as the age, floor area
ratio, and the price change over the years. Point of Interest Data, Location Based
Service Data (LBS) and Urban Planning and Internet Map Data also have been quite
helpful data sources. Then the raw data have been pre-processed before analysing by
‘Coordinate calibration’ and ‘Division of spatial units’ techniques. In the analysis
part, five kinds of influencers on the real estate price (dependent variable) have been
taken care of. Those are: 1) Location feature (how close is the site to a commercial
sector), 2) Architectural and Neighborhood Feature (specifications and amenities),
3) Public Facility, 4) Traffic Accessibility Feature (transportation) and 5) Natural
Environment Feature (river view, water area and urban public green space). In
analysis part, housing prices and the 13 factors in five categories are coded and
analysed on SPSS bivariate correlation analysis. Other than that, geographic
weighted regression and regression based upon ANN have also been performed.
2. “Multiple Regressions in Analysing House Price Variations” By Aminah Md

Yusof and Syuhaida Ismail ; March 2012
The aim of this research paper is to forecast the real-estate price on which the
sales of real estate are very much dependent. The paper demonstrates how the
statistical analysis is capable of analyzing property investment by considering
multiple determinants. It considers various factors which are more rigorous enable
better investment decision making. Property price can be determined by several
factors broadly categorized into structural (physical characteristics), location and
neighbourhood attributes. These factors will be considered in property investment
decision making. To analyze house price variations multiple regression analysis
has been used where modeling house price as dependent variables using
independent variables seek to segregate the impact or contribution of each
independent variables in price variation. This process involved - An identification
of house price determinants, Data collection, Model development, and
Assimilation. The analysis shows that the potential investors or developers will be
able to identify importance factors to be taken into consideration when developing
and buying a house. The application of multiple regression analysis in a house
data set explains better decision making in property investment.
3. “Real estate value prediction using multivariate regression models” by R Manjula,

Shubham Jain, Sharad Srivastava and Pranav Rajiv Kher ; November,2017
The article was published on November 2017, this was a collective effort by R
Manjula, Shubham Jain, Sharad Srivastava and Pranav Rajiv Kher. Here the authors
have used simple linear regression and multivariate regression models along with
polynomial regression to predict the housing prices with good accuracy. They have
calculated root mean square error for each these models and more models are used so
as to have lower Residual Sum of Squares error. The dataset used by the researchers
have 21,000 records which are divided into training data and testing data in the ratio
80:20. The dataset used here contains the number of factors like, price, date sold,
number of bedrooms, floors and other records. In the multivariate regression models
instead of using only one model they have used several models along with several
features and then they have compared the residual sum of squares error of those
models to see which model was fit according to their multivariate fit in their model.
For demonstrating the polynomial regression, the researchers have used features
which have been multiplied to several powers for making better model fit. For the
polynomial regression, the researchers have also plotted graphs for various models on
different powers to observe the model as the feature complexity changes.
4. ‘A hybrid regression technique for house price prediction’ by Lu, Li and Yang,
2017
They examined the creative feature engineering and proposed a hybrid Lasso and
Gradient boosting regression model that assures better prediction. They used Lasso
regression selection of features. They used the same dataset as the one used in this
study. They did many repetitions in feature engineering to find the optimum number
of features that can improve the prediction. More features were added, better the
score evaluation they received from the Kaggle. Hence, they added 400 features on
top of the 79 given features. Furthermore, they used Lasso for feature selection to
remove the unused features and found that 230 features providing the best score by
running a test on Ridge, Lasso and gradient boosting.
5. ‘The logistic lasso and ridge regression in predicting corporate failure’ by Jose
Manuel Pereira, Mario Basto and Amelia Ferreira da Silva, 2016
The authors performed an analysis to look into three methods. Lasso, Ridge and
Stepwise Regression were performed in SPSS to develop an empirical formula to
predict corporate bankruptcy. They defined two types of errors. The first error is the
percentage of failed enterprises predicted failed by the model. The second error was
the percentage of good enterprises predicted failed by the model. The results of this
study showed that the lasso and ridge algorithms usually favour the categorical
variable which appears with more weightage in the training set when they are
compared to the stepwise algorithm performed in SPSS.
6. ‘Comparison of artificial neural network and multiple linear regression for

prediction of live weight in hair goats’ by Yyu J. Agric. Sci. 2017
A study was accomplished in 2017 by SunaAkkol, Ash Akilli, Ibrahim Cemal, where
they did a comparison of artificial neural network and multiple linear regression for
prediction. In their study, the impact of different morphological measures on live
weight has been modelled by artificial neural networks and multiple linear regression
analyses. They used three different back-propagation techniquesforANN,namely
Levenberg-Marquardt, Bayesian regularisation, and Scaled conjugate. They showed
that ANN is more successful than multiple linear regression in the prediction they
performed.
7. ‘The comparison of methods of ANN with linear regression using specific

variables’ by Reza Gharoie Ahangar, Mahmood Yahyazadehfar and Hassan
Pournaghshband, 2010 .
The authors estimated the stock price of activated companies in Tehran 13(Iran) stock
exchange by using Linear Regression and Artificial Neural Network algorithms. The
authors considered ten macroeconomic variables and 30 financial variables. Then,
they obtained seven final variables, including three macroeconomic variables and four
financial variables, to estimate the stock price using Independent Components
Analysis (ICA). They showed that the value of estimation error square mean, the
absolute mean of error percentage and R square coefficient will be decreased
significantly after training the model with ANN.
8. ‘An empirical analysis of the price development on the Swedish housing market’by
Nils Landberg, 2015.
The author analysed the price development on the Swedish housing market and the
influences of qualitative variables on Swedish house prices. Landberg has studied the
impact of square meter price, population, new houses, new companies, foreign
background, foreign-born, unemployment rate, the number of breaks-in, the total
number of crimes, the number of available jobs ranking. According to Nils,
unemployment rate, number of crimes, interest rate, and new houses have a negative
effect on house prices. Landberg showed that the real estate market is not easy to be
analysed compared with the goods market because many alternative costs are
affecting the increase in house prices. The study shows that the increase in population
and qualitative variables have a positive effect on house prices. The interest rate, the
average income level, GDP, and the focus 8 In contrast, the rise in interest rates has a
significant negative influence on house prices. Besides, it showed unemployment rate
effects negatively on house prices, but the sale price and unemployment rate are not
directly correlated with each.
CHAPTER - 3
MAIN TEXT
3.1. Background of the Study
Investment in real estate generally seems to be profitable because their property values do not
decline rapidly. Changes in the real estate price can affect various household investors,
bankers, policy makers and many. Investment in the real estate sector seems to be an
attractive choice for the investments. Thus, predicting the real estate value is an important
economic index. An accurate prediction on the house price is important to prospective
homeowners, developers, investors, appraisers, tax assessors and other real estate market
participants, such as, mortgage lenders and insurers. Traditional house price prediction is
based on cost and sale price comparison lacking an accepted standard and a certification
process. Therefore, the availability of a house price prediction model helps fill up an
important information gap and improve the efficiency of the real estate market.
The purpose of this paper is to allow developers estimate the house’s selling cost. This paper
offers the ability to predict the house price based on property features and measurements such
as square feet of the house, number of bedrooms, environment and other extra amenities,
using certain techniques of data analysis such as regression, factor analysis and other
approaches to machine learning. This strategy would help to preserve price uniformity in the
real estate industry and improve consumer reliability.
3.2. Details of the Study

Consider a real estate company that has a dataset containing the prices of properties in the
King County. It wishes to use the data to optimise the sale prices of the properties based on
important factors such as area, bedrooms, parking, etc.
Essentially, the company wants —
 To identify the variables affecting house prices, e.g. area, number of rooms,
bathrooms, etc.
 To create a linear model that quantitatively relates house prices with variables such as
number of rooms, area, number of bathrooms, etc.
 To know the accuracy of the model, i.e. how well these variables can predict house
prices.
In this dataset we have to predict the sales price of houses in King County, Seattle. It includes
homes sold between May 2014 and May 2015. Before doing anything we should first know
about the dataset what it contains what are its features and what is the structure of data.
The dataset contains 20 house features plus the price, along with 21613 observations.
The description for the 20 features is given below:
1. Id: - It is the unique numeric number assigned to each house being sold.
2. Date: - It is the date on which the house was sold out.
3. Price: - It is the price of house which we have to predict so this is our target variable
and apart from it are our features.
4. Bedrooms: - It determines number of bedrooms in a house.
5. Bathrooms: - It determines number of bathrooms in a bedroom of a house.
6. Sqft_living :- It is the measurement variable which determines the measurement of
house in square foot.
7. Sqft_lot:- It is also the measurement variable which determines square foot of the lot.
8. Floors: It determines total floors means levels of house.
9. Waterfront: This feature determines whether a house has a view to waterfront 0 means
no 1 means yes.
10. View: - This feature determines whether a house has been viewed or not 0 means no 1
means yes.
11. Condition: - It determines the overall condition of a house on a scale of 1 to 5.
12. Grade:- It determines the overall grade given to the housing unit, based on King
County grading system on a scale of 1 to 11
13. Sqft_above :- It determines square footage of house apart from basement.
14. Sqft_basement:- It determines square footage of the basement of the house.
15. Yr_built:- It detrmines the date of building of the house.
16. Yr_renovated :- It detrmines year of renovation of house.
17. Zipcode :- It determines the zipcode of the location of the house.
18. Latitude: - It determines the latitude of the location of the house.
19. Longitude: - It determines the longitude of the location of the house.
20. Sqft_living15 :- Living room area in 2015(implies - some renovations)
21. Sqft_lot15:- LotSize area in 2015(implies- some renovations)
By observing the data, we can know that the price is dependent on various features like
bedrooms(which is most dependent feature), bathrooms, sqft_living(second most important
feature), sqft_lot, floors etc. The price is also dependent on the location of the house where it
is present. The other features like waterfront, view are less dependent on the price. Of all the
records, there are no missing values, which help us creating better model.
Dependant variable: House price
Independent variables:
i) Bedrooms v) View
ii) Bathrooms vi) Condition
iii) Sqft_living vii) Grade
iv) Sqft_lot viii) Sqft_above
CHAPTER - 4
DATA ANALYSIS, INTERPRETATION & VISUALIZATION
4.1. Missing Value Analysis
Missing data (or missing values) is defined as the data value that is not stored for a variable in
the observation of interest. The problem of missing data is relatively common in almost all
research and can have a significant effect on the conclusions that can be drawn from the data.
Missing data present various problems. First, the absence of data reduces statistical power,
which refers to the probability that the test will reject the null hypothesis when it is false.
Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the
representativeness of the samples. Fourth, it may complicate the analysis of the study. Each
of these distortions may threaten the validity of the trials and can lead to invalid conclusions.
> housedata=read.csv(file.choose(), header = T)
> View(housedata)
> attach(housedata)
> colSums(is.na(housedata))
price bedrooms bathrooms sqft_living sqft_lot floors view condition
0 0 0 0 0 0 0 0
grade sqft_above
0 0
4.2. Outliers Detection
Outliers are extreme values that deviate from other observations on data, they may indicate a
variability in a measurement or experimental errors. Detecting outliers is of major importance
for almost any quantitative discipline. Outliers can skew statistical measures and data
distributions, providing a misleading representation of the underlying data and relationships.
Removing outliers from training data prior to modelling can result in a better fit of the data
and, in turn, more skilful predictions.
There are many methods for detection of outliers, in this project we have used the z-score
method for outlier detection. The z-score or standard score of an observation is a metric that
indicates how many standard deviations a data point is from the sample’s mean, assuming a
Gaussian distribution. This makes z-score a parametric method. Z-score is a simple, yet
powerful method to get rid of outliers in data if we are dealing with parametric distributions
in a low dimensional feature space.
> boxplot(price, main='boxplot price')
>par(mfrow=c(2,2))
>boxplot(bedrooms, main='boxplot bedroom')
>boxplot(bathrooms, main='boxplot bathroom')
>boxplot(sqft_living, main= 'boxplot sqft_living')
>boxplot(sqft_lot, main='boxplot sqft_lot')
> boxplot(floors, main='boxplot floors')
> boxplot(view, main='boxplot view')
> boxplot(condition, main='boxplot condition')
>boxplot(grade, main='box plot grade')
> boxplot(sqft_above, main='boxplot sqft_above')
> #find absolute value of z-score for each value in each column
> z_scores= as.data.frame(sapply(housedata, function(housedata) (abs(housedata

mean(housedata))/sd(housedata))))
> head(z_scores)
price bedrooms bathrooms sqft_living sqft_lot

0.866398697 0.3987279 1.4474301 0.9798124 0.228316
0.005940074 0.3987279 0.1756026 0.533622 0.189881
0.98045506 1.4739253 1.4474301 1.426221 0.1232956
0.173719113 0.6764694 1.1494223 0.130547 0.2440088
0.082159123 0.3987279 0.1490039 0.4354115 0.1696495
1.877759284 0.6764694 3.0970615 3.6367068 2.0961362
floors view condition grade sqft_above

0.9154058 0.3057524 0.6291723 0.5588228 0.7346906
0.9364841 0.3057524 0.6291723 0.5588228 0.4608302
0.9154058 0.3057524 0.6291723 1.4095545 1.2298053
0.9154058 0.3057524 2.4442374 0.5588228 0.8916782
0.9154058 0.3057524 0.6291723 0.2919089 0.1308922
0.9154058 0.3057524 0.6291723 2.8441039 2.5378966
> #only keep rows in data frame with all z-scores less than absolute value of 3
> no_outliers <- z_scores[!rowSums(z_scores>3), ]
> dim(no_outliers)
[1] 19908 10
> dim(housedata)
[1] 21613 9
4.4 Factor Analysis
Factor analysis is a statistical technique to reduce the number of factors from a large number
of measured variables. There are mainly two types of factor analysis in market research –
Exploratory factor analysis (EFA) & Confirmatory factor analysis (CFA).
EFA is a data driven factor analysis but CFA is performed according to the model, equation
or the hypothesis.
Reason for using Factor Analysis for this Study:
Key Decision i. Should we conduct factor analysis for this study or not!
for Factor
Analysis:- ii. ‘Exploratory factor analysis’ or ‘Confirmatory factor analysis
- which factor analysis should we perform!
iii. How many factors or variables should we consider!
Method ‘Exploratory factor analysis’

Used:-
Software ‘IBM SPSS’

Used:-
Purpose:- Factor analysis helps in factor reduction from a large number of variables
and helps in analysing the independence of the variable. It mainly uses
correlation technique to understand the overlapping of the factors and helps
in reducing those in a significant way.
In this study, for factor analysis we are considering those variables only
which we used as independent variables for multiple regression analysis. By
this we want to know whether those variables are independent or not.
bedrooms view
bathrooms condition
sqft_living grade
sqft_lot sqft_above
The above 8 variables are taken into consideration for Factor Analysis.
Next step is to determine the method of factor analysis. Two basic approaches of factor
analysis are – Principal Component Analysis, Common Factor Analysis.
‘Principal Component Analysis’ is recommended when the primary concern is to determine

the minimum number of factors that will account for maximum variance in the data for use in
subsequent multivariate analysis.
In ‘Common Factor Analysis’ the factors are estimated based only on the common variance.
Communalities are inserted in the diagonal of the correlation matrix. This method is
appropriate when the primary concern is to identify the underlying dimensions and the
common variance is of interest.
Here in this study we will be using ‘Principal Component Analysis.’
Descriptive Statistics
Mean Std. Deviation Analysis N
bedrooms 3.37 .930 21613
bathrooms 2.1148 .77016 21613
sqft_living 2079.90 918.441 21613
sqft_lot 15106.97 41420.512 21613
view .23 .766 21613
condition 3.41 .651 21613
grade 7.66 1.175 21613
sqft_above 1788.39 828.091 21613
The first output from the analysis is a table of descriptive statistics for all the variables under
investigation. Typically, the mean, standard deviation and number of respondents (N) who
participated in the survey are given. There are total 21613 responses.
Looking at the mean, one can conclude that ‘sqft_lot’ is the most important variable that
influences the house price most. It has the highest mean of 15106.97
Correlation Matrixa
bedrooms bathrooms sqft_living sqft_lot view condition grade sqft_above
Correlation bedrooms 1.000 .516 .577 .032 .080 .028 .357 .478
bathrooms .516 1.000 .755 .088 .188 -.125 .665 .685
sqft_living .577 .755 1.000 .173 .285 -.059 .763 .877
sqft_lot .032 .088 .173 1.000 .075 -.009 .114 .184
view .080 .188 .285 .075 1.000 .046 .251 .168
condition .028 -.125 -.059 -.009 .046 1.000 -.145 -.158
grade .357 .665 .763 .114 .251 -.145 1.000 .756
sqft_above .478 .685 .877 .184 .168 -.158 .756 1.000
Sig. (1- bedrooms .000 .000 .000 .000 .000 .000 .000
tailed) bathrooms .000 .000 .000 .000 .000 .000 .000
sqft_living .000 .000 .000 .000 .000 .000 .000
sqft_lot .000 .000 .000 .000 .094 .000 .000
view .000 .000 .000 .000 .000 .000 .000
condition .000 .000 .000 .094 .000 .000 .000
grade .000 .000 .000 .000 .000 .000 .000
sqft_above .000 .000 .000 .000 .000 .000 .000
a. Determinant = .018
The next output from the analysis is the correlation coefficient. A correlation matrix is simple
a rectangular array of numbers which gives the correlation coefficients (r) between a single
variable and every other variables in the investigation. The correlation coefficient between a
variable and itself is always 1, hence the principal diagonal of the correlation matrix contains
1s. The correlation coefficients above and below the principal diagonal are the same. With
respect to Correlation Matrix Correlation value more than 0.5 show strong relation between
the variables.
KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy. .816
Bartlett's Test of Sphericity Approx. Chi-Square 87328.819
df 28
Sig. .000
The KMO measures the sampling adequacy (which determines if the responses given with
the sample are adequate or not) which should be close than 0.5 for a satisfactory factor
analysis to proceed. Kaiser (1974) recommend 0.5 (value for KMO) as minimum (barely
accepted), values between 0.7-0.8 acceptable, and values above 0.9 are superb.
Looking at the above table, the KMO measure is 0.816, which is more than 0.800 and
therefore the value is almost superb.
Bartlett’s test is another indication of the strength of the relationship among variables.
This tests the null hypothesis that the correlation matrix is an identity matrix.
An identity matrix is matrix in which all of the diagonal elements are 1 and all off diagonal
elements are close to 0. We want to reject this null hypothesis. From the above table, we can
see that the Bartlett’s Test of Sphericity is significant (0.000). This means that correlation
matrix is not an identity matrix.
Communalities
Initial Extraction
bedrooms 1.000 .564
bathrooms 1.000 .741
sqft_living 1.000 .901
sqft_lot 1.000 .744
view 1.000 .451
condition 1.000 .845
grade 1.000 .730
sqft_above 1.000 .830
Extraction Method: Principal Component Analysis.
The next item from the output is a table of communalities which shows how much of the in
the variables has been accounted for by the extracted factors.The table shows the value of
communalities for both initial and extraction. Initial value 1.000 defines that all the variables
are fully involved in factor analysis. The extraction communalities show the final
communalities value which is less than the initial communalities. Extraction communalities
value more than 0.400 indicates huge impact of that variable in factor analysis.
Total Variance Explained

Extraction Sums of Squared Rotation Sums of Squared
Initial Eigenvalues Loadings Loadings
% of Cumulative % of Cumulative % of Cumulative
Component Total Variance % Total Variance % Total Variance %
1 3.732 46.647 46.647 3.732 46.647 46.647 3.614 45.179 45.179
2 1.064 13.305 59.952 1.064 13.305 59.952 1.132 14.149 59.328
3 1.009 12.618 72.570 1.009 12.618 72.570 1.059 13.242 72.570
4 .913 11.410 83.980
5 .607 7.590 91.570
6 .333 4.161 95.731
7 .244 3.054 98.785
8 .097 1.215 100.000
Eigenvalue actually reflects the number of extracted factors whose sum should be equal to
number of items which are subjected to factor analysis. The next item shows all the factors
extractable from the analysis along with their eigenvalues.
The Eigenvalue table has been divided into three sub-sections, i.e. Initial Eigen Values,
Extracted Sums of Squared Loadings and Rotation of Sums of Squared Loadings. For
analysis and interpretation purpose we are only concerned with Extracted Sums of Squared
Loadings. Here one should note that Notice that the first factor accounts for 46.647% of the
variance, the second 13.305%, and the third 12.618% of the variance. All the remaining
factors are not significant.
The scree plot is a graph of the eigenvalues against all the factors. The graph is useful for
determining how many factors to retain. The point of interest is where the curve starts to
flatten. It can be seen that the curve begins to flatten between factors 3 and 4. Note also that
factor 4 onwards have an eigenvalue of less than 1, so only 3 factors have been retained.
Component Matrixa
Component
1 2 3
sqft_living .947
sqft_above .906
bathrooms .851
grade .850
bedrooms .640
condition .837
view .533
sqft_lot .807
a. 3 components extracted.
The above table shows the loadings (extracted values of each item under 3 factors) of the 8
variables on the 3 factors extracted. The higher the absolute value of the loading, the more the
factor contributes to the variable (We have extracted 3 factors wherein the 13 items are
divided into 3 factors according to most important items which similar responses in
component 1 and simultaneously in component 2 and 3)
Rotated Component Matrixa

Component
1 2 3
sqft_living .929
sqft_above .883
bathrooms .858
grade .825
bedrooms .700
sqft_lot .855
view .509
condition .912
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.
The idea of rotation is to reduce the number factors on which the variables under
investigation have high loadings. Rotation does not actually change anything but makes the
interpretation of the analysis easier. Rotation Method used here is Varimax rotation with
Kaiser Normalization.
Here we can find that sqft_living, sqft_above, bathrooms, grade, bedrooms are loaded under
factor 1.
Sqft_lot and view are loaded under factor 2.
And condition is loaded under factor 3.
Reproduced Correlations
bedrooms bathrooms sqft_living sqft_lot view condition grade sqft_above
Reproduced bedrooms .564a .581 .623 -.165 .140 .108 .520 .559
Correlation a
bathrooms .581 .741 .805 .066 .200 -.144 .723 .774
sqft_living .623 .805 .901a .177 .320 -.076 .799 .850
sqft_lot -.165 .066 .177 .744a .397 -.122 .197 .187
view .140 .200 .320 .397 .451a .308 .245 .242
condition .108 -.144 -.076 -.122 .308 .845a -.195 -.222
grade .520 .723 .799 .197 .245 -.195 .730a .778
sqft_above .559 .774 .850 .187 .242 -.222 .778 .830a
Residualb bedrooms -.065 -.046 .196 -.061 -.079 -.163 -.081
bathrooms -.065 -.050 .022 -.012 .019 -.058 -.088
sqft_living -.046 -.050 -.005 -.035 .017 -.037 .026
sqft_lot .196 .022 -.005 -.323 .113 -.083 -.003
view -.061 -.012 -.035 -.323 -.262 .006 -.074
condition -.079 .019 .017 .113 -.262 .050 .064
grade -.163 -.058 -.037 -.083 .006 .050 -.022
sqft_above -.081 -.088 .026 -.003 -.074 .064 -.022
a. Reproduced communalities
b. Residuals are computed between observed and reproduced correlations. There are 16 (57.0%)
nonredundant residuals with absolute values greater than 0.05.
‘Reproduced Correlation’ table consists of two components – ‘Reproduced Correlation’ and

‘Residual’. Reproduced correlation matrix is defined as per the extracted factors. The
reproduced correlation value should be as close to as the original correlation value that the
difference between those two values should nearly equal to zero. If the reproduced correlation
value is nearly equal to zero then if signifies that the extracted factors create a huge impact on
the result of factor analysis. Residual component shows the difference between the difference
between the original correlation and reproduced correlation. There are 16 (57.0%) no
redundant residuals with absolute values greater than 0.05.
Component Score Coefficient Matrix

Component
1 2 3
bedrooms .241 -.278 .211
bathrooms .248 -.076 -.026
sqft_living .252 .046 .056
sqft_lot -.120 .813 -.117
view .005 .447 .356
condition -.006 -.015 .861
grade .216 .069 -.083
sqft_above .235 .045 -.103
Rotation Method: Varimax with Kaiser Normalization.
Component Scores.
‘Component Score Coefficient Matrix’ table shows the coefficient of each variable. The
rotation method applied here is orthogonal varimax rotation.
4.3. Multiple Linear Regression
Multiple Linear Regression (MLR) is a supervised technique used to estimate the relationship
between one dependent variable and more than one independent variables. Identifying the
correlation and its cause-effect helps to make predictions by using these relations. To
estimate these relationships, the prediction accuracy of the model is essential; the complexity
of the model is of more interest. However, Multiple Linear Regression is prone to many
problems such as multicollinearity, noises, and over fitting, which effect on the prediction
accuracy.
Regularised regression plays a significant part in Multiple Linear Regression because it helps
to reduce variance at the cost of introducing some bias, avoid the over fitting problem and
solve ordinary least squares (OLS) problems. There are two types of regularisation
techniques L1 norm (least absolute deviations) and L2 norm (least squares). L1 and L2 have
different cost functions regarding model complexity.
In this study the variable which is predicting is known as dependent variable and independent
variable are those on the basis of which predictions are made. Regression and correlation
analysis helps in understanding the relations between two or more variables.
Ŷ = ß + ß X + ß X + ………………+ Ԑ
0 1 1 2 2 Multiple regression model
Ŷ is the value of dependent variable estimated from the regression equation.
X X are the independent variables.

1, 2
ß ß are the coefficients corresponding to the X X variables.

1, 2 1, 2
Ԑ is the error associated with the dependent variable.
ŷ = b + b x +b x + ……….. + e
0 1 1 2 2 Regression equation
Where;
ŷ is the dependent variable.
b b are the coefficients of corresponding to the x variable.
1, 2
x x are the independent variables, e is the error associated with the dependent variable.
1, 2
Reason for using Regression Analysis for this Study:
Method Multiple Linear Regression Analysis

Used :-
Software Anaconda Navigator - Jupyter Notebook

Used:-
Purpose :- Here in this study our objective is to predict the house price in King
County.
Regression analysis helps in analysing the relation between the variables.
House Price can be decided based on several other factors.
The dependent variable considers here in this study is the ‘House Price’
which signifies the sales of houses in King County. The independent
variables help in predicting the price of house and find the correlation
among the variables.
In this dataset the sales price of houses in King County, Seattle are present. It includes homes
sold between May 2014 and May 2015. Before doing anything we should first know about
the dataset what it contains what are its features and what is the structure of data.
By observing the data, we can know that the price is dependent on various features like
bedrooms (which is the most dependent feature), bathrooms, sqft_living (second most
important feature), sqft_lot, Sqft_above, Yr_built, Sqft_living15, Sqft_lot15. Of all the
records, there are no missing values. Initially the data set contains 21613 observations, but
after removing the outliers we get 20371 observations. Now this data set will help us creating
better model.
Now here in this study,
Y= Price X = View
5
X = Bedrooms
1 X = Condition
6
X = Bathrooms
2 X = Grade
7
X = Sqft_living
3 X = sqft_above
8
X = Sqft_lot
4
The above figure shows the data set in which the dependent and independent variables are
listed to perform the multiple linear regression analysis.
price bedrooms bathrooms
Mean 540182.1588 Mean 3.370841623 Mean 2.114757322

Standard Error 2498.831535 Standard Error 0.006326366 Standard Error 0.00523872
Median 450000 Median 3 Median 2.25
Mode 450000 Mode 3 Mode 2.5
Standard Deviation 367362.2317 Standard Deviation 0.930061831 Standard Deviation 0.770163157
Sample Variance 1.35E+11 Sample Variance 0.86501501 Sample Variance 0.593151289
Kurtosis 34.52244441 Kurtosis 49.06365318 Kurtosis 1.279902444
Skewness 4.021715573 Skewness 1.974299535 Skewness 0.511107573
Range 7625000 Range 33 Range 8
Minimum 75000 Minimum 0 Minimum 0
Maximum 7700000 Maximum 33 Maximum 8
Sum 11674956998 Sum 72854 Sum 45706.25
Count 21613 Count 21613 Count 21613
sqft_living sqft_lot view
Mean 2079.899736 Mean 15106.96757 Mean 0.234303428
Median 1910 Median 7618 Median 0
Mode 1300 Mode 5000 Mode 0
Sample Variance 843533.6814 Sample Variance 1715658774 Sample Variance 0.587242617
Sum 44952873 Sum 326506890 Sum 5064
condition grade sqft_above

Mean 3.40942951 Mean 7.656873178 Mean 1788.390691
Median 3 Median 7 Median 1560
Mode 3 Mode 7 Mode 1300
Sample Variance 0.423466512 Sample Variance 1.381703289 Sample Variance 685734.6673
Sum 73688 Sum 165488 Sum 38652488
 Central tendency. Themean and themedian are summary measures used to describe
central tendency - the most "typical" value in a set of values. With a normal
distribution, the mean is equal to the median.
 Skewness. Skewness is a measure of the asymmetry of a probability distribution. If
observations are equally distributed around the mean, the skewness value is zero;
otherwise, the skewness value is positive or negative. As a rule of thumb, skewness
between -2 and +2 is consistent with a normal distribution.
 Kurtosis. Kurtosis is a measure of whether observations cluster around the mean of
the distribution or in the tails of the distribution. The normal distribution has a
kurtosis value of zero. As a rule of thumb, kurtosis between -2 and +2 is consistent
with a normal distribution.
Together, these descriptive measures provide a useful basis for judging whether a dataset
satisfies the assumption of normality.
The mean is nearly equal to the median. And both skewness and kurtosis are between -2 and
+2.
Conclusion: These descriptive statistics are consistent with a normal distribution.

#importing numpy and pandas, seaborn
import numpy as np #linear algebra

import pandas as pd #datapreprocessing, CSV file I/O
import seaborn as sns #for plotting graphs
import matplotlib.pyplot as plt
Path='E:/4th Sem _MBA/BIA/Project_BIA/House_Data.csv'

housedata=pd.read_csv(Path)
print(housedata.shape)
(21613, 21)
#checking the data types

housedata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype

0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
# check for any correlations between variables

corr=housedata.corr()
sns.heatmap(corr)
<matplotlib.axes._subplots.AxesSubplot at 0x4d44128310>
#heatmap for all the required features for correlation with the
variables which we will consider for regression
df=housedata[['price','bedrooms','bathrooms','sqft_living',
'sqft_lot','view','condition','grade',
'sqft_above']]
fig,ax=plt.subplots(figsize=(10,10))
sns.heatmap(df.corr(),cmap='YlGnBu',annot=True,ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x4d444049a0>
The above table shows the correlation between the variables that are considered for multiple
regressions. The correlation table also helps in finding the multicollinearity between the
dependent and independent variables. Multicollinearity exists when the correlation value is
beyond 0.80. In this study all the ‘Pearson Correlation’ values are below 0.80 which signifies
there is no multicollinearity.
sns.pairplot(data=housedata,
x_vars=["bedrooms","bathrooms","sqft_living","sqft_lot","grade","sqft_a
bove"], y_vars=["price"])
<seaborn.axisgrid.PairGrid at 0x4d451ba1c0>
fromstatsmodels.formula.apiimport ols
fromsklearn.linear_modelimport LinearRegression
fromstatsmodels.formula.apiimport ols
fromstatsmodels.stats.anovaimport anova_lm
importmatplotlib.pyplotasplt
# Multiple Linear Regression Analysis (Supervised Machine Learning)
reg_model=ols(formula="price~bedrooms+bathrooms+sqft_living+sqft_lot+vi
ew+condition+grade+sqft_above",data=housedata).fit()
print(reg_model.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 4.87e+05. This might indicate that
there are
strong multicollinearity or other numerical problems.
 R- Square value shows the goodness of fit. R- Square value lies in between 0 to 1.
Here R-square value is 0.590 which signifies that the dependent variable, ‘Price of
House’ is 59.00% in association with the independent variables. R-square value can
only increase if we add extra explanatory variables.
 Adjusted R- Square is a measure that adjusts R-Square for a number of independent
variables in the regression model. It helps in analysing that the extra independent
variables which are added have really any significant or not in the regression model.
If by adding extra independent variables, the adjusted R-Square value decreases, then
we can eliminate those variables.
The Durbin Watson Statistic is a test statistic used in statistics to detect autocorrelation in
the residuals from aregression analysis. It is named after professor James Durbin, a British
statistician and econometrician, and Geoffrey Stuart Watson, an Australian statistician.
Serial correlation, also called autocorrelation, refers to the degree ofcorrelation between the
values of variables across different data sets. It is usually used when working with time series
data in which observations occur at different points in time.
How to Calculate the Durbin Watson Statistic:
The hypotheses followed for the Durbin Watson statistic:
H(0) = First-order autocorrelation does not exist.
H(1) = First-order autocorrelation exists.
The assumptions of the test are:
 Errors are normally distributed with a mean value of 0

 All errors are stationary.
Interpreting the Durbin Watson Statistic:
The Durban Watson statistic will always assume a value between 0 and 4. A value of DW = 2
indicates that there is no autocorrelation. When the value is below 2, it indicates a positive
autocorrelation, and a value higher than 2 indicates a negative serial correlation.
To test for positive autocorrelation at significance level α (alpha), the test statistic DW is
compared to lower and upper critical values:
If DW < Lower critical value: There is statistical evidence that the data is positively
autocorrelated
If DW > Upper critical value: There is no statistical evidence that the data is positively
correlated.
If DW is in between the lower and upper critical values: The test is inconclusive.
 2 is no autocorrelation.
 0 to <2 is positive autocorrelation (common in time series data).
 >2 to 4 is negative autocorrelation (less common in time series data).
A rule of thumb is that test statistic values in the range of 1.5 to 2.5 are relatively normal.
The significance of the entire regression model is determined using the F- Statistics.
Degree of Freedom (DoF) for the regression model is equal to the number of independent
variables (k) which is equal to 8.
DoF for the Residual = (N – k - 1) = (21613 – 8 - 1) = 21604
Here the F value = 3882
DoF for Numerator = 8
DoF for denominator = 21604
For 95% level of confidence; the model is statistically significant.
From the coefficient table we can find the value of the coefficients (ß – values) for the
corresponding independent variables.
Ŷ = -681800 - 36630 * bedroom - 14990 * bathroom + 217.64 * sqft_living - 0.33 *

sqft_lot + 87550 * view + 54020 * condition + 100700 * grade - 25.74 * sqft_above
anova_lm(reg_model)
df sum_sq mean_sq F PR(>F)
bedrooms 1 2.77E+14 2.77E+14 5007.213742 0.00E+00
bathrooms 1 5.32E+14 5.32E+14 9616.347052 0.00E+00
sqft_living 1 6.68E+14 6.68E+14 12072.58771 0.00E+00
sqft_lot 1 5.08E+12 5.08E+12 91.820263 1.05E-21

view 1 1.11E+14 1.11E+14 2012.264411 0.00E+00
condition 1 1.66E+13 1.66E+13 300.865024 6.07E-67
grade 1 1.06E+14 1.06E+14 1923.454078 0.00E+00
sqft_above 1 1.93E+12 1.93E+12 34.903839 3.52E-09
Residual 21604 1.19E+15 5.53E+10 NaN NaN
The probability value of all the variables is almost nearly equal to zero. Now, considering
95% confidence interval we can say all the variables are statistically significant for the
analysis and has an impact on regression model.
frompatsyimport dmatrices
fromstatsmodels.stats.outliers_influenceimport
variance_inflation_factor
#find design matrix for linear regression model using 'rating' as

response variable
y, X =
dmatrices("price~bedrooms+bathrooms+sqft_living+sqft_lot+view+condition
+grade+sqft_above",data=housedata,return_type='dataframe')
#calculate VIF for each explanatory variable

vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X.values, i) for i
inrange(X.shape[1])]
vif['variable'] = X.columns
vif
VIF variable
0 119.844731 Intercept
1 1.628989 bedrooms
2 2.55807 bathrooms
3 6.918588 sqft_living
4 1.049917 sqft_lot
5 1.155793 view
6 1.081997 condition
7 2.841084 grade
8 5.08775 sqft_above
The value for Variance Influence Factor(VIF) starts at 1 and has no upper limit. A general
rule of thumb for interpreting VIFs is as follows:
 A value of 1 indicates there is no correlation between a given explanatory variable

and any other explanatory variables in the model.
 A value between 1 and 5 indicates low correlation between a given explanatory
variable and other explanatory variables in the model.
 A value greater than 5 indicates moderate correlation between a given explanatory
variable and other explanatory variables in the model.
 A value greater than 10 indicates potentially severe corelation between a given

explanatory variable and other explanatory variables in the model.
From the above result we can interpret the variable bedroom, bathroom, sqft_lot, view,
condition and grade had low corelation with other independent variables.
For the variable sqft_living and sqft_above is having moderate corelation with other
independent variables.
4.5. Data Visualization
Method Used:- Data Visualization using graphical representation.
Software MS PowerBI
Used:-
Purpose:- This analysis helps in understanding how different variables are impacting
on house price.
The above line diagram shows how ‘price of houses’ has been fluctuating with the change of
the date of building of the house on the price of the house (‘yr_built’)
The above data dashboard shows how ‘average of price’ of house has been changing for
different variables like - ‘floors’, ‘conditions’, ‘grade’ and ‘view’
The above two line diagrams show how the 'average price’ of a house has been changing for
variables like ‘sqft_living’ and ‘sqft_above’.
CHAPTER 5
RESULT & DISCUSSION
Here for this study our aim is to predict the house price of King County, USA. Secondary data
has been collected from Kaggle and it has been analyzed using analytical tools. Different
variables of house price prediction have been considered to meet the objective of the study
like; bedrooms (which is the most dependent feature), bathrooms, sqft_living (second most
important feature), sqft_lot etc.
 Factor Analysis has been performed with the data to understand the most important
factors those have direct impact on house price.
Exploratory factor analysis has been performed in this study to reduce factors from large
number of variables.
Correlation table shows the relation between the considered variables. Correlation value more
than 0.500 show strong correlation between the variables.
Here the KMO value is 0.816 which is highly acceptable. So we can say sampling adequacy
has been achieved.
The Sig. value in Bartlett's Test of Sphericity is 0.000 which is less than 0.05 and indicates
that the model is statistically significant.
Extraction communalities value more than 0.400 indicates huge impact of that variable in
factor analysis.
Scree plot shows that factor 4 onwards have an eigenvalue of less than 1, so only 3 factors
have been retained.
 ‘Multiple linear regression and correlation’ analysis helps in understanding the relation
between the variables which are considered for understanding customer experience or
satisfaction. To predict the sales price of a house we have used Multiple Linear
Regression Analysis.
The output of the analysis shows that there are no multicolinearity lies between the
considered variables. Here in this study, R- square value is 0.590 which signifies that the
dependent variable, ‘house price’ is 59.00% in association with the independent variables.
The F-test from the ANOVA table shows that at 95% level of confidence, the model is
statistically significant.
From the coefficient table the multiple regression model can be derived as –
Ŷ = -681800 - 36630 * bedroom - 14990 * bathroom + 217.64 * sqft_living - 0.33 *
sqft_lot + 87550 * view + 54020 * condition + 100700 * grade - 25.74 * sqft_above
The above equation helps in predicting the house price with the change of the values of the
independent variables.
CHAPTER 6
CONCLUSION
Recent real estate statistics prove beyond doubt that property valuations have taken a turn for
the better. The real estate sector definitely is on the rise, with the growth thrust being
provided by important factors such as demographics, interest rates, location and the state of
the economy, which affect the prices of property in the country. Correct insights in regards to
the right time for purchase of property, price escalations, recessions in the real estate market
and other indicators, help in making valuable purchase decisions. The housing market is
influenced by the state of the economy, interest rates, real income and changes in the size of
the population. As well as these demand-side factors, house prices will be determined by
available supply. With periods of rising demand and limited supply, we will see rising house
prices, rising rents and increased risk of homelessness.
Fig: Main factors that affect the housing market

Source: economicshelp.org
As data acquisition is considered one of the biggest difficulties in the study of housing prices,
the present study uses an approach using the data collected from Kaggle of the House Price of
King County, USA which contributes to the methodology of research on housing prices.
These open network data, offer opportunities to acquire more diverse types and a much
greater number of sample and with greater timeliness, thus providing more valuable findings
forthe quantitative spatial analysis of different regions.
From the planning perspective, the present study chooses various factors that may influence
housing price, so as to make clear the relationship between the influence factors and housing
prices as well as their variations in space.
This project demonstrates how the statistical analysis could be utilized tobetter analyse
investment. With an application of ‘Multiple Linear Regression’ and ‘Factor Analysis’, the
contribution of each price determinant for the overall price of a house can be predicted and
understand the important underlying factors. The analysis thus, aids the decision making
process. In this case, thepotential investors or developers will be able to identify important
factors to be taken into consideration when developing and buying a house. As the
contribution of each variable could be quantified, it is possible to determine the significance
of each variable. The application of multiple regression analysis in a house data set explains
or models variation in house price which demonstrates good examples of the strategic
application of mathematical tools to aid analysis hence decision making in property
investment.
CHAPTER 7
LIMITATIONS OF THE STUDY
1. It is difficult to verify the accuracy of research conducted by a third party.

2. Secondary research results may be affected by the assumptions and motivations of the
organisation which carried it out or funded the research.
3. The time effect of the house price, which could potentially impact the estimated
results was ignored (the same house should have different price in different years,
assuming that age factor is constant). By its nature, this secondary research was
conducted in the past so it may no longer be fully relevant.
4. The house price used is not the actual sale price but the estimated price due to the
difficulty in obtaining the real data from the market.
5. The house price could be affected by some other economic factors (such as exchange
rate and interest rate) that are not included in the estimation.
CHAPTER 8
KEY LEARNINGS FROM THE PROJECT
In today’s real estate world, it has become tough to store such huge data and extract them for
one’s own requirement. Also, the extracted data should be useful. The system makes optimal
use of the Linear Regression Algorithm. The system makes use of such data in the most
efficient way. The linear regression algorithm helps to fulfil customers by increasing the
accuracy of estate choice and reducing the risk of investing in an estate. A lot of features that
could be added to make the system more widely acceptable. More factors like recession that
affect the house prices shall be added. In-depth details of every property will be added to
provide ample details of a desired estate. This will help the system to run on a larger level.
An accurate prediction on the house price is important for prospective homeowners,

developers, investors, appraisers, tax assessors and other real estate market participants, such
as, mortgage lenders and insurers. Traditional house price prediction is based on cost and sale
price comparison lacking an accepted standard. Therefore, the availability of a house price
prediction model helps fill up an important information gap and improve the efficiency of the
real estate market.
CHAPTER 9
REFERENCES
i. “Influence Factors and Regression Model of Urban Housing Prices Based on

Internet Open Access Data” by Hao Wu, Hongzan Jiao, Yang Yu, Zhigang Li,
Lingbo Liu and Zheng Zeng, 22ndMay;2018.
ii. “Multiple Regressions in Analysing House Price Variations” By Aminah Md

Yusof and Syuhaida Ismail ; March 2012
iii. “Real estate value prediction using multivariate regression models” by R Manjula,
Shubham Jain, Sharad Srivastava and Pranav Rajiv Kher ; November,2017
iv. ‘A hybrid regression technique for house price prediction’ by Lu, Li and Yang, 2017
v. ‘The logistic lasso and ridge regression in predicting corporate failure’ by Jose
Manuel Pereira, Mario Basto and Amelia Ferreira da Silva, 2016
vi. ‘Comparison of artificial neural network and multiple linear regression for prediction
of live weight in hair goats’ by Yyu J. Agric. Sci. 2017
vii. ‘The comparison of methods of ANN with linear regression using specific variables’
by Reza Gharoie Ahangar, Mahmood Yahyazadehfar and Hassan Pournaghshband,
2010 .
viii. ‘An empirical analysis of the price development on the Swedish housing market’by
Nils Landberg, 2015.
ix. ANDERSON, SWEENEY, WILLIAMS. Statistics for Business & Economics
x. P SASHIKALA, adapted. 2019. Business Analytics. Delhi: Cengage Learning India
Private Limited.
xi. P SASHIKALA, adapted. 2020. Advanced Business Analytics. Delhi: Cengage
Learning India Private Limited.
xii. CHURCHILL, LACOBUCCI, ISRAEL. Marketing Research: A South Asian
Prespective. Cengage Learning.
xiii. MALHOTRA, DASH. Marketing Research- An Applied Orientation; Seventh

Edition; 2019. Pearson.
xiv. Fundamentals data analysis & decision making models – theory. [Video] Instructed
by Manish Gupta. Udemy.
xv. wellbeing@school. Understanding and interpreting box plots. [Online]. Available
from: wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-
box-plots [Accessed 20 November 2020].
xvi. Priya Chetty. 2015; “Interpretation of factor analysis using SPSS”; Available on:
projectguru.in; [Accessed 4th January, 2021]
CHAPTER - 10
APPENDICES
1. See the entire Regression Analysis at: - https://tinyurl.com/regression-output-

python3
2. See the entire visualization at: https://tinyurl.com/house-data-visulization

Housepriceprediction StatisticalAnalysis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Housepriceprediction StatisticalAnalysis

Uploaded by

Copyright:

Available Formats

A Project Report on

4. DATA ANALYSIS, VISUALIZATION AND INTERPRETATION 12

5. RESULT AND DISCUSSION 40

8. KEY LEARNING FROM THE PROJECT 45

1.1 Problem Statement

1.2. Focus of the Study

Sl Objective Statistical Analysis Tools to be used

1.3. Data Source

1.4. Industry Overview

1. “Influence Factors and Regression Model of Urban Housing Prices Based on

2. “Multiple Regressions in Analysing House Price Variations” By Aminah Md

3. “Real estate value prediction using multivariate regression models” by R Manjula,

6. ‘Comparison of artificial neural network and multiple linear regression for

7. ‘The comparison of methods of ANN with linear regression using specific

3.1. Background of the Study

3.2. Details of the Study

Essentially, the company wants —

The description for the 20 features is given below:

4.1. Missing Value Analysis

> housedata=read.csv(file.choose(), header = T)

price bedrooms bathrooms sqft_living sqft_lot floors view condition

> boxplot(price, main='boxplot price')

> boxplot(view, main='boxplot view')

> boxplot(condition, main='boxplot condition')

>boxplot(grade, main='box plot grade')

> z_scores= as.data.frame(sapply(housedata, function(housedata) (abs(housedata

price bedrooms bathrooms sqft_living sqft_lot

floors view condition grade sqft_above

> no_outliers <- z_scores[!rowSums(z_scores>3), ]

Reason for using Factor Analysis for this Study:

iii. How many factors or variables should we consider!

Method ‘Exploratory factor analysis’

Software ‘IBM SPSS’

‘Principal Component Analysis’ is recommended when the primary concern is to determine

Here in this study we will be using ‘Principal Component Analysis.’

KMO and Bartlett's Test

Total Variance Explained

Rotated Component Matrixa

a. Rotation converged in 5 iterations.

Sqft_lot and view are loaded under factor 2.

And condition is loaded under factor 3.

‘Reproduced Correlation’ table consists of two components – ‘Reproduced Correlation’ and

Component Score Coefficient Matrix

4.3. Multiple Linear Regression

Ŷ is the value of dependent variable estimated from the regression equation.

X X are the independent variables.

ß ß are the coefficients corresponding to the X X variables.

Ԑ is the error associated with the dependent variable.

Method Multiple Linear Regression Analysis

Software Anaconda Navigator - Jupyter Notebook

Regression analysis helps in analysing the relation between the variables.

House Price can be decided based on several other factors.

Now here in this study,

price bedrooms bathrooms

Mean 540182.1588 Mean 3.370841623 Mean 2.114757322

condition grade sqft_above

Conclusion: These descriptive statistics are consistent with a normal distribution.

import numpy as np #linear algebra

Path='E:/4th Sem _MBA/BIA/Project_BIA/House_Data.csv'

#checking the data types

# Column Non-Null Count Dtype

# check for any correlations between variables

# Multiple Linear Regression Analysis (Supervised Machine Learning)