A Statistical Analysis on
House Price Prediction and
Understanding its
Determinants using Machine
Learning Techniques
By:
1. Subhodip Pal
2. Debpratik Ghosh
3. Sabyasachi Bhar
- 2021
TABLE OF CONTENTS
ABSTRACT 2
1. INTRODUCTION 3
1.1. Problem Statement 3
1.2. Focus of the Study 3
1.3. Data Source 3
1.4. Industry Overview 3
2. LITERATURE REVIEW 5
3. MAIN TEXT 9
3.1. Background of the Study 9
3.2. Details of the Study 9
6. CONCLUSION 42
7. LIMITATIONS OF THE STUDY 44
9. REFERENCES 46
10. APPENDICES 48
A Statistical Analysis on House Price Prediction and Understanding its Determinants using Machine Learning Techniques Page 1
ABSTRACT
The real estate industry is one of the least transparent and most competitive industries in the world. In this era of rapid development, every country is witnessing urbanization and commercialization. House prices change day by day: fixed assets such as land and houses do not amortize, so their prices keep rising. Because of this lack of transparency, developers often inflate prices well beyond a property's actual value without doing proper valuation. Buyers should therefore be vigilant when purchasing property, as they are investing their hard-earned money and savings. The motive of this paper is to forecast coherent house prices for prospective buyers based on their financial provisions and aspirations. The paper also aims to help developers estimate the selling price of a house accurately. It provides the opportunity to predict the price of a house from the features and specifications of the property, such as the square footage of the house, the number of bedrooms, the ambience and other extra amenities, using data analysis techniques such as regression, factor analysis and other machine learning approaches. This mechanism is often called feature engineering. The approach will help maintain uniform pricing in the real estate industry and increase reliability for customers. As part of the data analysis, multiple linear regression has been used to build a predictive model relating the dependent variable, the house price in King County (USA), to eight independent variables (bathrooms, bedrooms, sqft_living, sqft_lot, view, condition, grade and sqft_above) by fitting a linear equation. The report then identifies the most significant factors among these eight variables used in the multiple regression analysis. Since factor analysis is a data reduction technique, it helps discover the noteworthy determinants of house sales. The paper concludes by visualizing how house prices vary with different amenities and how prices have trended upward year on year, using maps and graphs, which makes it easier to identify trends, patterns and outliers within the large data set.
CHAPTER 1
INTRODUCTION
crore (US$ 1.72 billion) in 2019. The United States, the United Kingdom, Japan, Germany and China hold the lion's share of the global real estate market. The real estate industry comprises four main categories: a) housing, b) retail, c) hospitality and d) commercial. The industry is not restricted to the buying and selling of property; it also includes the development, appraisal, marketing, selling, leasing and management of commercial, industrial, residential and agricultural properties. India saw an investment of Rs. 43,780 crore in 2019. The major drivers of the rapid growth of real estate globally are the emergence of nuclear families, rapid urbanization and rising household income.
The real estate industry has become part and parcel of human life, and housing is a basic need for people in all walks of life and in every economic condition. Investment in real estate generally seems to be profitable because property values do not decline rapidly. Several stakeholders are associated with changes in real estate prices, such as household investors, bankers, policy makers and many more. Investment in the real estate sector seems to be an attractive choice for investment. That is why the predicted real estate value is an important economic indicator.
Property prices depend on various parameters in the economy and society. House prices are strongly dependent on the size of the house and its geographical location. We have also considered various intrinsic parameters (such as the number of bedrooms, living area, number of bathrooms, loft space, presence of a water body, number of floors and internal condition) as well as external parameters (such as location, proximity and the age of the property).
CHAPTER 2
LITERATURE REVIEW
This research paper depicts how, with the commercialization of housing and the deepening of urbanization in China, housing prices have had an increasing influence on urban development. The study is based on Wuhan City, China. The research mainly follows a hedonic regression model, with the prior assumption that every influencing factor has a constant effect regardless of its geospatial location, while in reality factors such as neighborhood features and accessibility usually show strong spatial autocorrelation. The researchers collected data from various sources, such as housing-price data available on developers' or aggregators' websites containing the name, average unit price, location and other information on commercial housing properties, such as the age, floor area ratio and the price change over the years. Point of Interest data, Location Based Service (LBS) data, and urban planning and internet map data were also quite helpful sources. The raw data were then pre-processed before analysis using 'coordinate calibration' and 'division of spatial units' techniques. In the analysis, five kinds of influencers on the real estate price (the dependent variable) were considered: 1) location features (how close the site is to a commercial sector), 2) architectural and neighborhood features (specifications and amenities), 3) public facilities, 4) traffic accessibility features (transportation) and 5) natural environment features (river view, water area and urban public green space). Housing prices and the 13 factors in these five categories were coded and analysed with SPSS bivariate correlation analysis. In addition, geographically weighted regression and ANN-based regression were also performed.
The aim of this research paper is to forecast real estate prices, on which the sales of real estate strongly depend. The paper demonstrates how statistical analysis is capable of analyzing property investment by considering multiple determinants; considering more rigorous factors enables better investment decision making. Property prices can be determined by several factors, broadly categorized into structural (physical characteristics), location and neighbourhood attributes, and these factors should be considered in property investment decision making. To analyze house price variation, multiple regression analysis has been used: modeling house price as the dependent variable against the independent variables seeks to segregate the impact, or contribution, of each independent variable on the price variation. The process involved identification of the house price determinants, data collection, model development and assimilation. The analysis shows that potential investors or developers will be able to identify the important factors to take into consideration when developing or buying a house. The application of multiple regression analysis to a housing data set supports better decision making in property investment.
The article, published in November 2017, was a collective effort by R. Manjula, Shubham Jain, Sharad Srivastava and Pranav Rajiv Kher. The authors used simple linear regression and multivariate regression models, along with polynomial regression, to predict housing prices with good accuracy. They calculated the root mean square error for each of these models, and additional models were used so as to obtain a lower residual sum of squares error. The dataset used by the researchers has 21,000 records, divided into training data and testing data in the ratio 80:20. The dataset contains a number of factors such as price, date sold, number of bedrooms, floors and other records. In the multivariate setting, instead of using only one model, they fitted several models with different feature sets and then compared the residual sum of squares errors of those models to see which fitted best. To demonstrate polynomial regression, the researchers raised features to several powers to improve the model fit, and plotted graphs for models of different degrees to observe how the model behaves as the feature complexity changes.
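The 80:20 train-test split and RMSE evaluation described in this article can be sketched in a few lines. The following Python snippet is illustrative only, using synthetic data rather than the 21,000-record set the authors worked with:

```python
import random

# Synthetic stand-in data: "price" roughly linear in a size-like feature,
# plus Gaussian noise. These are NOT the reviewed paper's records.
random.seed(42)
data = [(x, 50 * x + random.gauss(0, 100)) for x in range(100, 300)]

# 80:20 train/test split, as in the reviewed article.
random.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

# Fit simple linear regression y = a + b*x on the training portion.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
a = my - b * mx

# Root mean square error on the held-out 20%.
rmse = (sum((y - (a + b * x)) ** 2 for x, y in test) / len(test)) ** 0.5
print(round(b, 2), round(rmse, 1))
```

Because the model is scored only on the held-out 20%, the RMSE estimates how the model generalizes rather than how well it memorizes the training data.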
4. ‘A hybrid regression technique for house price prediction’ by Lu, Li and Yang,
2017
They examined creative feature engineering and proposed a hybrid Lasso and gradient boosting regression model that promises better prediction. They used Lasso regression for feature selection and used the same dataset as the one used in this study. They iterated many times on the feature engineering to find the optimum number of features that could improve the prediction: the more features they added, the better the evaluation score they received from Kaggle. Hence, they added 400 features on top of the 79 given features. Furthermore, they used Lasso for feature selection to remove the unused features and, by running tests with Ridge, Lasso and gradient boosting, found that 230 features provided the best score.
5. ‘The logistic lasso and ridge regression in predicting corporate failure’ by Jose
Manuel Pereira, Mario Basto and Amelia Ferreira da Silva, 2016
The authors performed an analysis of three methods: Lasso, Ridge and stepwise regression were used to develop an empirical model to predict corporate bankruptcy. They defined two types of error: the first is the percentage of failed enterprises predicted as non-failed by the model, and the second is the percentage of good enterprises predicted as failed by the model. The results of this study showed that, compared with the stepwise algorithm performed in SPSS, the lasso and ridge algorithms usually favour the category that appears with more weight in the training set.
A study was accomplished in 2017 by Suna Akkol, Ash Akilli and Ibrahim Cemal, in which they compared artificial neural networks and multiple linear regression for prediction. In their study, the impact of different morphological measures on live weight was modelled by artificial neural network and multiple linear regression analyses. They used three different back-propagation techniques for the ANN, namely Levenberg-Marquardt, Bayesian regularisation and scaled conjugate gradient. They showed that the ANN was more successful than multiple linear regression in the prediction they performed.
The authors estimated the stock prices of companies active on the Tehran (Iran) stock exchange using linear regression and artificial neural network algorithms. The authors considered ten macroeconomic variables and 30 financial variables. Using Independent Component Analysis (ICA), they then obtained seven final variables, including three macroeconomic variables and four financial variables, with which to estimate the stock price. They showed that the mean squared estimation error, the mean absolute percentage error and the R-squared coefficient decreased significantly after training the model with the ANN.
8. ‘An empirical analysis of the price development on the Swedish housing market’by
Nils Landberg, 2015.
The author analysed the price development on the Swedish housing market and the influence of qualitative variables on Swedish house prices. Landberg studied the impact of the square metre price, population, new houses, new companies, foreign background, foreign-born residents, the unemployment rate, the number of break-ins, the total number of crimes and the ranking of the number of available jobs. According to Landberg, the unemployment rate, the number of crimes, the interest rate and new houses have a negative effect on house prices. Landberg showed that the real estate market is not easy to analyse compared with the goods market, because many alternative costs affect the increase in house prices. The study shows that the increase in population and the qualitative variables have a positive effect on house prices, and that the interest rate, the average income level and GDP are also in focus. In contrast, the rise in interest rates has a significant negative influence on house prices. Besides, it showed that the unemployment rate affects house prices negatively, although the sale price and the unemployment rate are not directly correlated with each other.
CHAPTER - 3
MAIN TEXT
Investment in real estate generally seems to be profitable because property values do not decline rapidly. Changes in real estate prices affect various stakeholders: household investors, bankers, policy makers and many more. Investment in the real estate sector seems an attractive choice, and thus the predicted real estate value is an important economic indicator. An accurate prediction of house prices is important to prospective homeowners, developers, investors, appraisers, tax assessors and other real estate market participants, such as mortgage lenders and insurers. Traditional house price prediction is based on cost and sale price comparisons, lacking an accepted standard and a certification process. Therefore, the availability of a house price prediction model helps fill an important information gap and improves the efficiency of the real estate market.
The purpose of this paper is to allow developers to estimate a house's selling price. The paper offers the ability to predict the house price based on property features and measurements, such as the square footage of the house, the number of bedrooms, the environment and other extra amenities, using data analysis techniques such as regression, factor analysis and other machine learning approaches. This strategy would help preserve price uniformity in the real estate industry and improve reliability for consumers.
The objectives of the study are:
1. To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.
2. To create a linear model that quantitatively relates house prices to variables such as the number of rooms, area, number of bathrooms, etc.
3. To assess the accuracy of the model, i.e. how well these variables can predict house prices.
In this dataset we have to predict the sale price of houses in King County, Seattle. It includes homes sold between May 2014 and May 2015. Before doing anything else, we should first understand the dataset: what it contains, what its features are and how the data are structured. The dataset contains 20 house features plus the price, along with 21,613 observations.
1. Id: the unique number assigned to each house sold.
2. Date: the date on which the house was sold.
3. Price: the price of the house. This is the target variable we have to predict; the remaining columns are our features.
4. Bedrooms: the number of bedrooms in the house.
5. Bathrooms: the number of bathrooms in the house.
6. Sqft_living: the living area of the house in square feet.
7. Sqft_lot: the area of the lot in square feet.
8. Floors: the total number of floors (levels) in the house.
9. Waterfront: whether the house has a view of the waterfront; 0 means no, 1 means yes.
10. View: whether the house has been viewed or not; 0 means no, 1 means yes.
11. Condition: the overall condition of the house on a scale of 1 to 5.
12. Grade: the overall grade given to the housing unit, based on the King County grading system, on a scale of 1 to 11.
13. Sqft_above: the square footage of the house apart from the basement.
14. Sqft_basement: the square footage of the basement of the house.
15. Yr_built: the year the house was built.
16. Yr_renovated: the year the house was renovated.
17. Zipcode: the zip code of the house's location.
18. Latitude: the latitude of the house's location.
19. Longitude: the longitude of the house's location.
20. Sqft_living15: living area in 2015 (implies some renovations).
21. Sqft_lot15: lot size area in 2015 (implies some renovations).
By observing the data, we can see that the price depends on various features such as bedrooms (the most influential feature), bathrooms, sqft_living (the second most important feature), sqft_lot, floors, etc. The price also depends on the location of the house. Other features, such as waterfront and view, influence the price less. There are no missing values in any of the records, which helps us create a better model.
Dependent variable: House price
Independent variables:
i) Bedrooms
ii) Bathrooms
iii) Sqft_living
iv) Sqft_lot
v) View
vi) Condition
vii) Grade
viii) Sqft_above
CHAPTER - 4
DATA ANALYSIS, INTERPRETATION & VISUALIZATION
4.1. Missing Values
Missing data (or missing values) are data values that are not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data. Missing data present several problems. First, the absence of data reduces statistical power, which refers to the probability that a test will reject the null hypothesis when it is false. Second, lost data can cause bias in the estimation of parameters. Third, they can reduce the representativeness of the sample. Fourth, they may complicate the analysis of the study. Each of these distortions may threaten the validity of the study and can lead to invalid conclusions.
> View(housedata)
> attach(housedata)
> colSums(is.na(housedata))
      price    bedrooms   bathrooms sqft_living    sqft_lot      floors        view   condition
          0           0           0           0           0           0           0           0
      grade  sqft_above
          0           0
4.2. Outliers Detection
Outliers are extreme values that deviate from the other observations in the data; they may indicate variability in a measurement or experimental errors. Detecting outliers is of major importance for almost any quantitative discipline. Outliers can skew statistical measures and data distributions, giving a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modelling can result in a better fit of the data and, in turn, more skilful predictions.
There are many methods for detecting outliers; in this project we have used the z-score method. The z-score or standard score of an observation is a metric that indicates how many standard deviations a data point is from the sample mean, assuming a Gaussian distribution. This makes the z-score a parametric method. It is a simple yet powerful way to get rid of outliers when dealing with parametric distributions in a low-dimensional feature space.
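The z-score rule can be sketched in a few lines of Python (the project itself performs this step in R). The toy bedrooms column below, with one extreme record, is invented for illustration:

```python
from statistics import mean, stdev

def zscore_filter(values, threshold=3.0):
    """Keep only observations whose absolute z-score is below the threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs((v - m) / s) < threshold]

# A toy bedrooms column with one extreme record.
bedrooms = [3, 2, 4, 3, 3, 2, 5, 4, 3, 3, 2, 4, 33]
filtered = zscore_filter(bedrooms)
print(filtered)  # the extreme value 33 is dropped
```

The extreme record sits more than three standard deviations from the mean, so the |z| < 3 rule removes it while leaving every ordinary record in place.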
> par(mfrow=c(2,2))
> boxplot(bedrooms, main='boxplot bedroom')
> boxplot(bathrooms, main='boxplot bathroom')
> boxplot(sqft_living, main='boxplot sqft_living')
> boxplot(sqft_lot, main='boxplot sqft_lot')
> boxplot(floors, main='boxplot floors')
> boxplot(sqft_above, main='boxplot sqft_above')
> #find absolute value of z-score for each value in each column
> z_scores <- as.data.frame(sapply(housedata, function(x) abs((x - mean(x)) / sd(x))))
> head(z_scores)
> #only keep rows in data frame with all z-scores less than absolute value of 3
> no_outliers <- housedata[!rowSums(z_scores > 3), ]
> dim(no_outliers)
[1] 19908 10
> dim(housedata)
[1] 21613 9
4.4 Factor Analysis
Factor analysis is a statistical technique for reducing a large number of measured variables to a smaller number of factors. There are two main types of factor analysis in market research: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is data driven, whereas CFA is performed according to a model, equation or hypothesis.
Key decisions for factor analysis:
i. Should we conduct factor analysis for this study or not?
ii. Which factor analysis should we perform: exploratory or confirmatory?
Purpose: Factor analysis helps reduce a large number of variables to a smaller set of factors and helps in analysing the independence of the variables. It mainly uses correlations to understand the overlap between factors and to reduce them in a significant way. In this study we consider for factor analysis only those variables which we used as independent variables in the multiple regression analysis; in this way we want to know whether those variables are independent of one another.
The following 8 variables are taken into consideration for factor analysis: bedrooms, bathrooms, sqft_living, sqft_lot, view, condition, grade and sqft_above.
The next step is to determine the method of factor analysis. The two basic approaches are principal component analysis and common factor analysis. In common factor analysis the factors are estimated based only on the common variance, with communalities inserted in the diagonal of the correlation matrix. This method is appropriate when the primary concern is to identify the underlying dimensions and the common variance is of interest.
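The extraction itself (this study uses principal component analysis in SPSS, as the output tables below note) amounts to an eigendecomposition of the correlation matrix. A small numpy sketch, using an invented 3x3 correlation matrix rather than the study's 8x8 one:

```python
import numpy as np

# An illustrative 3x3 correlation matrix: two strongly correlated
# size-like variables and one weakly related condition-like variable.
# (Values are made up; the study's 8x8 matrix appears in the tables below.)
R = np.array([
    [1.00, 0.88, -0.06],
    [0.88, 1.00, -0.16],
    [-0.06, -0.16, 1.00],
])

# Principal component extraction = eigendecomposition of R.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]       # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Component loadings: eigenvectors scaled by the square root of the eigenvalue.
loadings = eigenvectors * np.sqrt(eigenvalues)

# Share of the total variance explained by each component.
explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 3))
```

The eigenvalues sum to the number of variables (here 3), which is why, in the SPSS tables that follow, explained-variance percentages and eigenvalues carry the same information.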
Descriptive Statistics
Mean Std. Deviation Analysis N
bedrooms 3.37 .930 21613
bathrooms 2.1148 .77016 21613
sqft_living 2079.90 918.441 21613
sqft_lot 15106.97 41420.512 21613
view .23 .766 21613
condition 3.41 .651 21613
grade 7.66 1.175 21613
sqft_above 1788.39 828.091 21613
The first output from the analysis is a table of descriptive statistics for all the variables under investigation. Typically the mean, standard deviation and number of observations (N) used in the analysis are given; there are 21,613 observations in total. Note that the variables sit on very different scales: 'sqft_lot' has by far the largest mean (15,106.97) and standard deviation, so raw magnitudes are not comparable across variables, and the factor extraction therefore works from the correlation matrix of the standardized variables rather than the raw values.
Correlation Matrixa
bedrooms bathrooms sqft_living sqft_lot view condition grade sqft_above
Correlation bedrooms 1.000 .516 .577 .032 .080 .028 .357 .478
bathrooms .516 1.000 .755 .088 .188 -.125 .665 .685
sqft_living .577 .755 1.000 .173 .285 -.059 .763 .877
sqft_lot .032 .088 .173 1.000 .075 -.009 .114 .184
view .080 .188 .285 .075 1.000 .046 .251 .168
condition .028 -.125 -.059 -.009 .046 1.000 -.145 -.158
grade .357 .665 .763 .114 .251 -.145 1.000 .756
sqft_above .478 .685 .877 .184 .168 -.158 .756 1.000
Sig. (1-tailed) bedrooms .000 .000 .000 .000 .000 .000 .000
bathrooms .000 .000 .000 .000 .000 .000 .000
sqft_living .000 .000 .000 .000 .000 .000 .000
sqft_lot .000 .000 .000 .000 .094 .000 .000
view .000 .000 .000 .000 .000 .000 .000
condition .000 .000 .000 .094 .000 .000 .000
grade .000 .000 .000 .000 .000 .000 .000
sqft_above .000 .000 .000 .000 .000 .000 .000
a. Determinant = .018
The next output from the analysis is the correlation matrix, a rectangular array of numbers giving the correlation coefficient (r) between each variable and every other variable in the investigation. The correlation coefficient between a variable and itself is always 1, so the principal diagonal of the matrix contains 1s, and the coefficients above and below the principal diagonal are the same. In this correlation matrix, a value of more than 0.5 shows a strong relation between the variables.
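Each entry of such a matrix is a plain Pearson correlation. A short Python sketch of the computation (the sample values below are illustrative, echoing the high sqft_living / sqft_above correlation of .877 in the table above):

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Illustrative values for two strongly related size variables.
living = [1180, 2570, 770, 1960, 1680, 5420]
above = [1180, 2170, 770, 1050, 1680, 3890]
print(round(pearson_r(living, above), 3))
```

A value near 1 (as here) is exactly the kind of overlap that motivates the data-reduction step: two variables carrying nearly the same information can be represented by a single factor.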
The KMO statistic measures sampling adequacy (whether the responses given in the sample are adequate), and it should exceed 0.5 for a satisfactory factor analysis to proceed. Kaiser (1974) recommends 0.5 as the bare minimum, values between 0.7 and 0.8 as acceptable, and values above 0.9 as superb. Looking at the table above, the KMO measure is 0.816, comfortably above the 0.5 threshold, so the sampling adequacy is good.
Bartlett’s test is another indication of the strength of the relationships among the variables. It tests the null hypothesis that the correlation matrix is an identity matrix, i.e. a matrix in which all the diagonal elements are 1 and all the off-diagonal elements are 0. We want to reject this null hypothesis. From the table above, we can see that Bartlett’s test of sphericity is significant (0.000), which means that the correlation matrix is not an identity matrix.
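Bartlett's statistic can be recomputed from quantities already reported: the determinant of the correlation matrix (.018, footnoted in the table above), n = 21613 observations and p = 8 variables, using the standard formula chi-square = -(n - 1 - (2p + 5)/6) * ln|R|:

```python
from math import log

# Recompute Bartlett's chi-square from the values reported in the output:
# correlation-matrix determinant = .018, n = 21613 observations, p = 8 variables.
n, p, det_R = 21613, 8, 0.018
chi_square = -(n - 1 - (2 * p + 5) / 6) * log(det_R)
df = p * (p - 1) // 2          # 28 degrees of freedom
print(round(chi_square, 1), df)
```

The statistic comes out in the tens of thousands, vastly larger than any critical chi-square value at 28 degrees of freedom, which is why SPSS reports the significance as 0.000.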
Communalities
Initial Extraction
bedrooms 1.000 .564
bathrooms 1.000 .741
sqft_living 1.000 .901
sqft_lot 1.000 .744
view 1.000 .451
condition 1.000 .845
grade 1.000 .730
sqft_above 1.000 .830
Extraction Method: Principal Component Analysis.
The next item from the output is a table of communalities, which shows how much of the variance in each variable has been accounted for by the extracted factors. The table shows the communality values for both the initial and the extraction stage. The initial value of 1.000 means that each variable is fully involved in the factor analysis. The extraction communalities are the final values, which are less than the initial ones; an extraction communality of more than 0.400 indicates a substantial contribution of that variable to the factor solution.
The eigenvalues reflect the variance captured by the extracted factors, and their sum equals the number of items subjected to factor analysis. The next item shows all the factors extractable from the analysis along with their eigenvalues.
The eigenvalue table is divided into three sub-sections, i.e. Initial Eigenvalues, Extraction Sums of Squared Loadings and Rotation Sums of Squared Loadings. For analysis and interpretation purposes we are concerned only with the Extraction Sums of Squared Loadings. Notice that the first factor accounts for 46.647% of the variance, the second for 13.305% and the third for 12.618%; all the remaining factors are not significant.
The scree plot is a graph of the eigenvalues against all the factors. The graph is useful for determining how many factors to retain: the point of interest is where the curve starts to flatten. It can be seen that the curve begins to flatten between factors 3 and 4. Note also that factors 4 onwards have eigenvalues of less than 1, so only 3 factors have been retained.
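The retention decision can be checked numerically: with 8 standardized variables the total variance is 8, so the explained-variance percentages reported above convert back to eigenvalues, and the Kaiser criterion (eigenvalue > 1) retains exactly the three factors kept here. A short Python sketch:

```python
# Total variance of 8 standardized variables is 8, so the reported
# explained-variance percentages convert back to eigenvalues.
n_vars = 8
variance_pct = [46.647, 13.305, 12.618]   # first three factors, as reported

eigenvalues = [pct / 100 * n_vars for pct in variance_pct]
retained = [ev for ev in eigenvalues if ev > 1]   # Kaiser criterion
print([round(ev, 3) for ev in eigenvalues], len(retained))
```

The second and third eigenvalues (about 1.06 and 1.01) only just clear the threshold, which matches the scree plot flattening between factors 3 and 4.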
Component Matrixa
Component
1 2 3
sqft_living .947
sqft_above .906
bathrooms .851
grade .850
bedrooms .640
condition .837
view .533
sqft_lot .807
Extraction Method: Principal Component Analysis.
a. 3 components extracted.
The above table shows the loadings (the extracted values of each item under the 3 factors) of the 8 variables on the 3 factors extracted. The higher the absolute value of the loading, the more the factor contributes to the variable. (We have extracted 3 factors, with the 8 items divided among them according to the component on which each loads most strongly.)
The idea of rotation is to reduce the number of factors on which the variables under investigation have high loadings. Rotation does not actually change anything, but it makes the interpretation of the analysis easier. The rotation method used here is varimax rotation with Kaiser normalization. Here we find that sqft_living, sqft_above, bathrooms, grade and bedrooms load on factor 1.
Reproduced Correlations
bedrooms bathrooms sqft_living sqft_lot view condition grade sqft_above
Reproduced bedrooms .564a .581 .623 -.165 .140 .108 .520 .559
Correlation bathrooms .581 .741a .805 .066 .200 -.144 .723 .774
sqft_living .623 .805 .901a .177 .320 -.076 .799 .850
sqft_lot -.165 .066 .177 .744a .397 -.122 .197 .187
view .140 .200 .320 .397 .451a .308 .245 .242
condition .108 -.144 -.076 -.122 .308 .845a -.195 -.222
grade .520 .723 .799 .197 .245 -.195 .730a .778
sqft_above .559 .774 .850 .187 .242 -.222 .778 .830a
Residualb bedrooms -.065 -.046 .196 -.061 -.079 -.163 -.081
bathrooms -.065 -.050 .022 -.012 .019 -.058 -.088
sqft_living -.046 -.050 -.005 -.035 .017 -.037 .026
sqft_lot .196 .022 -.005 -.323 .113 -.083 -.003
view -.061 -.012 -.035 -.323 -.262 .006 -.074
condition -.079 .019 .017 .113 -.262 .050 .064
grade -.163 -.058 -.037 -.083 .006 .050 -.022
sqft_above -.081 -.088 .026 -.003 -.074 .064 -.022
Extraction Method: Principal Component Analysis.
a. Reproduced communalities
b. Residuals are computed between observed and reproduced correlations. There are 16 (57.0%)
nonredundant residuals with absolute values greater than 0.05.
‘Component Score Coefficient Matrix’ table shows the coefficient of each variable. The
rotation method applied here is orthogonal varimax rotation.
Multiple Linear Regression (MLR) is a supervised technique used to estimate the relationship between one dependent variable and two or more independent variables. Identifying the correlations and their cause-and-effect structure helps in making predictions from these relations. In estimating these relationships the prediction accuracy of the model is essential, and the complexity of the model is also of interest. However, multiple linear regression is prone to problems such as multicollinearity, noise and overfitting, which affect prediction accuracy.
Regularised regression plays a significant part in Multiple Linear Regression because it helps
to reduce variance at the cost of introducing some bias, avoid the overfitting problem and
stabilise ordinary least squares (OLS) solutions. There are two types of regularisation
techniques: the L1 norm (least absolute deviations) and the L2 norm (least squares). L1 and L2 have
different cost functions regarding model complexity.
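The practical difference between the two penalties can be sketched with scikit-learn, whose `Ridge` estimator implements the L2 penalty and `Lasso` the L1 penalty (the synthetic data below is illustrative, not the King County dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
# Two strongly correlated predictors plus noise, mimicking multicollinearity
x1 = rng.normal(size=n)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)    # L1: can set some coefficients exactly to zero

print("OLS  :", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))
```

Both penalised fits trade a little bias for lower variance: the regularised coefficient vectors are never larger (in their respective norms) than the OLS solution.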
In this study the variable being predicted is known as the dependent variable, and the independent
variables are those on the basis of which predictions are made. Regression and correlation
analysis help in understanding the relations between two or more variables.
Ŷ = β₀ + β₁X₁ + β₂X₂ + … + ε        (multiple regression model)
ŷ = b₀ + b₁x₁ + b₂x₂ + … + e        (regression equation)
Where:
ŷ is the dependent variable;
b₁, b₂, … are the coefficients corresponding to the x variables;
x₁, x₂, … are the independent variables; and
e is the error associated with the dependent variable.
Reason for using Regression Analysis for this Study:
Purpose:- In this study our objective is to predict the house price in King County.
The dependent variable considered in this study is the ‘House Price’,
which signifies the sale prices of houses in King County. The independent
variables help in predicting the price of a house and in finding the correlation
among the variables.
This dataset contains the sale prices of houses in King County, Seattle. It includes homes
sold between May 2014 and May 2015. Before doing anything else, we should first understand
the dataset: what it contains, what its features are, and what the structure of the data is.
By observing the data, we can see that the price is dependent on various features like
bedrooms (which is the most dependent feature), bathrooms, sqft_living (the second most
important feature), sqft_lot, sqft_above, yr_built, sqft_living15 and sqft_lot15. Of all the
records, there are no missing values. Initially the dataset contains 21613 observations, but
after removing the outliers we get 20371 observations. This reduced dataset will help us in
creating a better model.
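The report does not state which outlier rule reduced the 21613 rows to 20371, but a common choice is the 1.5×IQR rule on price. A hedged sketch of that approach (the cutoff factor and column name are assumptions, and the toy data is invented):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where `col` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Illustrative toy data; the extreme 5000 row falls outside the IQR fence
toy = pd.DataFrame({"price": [100, 110, 105, 120, 115, 5000]})
cleaned = remove_iqr_outliers(toy, "price")
print(len(toy), "->", len(cleaned))
```

Applied to the real `housedata` frame, the same function would be called on the `price` column before fitting the model.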
Y  = Price
X₁ = Bedrooms
X₂ = Bathrooms
X₃ = Sqft_living
X₄ = Sqft_lot
X₅ = View
X₆ = Condition
X₇ = Grade
X₈ = Sqft_above
The above figure shows the data set in which the dependent and independent variables are
listed to perform the multiple linear regression analysis.
Statistic              sqft_living     sqft_lot          view
Mean                   2079.899736     15106.96757       0.234303428
Standard Error         6.247319071     281.7461116       0.005212562
Median                 1910            7618              0
Mode                   1300            5000              0
Standard Deviation     918.440897      41420.51152       0.766317569
Sample Variance        843533.6814     1715658774        0.587242617
Kurtosis               5.24309299      285.0778197       10.89302168
Skewness               1.471555427     13.06001896       3.395749593
Range                  13250           1650839           4
Minimum                290             520               0
Maximum                13540           1651359           4
Sum                    44952873        326506890         5064
Count                  21613           21613             21613
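These summary measures can be reproduced directly with pandas (a sketch on invented toy values; note that pandas' `kurt` reports excess kurtosis, matching the convention below where a normal distribution has kurtosis zero):

```python
import pandas as pd

data = pd.Series([290, 1300, 1910, 2080, 2500, 13540])  # toy sqft_living-like values

stats = {
    "mean": data.mean(),
    "median": data.median(),
    "std": data.std(),            # sample standard deviation (ddof=1)
    "variance": data.var(),
    "skewness": data.skew(),
    "kurtosis": data.kurt(),      # excess kurtosis: 0 for a normal distribution
    "range": data.max() - data.min(),
}
for name, value in stats.items():
    print(f"{name:>9}: {value:,.2f}")
```

Running the same calls on the real `sqft_living`, `sqft_lot` and `view` columns would reproduce the table above.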
Central tendency. The mean and the median are summary measures used to describe
central tendency - the most "typical" value in a set of values. With a normal
distribution, the mean is equal to the median.
Skewness. Skewness is a measure of the asymmetry of a probability distribution. If
observations are equally distributed around the mean, the skewness value is zero;
otherwise, the skewness value is positive or negative. As a rule of thumb, skewness
between -2 and +2 is consistent with a normal distribution.
Kurtosis. Kurtosis is a measure of whether observations cluster around the mean of
the distribution or in the tails of the distribution. The normal distribution has a
kurtosis value of zero. As a rule of thumb, kurtosis between -2 and +2 is consistent
with a normal distribution.
Together, these descriptive measures provide a useful basis for judging whether a dataset
satisfies the assumption of normality.
For sqft_living, the mean (2079.9) is fairly close to the median (1910) and its skewness (1.47)
falls within the ±2 rule of thumb, though its kurtosis (5.24) does not; sqft_lot and view show
pronounced skewness and kurtosis, indicating that these variables depart from normality.
print(housedata.shape)
(21613, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
# heatmap of the correlations between the variables
# which we will consider for regression
df = housedata[['price', 'bedrooms', 'bathrooms', 'sqft_living',
                'sqft_lot', 'view', 'condition', 'grade',
                'sqft_above']]
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.corr(), cmap='YlGnBu', annot=True, ax=ax)
The above table shows the correlation between the variables that are considered for multiple
regression. The correlation table also helps in detecting multicollinearity between the
dependent and independent variables. Multicollinearity exists when the correlation value is
beyond 0.80. In this study all the ‘Pearson Correlation’ values are below 0.80, which signifies
that there is no multicollinearity.
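The 0.80 screening rule described above can be automated: flag every pair of variables whose absolute Pearson correlation exceeds the threshold. A sketch on invented toy data (on the study's data no pair exceeded 0.80):

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.80):
    """Return (col_a, col_b, r) for every pair with |r| > threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy example: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=100)
toy = pd.DataFrame({"a": a,
                    "b": a + rng.normal(scale=0.05, size=100),
                    "c": rng.normal(size=100)})
pairs = high_correlation_pairs(toy)
print(pairs)  # only the ("a", "b", ...) pair should be flagged
```

Called on the `df` built for the heatmap above, an empty result would confirm the text's no-multicollinearity claim under this rule.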
sns.pairplot(data=housedata,
             x_vars=["bedrooms", "bathrooms", "sqft_living", "sqft_lot",
                     "grade", "sqft_above"],
             y_vars=["price"])
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

reg_model = ols(formula="price ~ bedrooms + bathrooms + sqft_living + sqft_lot"
                        " + view + condition + grade + sqft_above",
                data=housedata).fit()
print(reg_model.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 4.87e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The R-square value shows the goodness of fit and lies between 0 and 1.
Here the R-square value is 0.590, which signifies that 59% of the variation in the
dependent variable, ‘Price of House’, is explained by the independent variables. The
R-square value can only increase as extra explanatory variables are added.
Adjusted R-Square is a measure that adjusts R-Square for the number of independent
variables in the regression model. It helps in analysing whether the extra independent
variables that are added are really significant in the regression model.
If adding extra independent variables decreases the adjusted R-Square value, then
we can eliminate those variables.
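The adjustment can be computed directly from R², the sample size n, and the number of predictors k, via Adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A quick sketch using the study's reported R² (the report does not state the exact n used in the fit, so the n below, the full sample size, is an assumption):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Using the reported R^2 = 0.590 with k = 8 predictors
print(round(adjusted_r2(0.590, n=21613, k=8), 4))
```

With n this large relative to k, the adjustment barely changes R², which is why the summary's R-squared and adjusted R-squared are nearly identical.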
The Durbin-Watson statistic is a test statistic used to detect autocorrelation in
the residuals from a regression analysis. It is named after James Durbin, a British
statistician and econometrician, and Geoffrey Stuart Watson, an Australian statistician.
Serial correlation, also called autocorrelation, refers to the degree of correlation between the
values of a variable across successive observations. It usually arises when working with time-series
data in which observations occur at different points in time.
The Durbin-Watson statistic will always assume a value between 0 and 4. A value of DW = 2
indicates that there is no autocorrelation. When the value is below 2, it indicates positive
autocorrelation, and a value higher than 2 indicates negative serial correlation.
To test for positive autocorrelation at significance level α (alpha), the test statistic DW is
compared to lower and upper critical values:
If DW < Lower critical value: There is statistical evidence that the data is positively
autocorrelated
If DW > Upper critical value: There is no statistical evidence that the data is positively
autocorrelated.
If DW is in between the lower and upper critical values: The test is inconclusive.
2 is no autocorrelation.
0 to <2 is positive autocorrelation (common in time series data).
>2 to 4 is negative autocorrelation (less common in time series data).
A rule of thumb is that test statistic values in the range of 1.5 to 2.5 are relatively normal.
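statsmodels exposes this statistic directly, so applying it to a model's residuals is one line. A sketch on synthetic residuals (independent noise should give a value near 2, while strongly positively autocorrelated residuals fall well below 2):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
independent_resid = rng.normal(size=5000)   # no autocorrelation -> DW near 2

# AR(1) residuals with rho = 0.9 -> strong positive autocorrelation, DW near 0.2
ar = np.zeros(5000)
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

print(round(durbin_watson(independent_resid), 2))
print(round(durbin_watson(ar), 2))
```

On the study's own model, `durbin_watson(reg_model.resid)` would give the value reported in the regression summary above.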
The significance of the entire regression model is determined using the F-statistic.
The degrees of freedom (DoF) for the regression model equal the number of independent
variables (k), which is 8.
From the coefficient table we can find the values of the coefficients (β-values) for the
corresponding independent variables.
anova_lm(reg_model)
The probability values of all the variables are nearly equal to zero. Considering a
95% confidence level, we can say that all the variables are statistically significant for the
analysis and have an impact on the regression model.
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif
   VIF         variable
0  119.844731  Intercept
1  1.628989    bedrooms
2  2.55807     bathrooms
3  6.918588    sqft_living
4  1.049917    sqft_lot
5  1.155793    view
6  1.081997    condition
7  2.841084    grade
8  5.08775     sqft_above
The value of the Variance Inflation Factor (VIF) starts at 1 and has no upper limit. A general
rule of thumb for interpreting VIFs is as follows:
From the above result we can interpret that the variables bedrooms, bathrooms, sqft_lot, view,
condition and grade have low correlation with the other independent variables, while
sqft_living and sqft_above have moderate correlation with the other
independent variables.
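The VIF computation whose imports appear above is typically done with `dmatrices` and `variance_inflation_factor` as follows. A sketch on invented toy data (on the study's data it would be run against `housedata` with the full regression formula):

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 500
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a * 0.9 + rng.normal(scale=0.5, size=n),  # correlated with a -> higher VIF
    "c": rng.normal(size=n),                       # independent -> VIF near 1
})
toy["y"] = toy["a"] + toy["b"] + toy["c"] + rng.normal(size=n)

# Build the design matrix from a formula, then compute one VIF per column
_, X = dmatrices("y ~ a + b + c", data=toy, return_type="dataframe")
vif = pd.DataFrame({
    "variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)
```

Each VIF is 1/(1 - R²) from regressing that column on all the others, which is why the correlated predictor's VIF is inflated while the independent one stays near 1.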
4.5. Data Visualization
Software Used:- MS Power BI
Purpose:- This analysis helps in understanding how different variables impact the house price.
The above line diagram shows how the ‘price of houses’ fluctuates with the year the house
was built (‘yr_built’).
The above data dashboard shows how the ‘average price’ of houses changes with different
variables like ‘floors’, ‘condition’, ‘grade’ and ‘view’.
The above two line diagrams show how the 'average price’ of a house has been changing for
variables like ‘sqft_living’ and ‘sqft_above’.
CHAPTER 5
RESULT & DISCUSSION
The aim of this study is to predict house prices in King County, USA. Secondary data
has been collected from Kaggle and analyzed using analytical tools. Different
variables relevant to house price prediction have been considered to meet the objective of the study,
such as bedrooms (which is the most dependent feature), bathrooms, sqft_living (the second most
important feature), sqft_lot, etc.
Factor Analysis has been performed on the data to understand the most important
factors that have a direct impact on house price.
Exploratory factor analysis has been used in this study to reduce the large
number of variables to a few factors.
The correlation table shows the relations between the considered variables; correlation values
of more than 0.500 indicate strong correlation between the variables.
Here the KMO value is 0.816 which is highly acceptable. So we can say sampling adequacy
has been achieved.
The Sig. value in Bartlett's Test of Sphericity is 0.000 which is less than 0.05 and indicates
that the model is statistically significant.
An extraction communalities value of more than 0.400 indicates a substantial contribution of
that variable to the factor analysis.
Scree plot shows that factor 4 onwards have an eigenvalue of less than 1, so only 3 factors
have been retained.
‘Multiple linear regression and correlation’ analysis helps in understanding the relations
between the variables considered in this study. To predict the sale price of a house,
we have used Multiple Linear Regression analysis.
The output of the analysis shows that no multicollinearity exists among the
considered variables. Here, the R-square value is 0.590, which signifies that 59% of the
variation in the dependent variable, ‘house price’, is explained by the independent variables.
The F-test from the ANOVA table shows that at 95% level of confidence, the model is
statistically significant.
From the coefficient table the multiple regression model can be derived as –
Ŷ = -681800 - 36630 * bedroom - 14990 * bathroom + 217.64 * sqft_living - 0.33 *
sqft_lot + 87550 * view + 54020 * condition + 100700 * grade - 25.74 * sqft_above
The above equation helps in predicting the house price with the change of the values of the
independent variables.
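Plugging hypothetical feature values into the fitted equation gives a price estimate. A sketch using the coefficients reported above (the example house is invented for illustration):

```python
def predict_price(bedrooms, bathrooms, sqft_living, sqft_lot,
                  view, condition, grade, sqft_above):
    """Evaluate the fitted multiple regression equation from the study."""
    return (-681800
            - 36630 * bedrooms
            - 14990 * bathrooms
            + 217.64 * sqft_living
            - 0.33 * sqft_lot
            + 87550 * view
            + 54020 * condition
            + 100700 * grade
            - 25.74 * sqft_above)

# Hypothetical house: 3 bed, 2 bath, 2000 sqft living, 5000 sqft lot,
# no view, condition 3, grade 7, 1800 sqft above ground
print(round(predict_price(3, 2, 2000, 5000, 0, 3, 7, 1800)))  # -> 432588
```

Each coefficient quantifies the expected change in price per unit change of its variable, holding the others fixed, which is what makes the equation usable for this kind of what-if estimation.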
CHAPTER 6
CONCLUSION
Recent real estate statistics prove beyond doubt that property valuations have taken a turn for
the better. The real estate sector definitely is on the rise, with the growth thrust being
provided by important factors such as demographics, interest rates, location and the state of
the economy, which affect the prices of property in the country. Correct insights with regard to
the right time for the purchase of property, price escalations, recessions in the real estate market
and other indicators help in making valuable purchase decisions. The housing market is
influenced by the state of the economy, interest rates, real income and changes in the size of
the population. As well as these demand-side factors, house prices will be determined by
available supply. With periods of rising demand and limited supply, we will see rising house
prices, rising rents and increased risk of homelessness.
As data acquisition is considered one of the biggest difficulties in the study of housing prices,
the present study uses the data collected from Kaggle on house prices in
King County, USA, which contributes to the methodology of research on housing prices.
These open network data offer opportunities to acquire more diverse types and a much
greater number of samples with greater timeliness, thus providing more valuable findings
for the quantitative spatial analysis of different regions.
From the planning perspective, the present study chooses various factors that may influence
housing price, so as to make clear the relationship between the influence factors and housing
prices as well as their variations in space.
This project demonstrates how statistical analysis can be utilized to better analyse
investment. With an application of ‘Multiple Linear Regression’ and ‘Factor Analysis’, the
contribution of each price determinant to the overall price of a house can be predicted and
the important underlying factors understood. The analysis thus aids the decision-making
process. In this case, potential investors or developers will be able to identify the important
factors to be taken into consideration when developing or buying a house. As the
contribution of each variable can be quantified, it is possible to determine the significance
of each variable. The application of multiple regression analysis to a house dataset explains
or models variation in house price, which demonstrates a good example of the strategic
application of mathematical tools to aid analysis, and hence decision making, in property
investment.
CHAPTER 7
LIMITATIONS OF THE STUDY
CHAPTER 8
In today’s real estate world, it has become tough to store such huge data and to extract it for
one’s own requirements, and the extracted data should also be useful. The system makes optimal
use of the linear regression algorithm and uses such data in the most
efficient way. The linear regression algorithm helps satisfy customers by increasing the
accuracy of estate choice and reducing the risk of investing in an estate. Many features
could be added to make the system more widely acceptable. More factors that affect house
prices, such as recession, shall be added. In-depth details of every property will be added to
provide ample details of a desired estate. This will help the system to run on a larger level.
CHAPTER 9
REFERENCES
iii. “Real estate value prediction using multivariate regression models” by R Manjula,
Shubham Jain, Sharad Srivastava and Pranav Rajiv Kher ; November,2017
iv. ‘A hybrid regression technique for house price prediction’ by Lu, Li and Yang, 2017
v. ‘The logistic lasso and ridge regression in predicting corporate failure’ by Jose
Manuel Pereira, Mario Basto and Amelia Ferreira da Silva, 2016
vi. ‘Comparison of artificial neural network and multiple linear regression for prediction
of live weight in hair goats’, YYU J. Agric. Sci., 2017
vii. ‘The comparison of methods of ANN with linear regression using specific variables’
by Reza Gharoie Ahangar, Mahmood Yahyazadehfar and Hassan Pournaghshband,
2010 .
viii. ‘An empirical analysis of the price development on the Swedish housing market’by
Nils Landberg, 2015.
ix. ANDERSON, SWEENEY, WILLIAMS. Statistics for Business & Economics
x. P SASHIKALA, adapted. 2019. Business Analytics. Delhi: Cengage Learning India
Private Limited.
xi. P SASHIKALA, adapted. 2020. Advanced Business Analytics. Delhi: Cengage
Learning India Private Limited.
xii. CHURCHILL, IACOBUCCI, ISRAEL. Marketing Research: A South Asian
Perspective. Cengage Learning.
xiv. Fundamentals data analysis & decision making models – theory. [Video] Instructed
by Manish Gupta. Udemy.
xv. wellbeing@school. Understanding and interpreting box plots. [Online]. Available
from: wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-
box-plots [Accessed 20 November 2020].
xvi. Priya Chetty. 2015; “Interpretation of factor analysis using SPSS”; Available on:
projectguru.in; [Accessed 4th January, 2021]
CHAPTER 10
APPENDICES