
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/354066581

House Price Prediction Using Machine Learning

Article in The Journal of Philosophy Psychology and Scientific Methods · June 2021




International Journal of All Research Education and Scientific Methods (IJARESM), ISSN: 2455-6211
Volume 9, Issue 6, June -2021, Impact Factor: 7.429, Available online at: www.ijaresm.com

House Price Prediction Using Machine Learning


Robart Konwar1, Angshuman Kakati2, Bhagyashree Das3, Divyansh Borah Shah4,
Dr. Monoj Kumar Muchahari5
1,2,3,4 Department of Information Technology, Kaziranga University
5 Guide, Department of Information Technology, Kaziranga University


ABSTRACT

The relationship between house prices and the economy is an important motivating factor for predicting house
prices. A property’s value is important in real estate transactions. Housing price trends not only concern
buyers and sellers but also indicate the current economic situation. Therefore, it is important to predict housing
prices without bias to help both buyers and sellers make their decisions.

INTRODUCTION

Accurate house price prediction is important to prospective homeowners, developers, investors, appraisers, tax
assessors and other real estate market participants, such as mortgage lenders and insurers. The development of
civilization steadily increases the demand for houses. Accurate prediction of house prices has always fascinated
buyers, sellers and bankers alike. A housing market can be understood as any market for properties that are
negotiated either directly between owners and buyers or through the services of real estate brokers. People and
companies are drawn to this market, which presents many profit opportunities arising from housing demand
worldwide. This demand is influenced by several factors, such as demography, economy, and politics. For this
reason, machine learning, which develops algorithms, builds models from data and uses them to make predictions
on new data, is well suited to the task. Buying a house is one of the most significant decisions a family makes,
as it can consume their entire savings and sometimes leaves them in debt. Predicting house prices accurately is a
difficult task. Our proposed model aims to predict house prices as accurately as possible.
For house price prediction, there are many useful regression algorithms, for example support vector machines
(SVM), Random Forest, Linear Regression and XGBoost. We investigate and explore them in our Experimental
Analysis.

For our research project, we have considered Boston as our primary location and predict real-time house prices for
various localities in and around Boston. We have used parameters such as 'square feet area', 'No. of Bedrooms',
'No. of Bathrooms', 'Type of Flooring', 'Lift availability', 'Parking availability' and 'Furnishing condition'. We
have used a verified dataset from Kaggle. The algorithms explained below are applied in various combinations, and
the result for each algorithm is reported as an accuracy percentage.

Related Work
Many works have focused on training models to predict house prices for a particular region, and their authors
have used different machine learning algorithms for prediction.

Bruno Klaus de Aquino Afonso, Luckeciano Carvalho Melo, Willian Dihanster Gomes de Oliveira, Samuel Bruno da
Silva Sousa and Lilian Berton [10] analyzed a dataset composed of 12,223,582 housing advertisements, collected
from Brazilian websites from 2015 to 2018. Each instance comprises twenty-four features of five different data
types: integer, date, string, float, and image. They built an ensemble of two different machine learning
architectures, based on Random Forest (RF) and Recurrent Neural Networks (RNN), to predict property prices. Their
study demonstrates that enriching the dataset and combining different ML approaches can be a better alternative
for predicting housing prices in Brazil.

In 2017, Sifei Lu, Zengxiang Li, Zheng Qin, Xulei Yang and Rick Siow Mong Goh [1] examined creative feature
engineering. They proposed a hybrid Lasso and Gradient Boosting regression model, which helped them obtain better
predictions. They used Lasso for feature selection to remove unused features. To improve the prediction
performance, they carried out many iterations of feature engineering to find the optimal number of features. In
their study, they started from 79 features from the Kaggle website. They then found that adding more features
improved the evaluation score, and hence added 400 features on top of the given ones.
Li Yu, Chenlu Jiao, Hongrun Xin, Yan Wang and Kaiyang Wang [8] proposed building different prediction models
based on deep learning to predict housing prices. Their proposed prediction models fall into two categories: the
first is based on multiple characteristic factors of the real estate, and the second is a time series model. In
the implementation, they built three kinds of models, namely logistic regression, Convolutional Neural Network
(CNN) and Long Short-Term Memory (LSTM), to predict from a variety of characteristics of the real estate.

They also observed that housing prices are related to the time factor, and accordingly proposed an LSTM-1 model
based purely on the time series, along with the Auto-Regressive and Moving Average (ARMA) model.

Motivation
The relationship between house prices and the economy is an important motivating factor for predicting house
prices. Therefore, it is important to predict housing prices without bias to help both buyers and sellers make
their decisions. This project is proposed to predict house prices and to obtain better and more accurate results.
In this project, various algorithms are applied to see which one gives the most accurate and precise results.

This would be of great help to people, because house pricing is a topic that concerns many citizens, whether rich
or middle class, as one can never judge or estimate the price of a house on the basis of locality or available
facilities alone. To fulfil this task, the Python programming language is used.

METHODOLOGY

In this part, we describe the details of creative feature engineering, and describe how to apply multiple regression
algorithms.

First, the following feature names were added to the dataframe:
• CRIM: per capita crime rate by town
• ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
• INDUS: proportion of non-retail business acres per town
• CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• NOX: nitric oxides concentration (parts per 10 million)
• RM: average number of rooms per dwelling
• AGE: proportion of owner-occupied units built prior to 1940
• DIS: weighted distances to five Boston employment centres
• RAD: index of accessibility to radial highways
• TAX: full-value property-tax rate per 10,000 USD
• PTRATIO: pupil-teacher ratio by town
• B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT: % lower status of the population
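As a minimal sketch of this step, assuming the raw data is available as a plain numeric array with one column per feature plus the price, the names could be attached with pandas as follows. The random values are placeholders rather than the real data, and the target column name MEDV is an assumption that does not appear in the list above.

```python
import numpy as np
import pandas as pd

# The 13 feature names listed above.
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
# Assumed name for the price column (hypothetical, not given in the text).
TARGET_NAME = "MEDV"

# Placeholder values standing in for the real dataset.
rng = np.random.default_rng(0)
raw = rng.random((20, len(FEATURE_NAMES) + 1))

# Attach the column names so later steps can refer to features by name.
df = pd.DataFrame(raw, columns=FEATURE_NAMES + [TARGET_NAME])
print(df.head())
```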

Then the shape, data types and number of unique values of the dataset were identified, and the number of missing
values was found to be 0. The correlation between the features was computed and plotted as a heatmap (Fig 1). The
target variable and the independent variables were then split into training and testing data. The data was now
ready for the different algorithms. The models were evaluated with the following criteria:
• R^2: It is a measure of the linear relationship between X and Y. It is interpreted as the proportion of the
variance in the dependent variable that is predictable from the independent variables.
• Adjusted R^2: The adjusted R-squared compares the explanatory power of regression models that contain
different numbers of predictors.
• MAE: The mean absolute error is the mean of the absolute values of the errors. It measures the difference
between two continuous variables, here the actual and predicted values of y.
• MSE: The mean square error is like the MAE, but squares each difference before averaging instead of taking
the absolute value.
• RMSE: The root mean square error is the square root of the MSE, which expresses the error in the same units as
the target variable.
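The criteria above can be sketched in a few lines of plain Python. This is an illustrative implementation under the standard textbook definitions, not necessarily the code used in the experiments; the function name `regression_metrics` and the toy values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return R^2, adjusted R^2, MAE, MSE and RMSE for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean squared error
    rmse = math.sqrt(mse)                      # same units as the target

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    # Adjusted R^2 penalises the score for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"R2": r2, "Adjusted R2": adj_r2, "MAE": mae, "MSE": mse, "RMSE": rmse}

# Toy example with four observations and one predictor.
scores = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0], n_features=1)
print(scores)
```

Because the adjusted R^2 depends on both the sample size and the number of predictors, it is only defined when there are more observations than predictors plus one.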


Fig 1: Heatmap of correlation between the features
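The correlation matrix behind a heatmap of this kind, together with the split into training and testing data, could be computed along the following lines. This is a sketch on synthetic data: the three columns are an assumed subset, the 30% test share is an assumption, and the split is done with a plain NumPy permutation rather than any particular library helper.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing table (values are not real data).
rng = np.random.default_rng(42)
n = 200
rm = rng.normal(6.0, 0.7, n)        # rooms per dwelling
lstat = rng.normal(12.0, 5.0, n)    # % lower status of the population
price = 5.0 * rm - 0.5 * lstat + rng.normal(0.0, 2.0, n)
df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "MEDV": price})

# Basic inspection: shape, distinct values per column, total missing values.
print(df.shape)
print(df.nunique())
print(int(df.isnull().sum().sum()))

# Pearson correlation matrix; this is the table a heatmap would colour.
corr = df.corr()
print(corr.round(2))

# Separate the target from the features, then hold out 30% of rows for testing.
X = df.drop(columns="MEDV").to_numpy()
y = df["MEDV"].to_numpy()
idx = rng.permutation(n)
n_test = int(0.3 * n)
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape)
```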

Model training and testing


To train each model, the required libraries were imported for the algorithm, and a regressor was created and
trained on the training data. The model was then evaluated with the criteria mentioned above. For clarity, the
difference between the actual prices and the predicted values was visualized by plotting a graph, and the
residuals were checked. For an algorithm to be accurate its errors should be small and free of systematic
patterns, so the normality of the errors was checked by plotting them in separate graphs.

To test the model, the test data is used. Each model was used to predict the test data, and the model evaluation
was then carried out once again. The whole process was repeated for all the models. Finally, the evaluations of
all the models were collected and the models were compared using the R-squared score.
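The train, evaluate, test and compare loop described above might look like the following sketch, assuming scikit-learn is the library in use (the text does not name it). The data is synthetic, the hyperparameters are library defaults or arbitrary choices, and XGBoost is left out because it lives in a separate package; if that package is installed, `xgboost.XGBRegressor` could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression data standing in for the housing features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.3, size=300)

# Hold out 30% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVM": SVR(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                              # train on the training data
    train_r2 = r2_score(y_train, model.predict(X_train))     # evaluation after training
    test_r2 = r2_score(y_test, model.predict(X_test))        # evaluation after testing
    results[name] = (train_r2, test_r2)

# Rank the models by their R-squared score on the test data.
for name, (train_r2, test_r2) in sorted(
    results.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"{name:18s} train R^2 = {train_r2:.3f}  test R^2 = {test_r2:.3f}")
```

Reporting the training and testing scores side by side makes it easier to spot a model that fits the training data much better than unseen data.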

Experimental Analysis
After processing the different features, we started running the models to predict housing prices on the Boston
properties dataset. Taking the training and testing datasets into account, the results were analysed and are as
follows.


Linear Regression
After the model was trained it was evaluated, and after testing the results were evaluated again. The following
results were found before and after testing.

Fig 2: Evaluation scores after training. Fig 3: Evaluation scores after testing.

While visualizing the differences between the actual prices and the predicted values, the following graph was found:

Fig 4: Actual prices vs Predicted prices.

While checking the residuals, the following graph was found:

Fig 5: Predicted prices vs Residuals.


Looking at the above graph, we concluded that no pattern is visible in the plot and the values are distributed
evenly around zero, so the linearity assumption is satisfied.
We also checked the normality of the errors in the model with a histogram and found that the residuals are
normally distributed, so the normality assumption is satisfied.
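The two residual checks can also be expressed numerically. The sketch below uses synthetic data and a NumPy least-squares fit as a stand-in for the trained model; the sample skewness is used here as a rough numeric counterpart to inspecting the histogram, which is an assumption rather than a step taken from the text.

```python
import numpy as np

# Synthetic data with a truly linear relationship and Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 500)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, 500)

# Ordinary least squares fit with an intercept column.
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_pred = A @ coef
residuals = y - y_pred

# Linearity check: residuals should scatter around zero with no trend
# against the predicted values.
mean_residual = residuals.mean()
trend = np.corrcoef(y_pred, residuals)[0, 1]
print("mean residual:", mean_residual)
print("corr(predicted, residual):", trend)

# Normality check: histogram counts (what the plot would show) and the
# sample skewness, which is close to zero for a symmetric distribution.
counts, edges = np.histogram(residuals, bins=10)
centered = residuals - mean_residual
skewness = np.mean(centered ** 3) / residuals.std() ** 3
print("histogram counts:", counts)
print("skewness:", skewness)
```

These numbers summarise the plots rather than replace them, since a residual plot can reveal curved or funnel-shaped patterns that a single correlation value would miss.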

Fig 6: Normality of errors.

Random Forest Regressor


Here also we evaluated the training and testing data for the algorithm and we found the following results.

Fig 7: Evaluation scores after training. Fig 8: Evaluation scores after testing.

While visualizing the differences between the actual price and predicted values the following graph is found

Fig 9: Actual prices vs Predicted prices.


While checking the residuals the following graph was found

Fig 10: Predicted prices vs Residuals.

Looking at the above graphs, we concluded that no pattern is visible in this plot either and the values are
distributed evenly around zero, so the linearity assumption is satisfied.

SVM Regressor
Here also we evaluated the training and testing data for the algorithm and we found the following results:

Fig 11: Evaluation scores after training. Fig 12: Evaluation scores after testing.

While visualizing the differences between the actual price and predicted values the following graph is found:

Fig 13: Actual prices vs Predicted prices.


While checking the residuals the following graph was found:

Fig 14: Predicted prices vs Residuals.

Looking at the above graphs, we concluded that no pattern is visible in this plot either and the values are
distributed evenly around zero, so the linearity assumption is satisfied here as well.

XGBoost Regressor
Finally, for the last model we also evaluated the training and testing data for the algorithm and found the
following results.

Fig 15: Evaluation scores after training. Fig 16: Evaluation scores after testing.

While visualizing the differences between the actual price and predicted values the following graph is found

Fig 17: Actual prices vs Predicted prices.


While checking the residuals the following graph was found

Fig 18: Predicted prices vs Residuals.

Looking at the above graphs, we concluded that no pattern is visible in this plot either and the values are
distributed evenly around zero, so the linearity assumption is satisfied here as well.

Evaluation and Comparison
Looking at the above results, we found that all the models satisfy the linearity assumption.

All the models were then compared using the R-squared score, giving the following table:

Fig 19: R-squared Score.

XGBoost has the highest R-squared score, 2 more than the second-highest score, which was obtained by Random
Forest.

CONCLUSION AND FUTURE WORK

In the present real estate world, it has become difficult to store huge amounts of information and extract what
one needs from it. Likewise, the extracted information ought to be useful. In summary, this paper seeks useful
models for house price prediction. It also provides insights into the Boston housing market. First, the data is
collected from Kaggle. Then data cleaning is carried out to remove errors from the data, followed by data
pre-processing, so that the original data is transformed into a cleaned dataset ready for analysis. Next, with
the help of data visualization, different plots are created. Finally, the training and testing of the models are
performed.


From our experimental results, in which XGBoost obtained the highest R-squared score, we conclude that XGBoost
provides more accuracy than Random Forest, SVM and Linear Regression.

In future, more algorithms can be applied to this dataset, such as decision tree, Naïve Bayes, kNN and K-Means, to
find their respective accuracies and use them to obtain better predictions.
Classification algorithms can also be applied to our house pricing dataset to see whether they are suitable. The
accuracy and precision of these algorithms can also be improved according to our needs. This would be of great
help to people, as they would have a variety of options to choose from. They could choose the house that best
suits their budget, so that they do not have to take a loan from a bank. In the future, an application or a
website can also be developed for house price prediction, which would make it even easier for people to select
the houses that best suit their budgets.

REFERENCES

[1] Sifei Lu, Zengxiang Li, Zheng Qin, Xulei Yang, Rick Siow Mong Goh. A Hybrid Regression Technique for House Prices Prediction.
[2] Ayush Varma, Abhijit Sarma, Sagar Doshi, Rohini Nair. House Price Prediction Using Machine Learning and Neural Networks.
[3] Housing Price Prediction Using Machine Learning Algorithms: The Case of Melbourne City, Australia.
[4] Parasich Andrey Viktorovich, Parasich Viktor Aleksandrovich, Kaftannikov Igor Leopoldovich, Parasich Irina Vasilevna. Predicting Sales Prices of the Houses Using Regression Methods of Machine Learning.
[5] Debanjan Banerjee, Suchibrota Dutta. Predicting the Housing Price Direction Using Machine Learning Techniques.
[6] Mansi Jain, Himani Rajput, Neha Garg, Pronika Chawla. Prediction of House Pricing Using Machine Learning with Python.
[7] Nehal N Ghosalkar, Sudhir N Dhage. Real Estate Value Prediction Using Linear Regression.
[8] Li Yu, Chenlu Jiao, Hongrun Xin, Yan Wang, Kaiyang Wang. Prediction on Housing Price Based on Deep Learning.
[9] Zhongyun Jiang, Guoxin Shen. Prediction of House Price Based on the Back Propagation Neural Network in the Keras Deep Learning Framework.
[10] Bruno Klaus de Aquino Afonso, Luckeciano Carvalho Melo, Willian Dihanster Gomes de Oliveira, Samuel Bruno da Silva Sousa, Lilian Berton. Housing Prices Prediction with a Deep Learning and Random Forest Ensemble.
[11] Feng Wang, Yang Zou, Haoyu Zhang, Haodong Shi. House Price Prediction Approach Based on Deep Learning and ARIMA Model.
[12] Matthew Veres, Griffin Lacey, Graham W. Taylor. Deep Learning Architectures for Soil Property Prediction.
