
A. N. Bharathi et al.; International Journal of Advance Research, Ideas and Innovations in Technology

ISSN: 2454-132X
Impact factor: 4.295
(Volume 5, Issue 1)
Available online at: www.ijariit.com
Predicting housing prices using advanced regression techniques
Bharathi A. N., bharathinandhees1997@gmail.com, KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu
Dr. N. Yuvaraj, drnyuvaraj@gmail.com, KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu
Dhivya B., dhivyakrishnan1998@gmail.com, KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu

ABSTRACT

House prices increase every year, so there is a need for a system that predicts house prices in the future. House price prediction can help a developer determine the selling price of a house, and it can help a customer choose the right time to purchase one. The price of a house is influenced by several factors, including physical condition, concept and location, and prices vary from place to place and between communities. There are various techniques for predicting house prices; one efficient approach is regression. Regression is a reliable method of identifying which variables have an impact on a topic of interest. Random forests are very accurate and robust to over-fitting. Performing a regression makes it possible to determine with confidence which factors matter most, which factors can be ignored and how the factors influence each other. The main objective is to use an advanced methodology for prediction.

Keywords: House prices, Regression, Price prediction, Lasso regression

1. INTRODUCTION
Investment is one of the business activities that most people are interested in during this era of globalization. Several objects are often used for investment, for example gold, stocks and property [1]. In determining the price of a home, the developer must carefully calculate and choose an appropriate method, as property prices tend to increase continuously and almost never fall in the long or short term [2].

Prediction analysis is one of several approaches that can be used to determine the price of a house. The challenge is to get a result as close as possible to the actual price from the model built. The price of a specific house is determined by location, size, house type, city, country, tax rules, economic cycle, population movement, interest rate and many other factors that could affect demand and supply. For local house price prediction, there are many useful regression algorithms. Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. One of the main advantages of regression-based prediction techniques is that they use research and analysis to predict what is likely to happen in the next quarter, the next year or even farther into the future. For small-business owners, regression-based forecasting can provide insight into how higher taxes, changes in consumer spending or shifts in the local economy may affect the business.

Regression and forecasting techniques can lend a scientific angle to managing small businesses, reducing large amounts of raw data to actionable information. The dataset used here has a training set of 1460 houses (i.e., observations), accompanied by 79 attributes (i.e., features, variables, or predictors) and the sale price of each house. The testing set includes 1459 houses with the same 79 attributes, but the sale price is not included, as this is our target variable. In this paper, the proposed house price prediction is based on the random forest algorithm.

2. LITERATURE SURVEY
A study [3] examined housing prices in the City of Savannah, Georgia, using the hedonic pricing model. The paper's data contains 2,888 single-family houses for the period between 2000 and 2005. It shows that the log price of houses is positively and significantly correlated with the number of bathrooms, bedrooms, fireplaces, garage spaces and stories, and with the total square feet of the house. Additionally, the paper adds three dummy variables, May, June and July, to account for the seasonal factor in house prices. If the house is sold in May, the variable May is set to 1, and to 0 otherwise. The other variables, June and July, are constructed in a similar fashion. The paper finds that the log sale prices of houses are significantly and positively correlated with May and July, while June is insignificant. This implies that houses whose sales close in May or July tend to have a higher price.

The social and economic impact of housing in the Scottish countryside has also been examined. Investment in housing finance impacts the economy directly and indirectly. Employment, GDP, productivity and many other important factors are
© 2019, www.IJARIIT.com All Rights Reserved Page | 370
impacted by housing finance investment. The study revealed that housing is an important indicator for increasing the wealth of nations. It concluded that the objective of Scottish housing policy is to improve the quality standard of housing as well as to increase investment in the household sector.

In research [8] it was found that, at a significance level of 0.05, all five variables in the regression model (Floor, Heating System, Earthquake Zone, Rental Value and Land Value) have a significant impact on the dependent variable, Value. Land value and rental value have the highest impact on housing price, followed by existing floor, heating system and earthquake zone. Although the other variables were not found to be significant in the study, this can change with the sample size; if the sample size increases, re-estimating the regression model is recommended for further studies. The application of multiple regression analysis to a house data set explains or models the variation in house price, and it demonstrates the strategic application of a mathematical tool to aid analysis, and hence decision making, in property investment.

The study in [5] (2010) uses support vector machine (SVM) regression to forecast housing prices in China between 1993 and 2002 and in a district of Tangshan city between 2000 and 2002. The paper uses a genetic algorithm to tune the hyper-parameters of the SVM regression model. The error scores of the SVM regression model for both China and the Tangshan district are lower than 4%. This indicates that the SVM regression model performs well in forecasting housing prices in China.

In Singapore's housing market, a decision tree model (2006) is used to study the effects of housing characteristics on prices [6]. The paper concludes that the owners of 2-room to 4-room flats are more concerned with the flats' basic characteristics, such as model type and age, than the owners of 5-or-more-room flats. Moreover, owners of executive flats care more about service characteristics, such as neighbourhood location and recreational facilities, than about basic housing characteristics.

In a 2014 study [7], relationships between various home characteristics and the asking price of a residential property were analyzed using both simple linear regression and multiple linear regression, estimated by ordinary least squares. Home square footage was used as the explanatory variable in the simple linear regression, and the multiple linear regression added land size, number of bedrooms, year of construction and other explanatory variables. The multiple linear regression results revealed the bias caused by omitting crucial factors from the simple linear regression. Home square footage was found to be the most important factor in determining residential property price, while garage capacity proved to be the weakest factor.

Many previous studies find empirical evidence supporting significant interrelations between house price and various economic variables, such as income, interest rates, construction costs and labor market variables [8][9][10].

3. METHODS AND MATERIALS
There are various kinds of regression techniques available to make predictions [11]. The techniques are mostly driven by three metrics (the number of independent variables, the type of dependent variable and the shape of the regression line), which are given in figure 1. Various algorithms used for predicting housing prices are listed below.

Fig. 1: Metrics of regression

3.1. Hedonic Pricing Model
Hedonic price theory assumes that a commodity such as a house can be viewed as an aggregation of individual components or attributes [12]. It is frequently used to measure a property's price. The hedonic pricing model combines both the internal characteristics of a house (such as the number of bedrooms and the number of bathrooms) and its external characteristics (such as the neighbourhood's walkability score and public schools' scores) to estimate its value. Hedonic pricing can be implemented using regression models. Equation 1 shows the regression model used to determine a price.

$$y = a \cdot x_1 + b \cdot x_2 + \cdots + n \cdot x_i \qquad (1)$$

Here y is the predicted price and x_1, x_2, ..., x_i are the attributes of a house, while a, b, ..., n are the regression coefficients of each variable in the determination of house prices. While the hedonic technique is an acceptable method for accommodating attribute differences in a house price determination model, it is generally unrealistic to treat the housing market in any geographical area as a single unit. Therefore, it seems more reasonable to introduce geographical information or a location factor into the model, which allows shifts in the house price level.

3.2. Artificial Neural Network Model
The use of the neural network model is similar to the process used in building the hedonic price model. However, the neural network [13] must first be trained on a set of data. For a particular input, the model produces an output (the estimated house price). The model then compares this output to the actual output (the actual house price). The accuracy is measured by the total mean square error, and backpropagation is then used to reduce prediction errors by adjusting the connection weights. The performance [14] of the network can be influenced by the number of hidden layers and the number of nodes in each hidden layer. A trial-and-error process is applied to find the optimal artificial neural network model. It is far more complicated than many other models, such as decision trees and regression, and its weights are hard to interpret and understand.

4. PROPOSED METHODOLOGY
4.1. Dataset and Preprocessing
There are two different data sets, namely the train dataset and the test dataset. Both contain numerous variables, or features, describing a house. The training dataset contains 1460 observations for which the sale price of the house is provided. Based on this data, a prediction model is to be built. The test dataset contains 1459 observations for which the sale price has to be predicted. The 80 variables in total focus on the quality and quantity of many physical attributes of the property. Most of the variables are exactly the type of information that a typical
home buyer would want to know about a potential property. This study is based on the house price data of the Ames Housing dataset.

Some of the features of the dataset don’t have a linear relationship with the house price, such as ‘date’, ‘long’ and ‘lat’, which represent the date the house was sold, the longitude and the latitude of the house, respectively. These features should either be removed or modified. First, using ‘date’ (the date the house was sold) and ‘yr built’ (the year the house was built), we calculate the age of the building. Using the feature ‘yr renovated’ (the year the house was renovated), we create a new binary feature that represents whether the house was renovated at all. Although the zip code doesn’t have a linear relation with the price, it could carry useful information about the house price. Hence it is treated as a categorical feature. Next, the features ‘id’, ‘date’, ‘yr built’, ‘lat’, ‘long’, ‘date yr’ and ‘yr renovated’ are removed.

4.2. Lasso Regression
In machine learning and statistics, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

Lasso is a powerful regression technique. It works by penalizing the magnitude of the feature coefficients while minimizing the error between predicted and actual observations. Lasso is also known as the L1 regularization technique. The algorithm can be implemented with the help of Python’s scikit-learn library [15]. Lasso attempts to minimize a cost function, given as Cost(w) = RSS(w) + α · (sum of the absolute values of the weights). Here RSS refers to the ‘Residual Sum of Squares’, meaning the sum of the squared errors between the predicted and actual values in the training data set, and α is a coefficient that can take various values. There are three cases for the value of α:
1. α = 0; same coefficients as simple linear regression
2. α = ∞; all coefficients zero
3. 0 < α < ∞; coefficients between 0 and those of simple linear regression

The Lasso cost function can be written as

$$\mathrm{Cost}(w) = \sum_{i=1}^{N} \Big\{ y_i - \sum_{j=0}^{M} w_j x_{ij} \Big\}^2 + \alpha \sum_{j=0}^{M} |w_j|$$

The model can solve many of the challenges faced with linear regression and can be a very useful tool for fitting linear models. It is a better way to analyze data, capture relationships in the data and avoid over-fitting.

4.3. House Price Affecting Factors
There are several factors that affect house prices. In research [16] the factors affecting the house price are divided into three main groups: physical condition, concept and location. Physical conditions are properties of a house that can be observed by human senses, including the size of the house, the number of bedrooms, the availability of a kitchen and garage, the availability of a garden, the area of land and buildings, and the age of the house [17]. The concept is an idea offered by developers to attract potential buyers, for example a minimalist home, a healthy and green environment, or an elite environment. Location is an important factor in shaping the price of a house, because the location determines the prevailing land price [18]. In addition, the location determines the ease of access to public facilities, such as schools, campuses, hospitals and health centres, as well as to family recreation facilities such as malls and culinary tours, and it may even offer beautiful scenery [19], [20].

4.4. XGBoost
XGBoost has become a widely used and popular tool among Kaggle competitors and data scientists in industry, as it has been battle-tested for production on large-scale problems. It is a highly flexible and versatile tool that can handle most regression, classification and ranking problems as well as user-built objective functions. As open-source software, it is easy to access and may be used through different platforms and interfaces. The portability and compatibility of the system permit its use on Windows, Linux and OS X. It also supports training on distributed cloud platforms such as AWS, Azure and GCE, among others, and it is easily connected to large-scale cloud dataflow systems such as Flink and Spark. Although it was built and initially used through the Command Line Interface (CLI) by its creator, it can also be loaded and used in various languages and interfaces such as Python, C++, R, Julia, Scala and Java.

XGBoost is an accurate and scalable implementation of gradient boosting machines. Its name stands for eXtreme Gradient Boosting; it was developed by Tianqi Chen and is now part of a wider collection of open-source libraries developed by the Distributed Machine Learning Community (DMLC). It has been shown to push the limits of computing power for boosted tree algorithms, as it was built and developed for the sole purpose of computational speed and model performance. Specifically, it was engineered to exploit every bit of memory and hardware resources for tree boosting algorithms.

The implementation of XGBoost offers several advanced features for model tuning, computing environments and algorithm enhancement. It is capable of performing the three main forms of gradient boosting (Gradient Boosting (GB), Stochastic GB and Regularized GB), and it is robust enough to support fine-tuning and the addition of regularization parameters. According to Tianqi Chen, the latter is what makes it superior to and different from other libraries. System-wise, the library’s portability and flexibility allow the use of a wide variety of computing environments, such as parallelization of tree construction across several CPU cores, out-of-core computing, distributed computing for large models, and cache optimization to improve hardware usage and efficiency.

The algorithm was developed to reduce computing time efficiently and to make optimal use of memory resources. Important features of the implementation include the handling of missing values (sparse-aware), a block structure to support parallelization in tree construction, and the ability to fit and boost on new data added to a trained model. It holds various methodologies and steps in the prediction method.

5. WORKING MODEL

Fig. 2: Steps involved for prediction
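As a minimal sketch of the Lasso cost function from Section 4.2, which the modelling step of this pipeline relies on, the following pure-Python code minimizes RSS plus α times the sum of absolute coefficient values by cyclic coordinate descent. It is an illustration under stated assumptions, not the implementation used in this paper: the function names (soft_threshold, lasso_cost, lasso_fit), the toy data and the omission of an intercept term are all assumptions made here for brevity. In practice a library routine such as the scikit-learn Lasso estimator would normally be used; note that its alpha is scaled differently from the α in this cost function.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values inside [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cost(X, y, w, alpha):
    """Cost(w) = RSS(w) + alpha * sum(|w_j|), without an intercept term."""
    rss = sum((yi - sum(wj * xij for wj, xij in zip(w, row))) ** 2
              for row, yi in zip(X, y))
    return rss + alpha * sum(abs(wj) for wj in w)


def lasso_fit(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for the cost above (hypothetical helper)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_iter):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the residual that ignores j
            z = 0.0    # squared norm of feature j
            for row, yi in zip(X, y):
                pred_without_j = sum(w[k] * row[k]
                                     for k in range(n_features) if k != j)
                rho += row[j] * (yi - pred_without_j)
                z += row[j] ** 2
            # Exact one-dimensional minimizer of RSS + alpha * |w_j|.
            w[j] = soft_threshold(rho, alpha / 2.0) / z if z > 0 else 0.0
    return w


# Made-up toy data: y = 3*x1 - 2*x2 exactly; the third feature is irrelevant.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0], [1.0, 1.0, 0.5],
     [2.0, -1.0, 0.0], [-1.0, 2.0, 1.5]]
y = [3.0 * row[0] - 2.0 * row[1] for row in X]

w_ols = lasso_fit(X, y, alpha=0.0)    # case 1: no penalty
w_mid = lasso_fit(X, y, alpha=5.0)    # case 3: shrunken coefficients
w_zero = lasso_fit(X, y, alpha=1e6)   # case 2 in the limit: all zeros
```

With α = 0 the fit reproduces the ordinary least-squares coefficients, a very large α drives every coefficient to exactly zero, and intermediate values shrink the coefficients, which mirrors the three cases listed in Section 4.2.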
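The reason for concatenating the training and test data before text variables are replaced by dummy variables (step a) can be seen in a small sketch. The helper one_hot, the column names and the rows below are hypothetical and chosen only for illustration; with pandas, the get_dummies function would typically do the same job.

```python
def one_hot(rows, column):
    """Replace a text column by 0/1 dummy columns, one per category in `rows`."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Made-up rows; 'Neighborhood' stands in for any text variable.
train = [{"LotArea": 8450, "Neighborhood": "CollgCr"},
         {"LotArea": 9600, "Neighborhood": "Veenker"}]
test = [{"LotArea": 11622, "Neighborhood": "NAmes"},
        {"LotArea": 14267, "Neighborhood": "CollgCr"}]

# Encoding separately: each set only knows the categories it contains.
train_sep = one_hot(train, "Neighborhood")
test_sep = one_hot(test, "Neighborhood")

# Encoding after concatenation, then splitting back by position.
combined = one_hot(train + test, "Neighborhood")
train_enc, test_enc = combined[:len(train)], combined[len(train):]
```

When encoded separately, the two sets end up with different dummy columns, so a model fitted on one cannot be applied to the other; after concatenation both share the same columns.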
a) Reading data: At this stage, the data is read. The training data then needs to be concatenated with the test data. This is done mainly because of the presence of text variables, which will later be replaced by dummy variables. If the training and test sets were treated separately, they could end up with a different number of dummy variables each, which would in turn damage the prediction.

b) Data Preprocessing: This is the process of transforming the raw, complex data into systematic, understandable knowledge. It involves finding missing and redundant data in the dataset. The entire dataset is checked for NA values, and any observation containing NA is deleted. This brings uniformity to the dataset. Finally, the data has to be split into training and test data.

c) Data Analysis: Before applying any model to our dataset, we need to find out the characteristics of the dataset. Thus, we need to analyze the dataset and study the different parameters and the relationships between them. We can also find the outliers present in the dataset. Outliers occur due to some kind of experimental error and need to be excluded from the dataset.

d) Feature Engineering: Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed when these features are manipulated in some way. Below are just some examples of the features:
• Remodeled (categorical): Yes or No if Year Built is different from Year Remodeled; if the year the house was remodeled is different from the year it was built, the remodeling likely increases property value.
• Seasonality (categorical): Month Sold combined with Year Sold; while more houses were sold during the summer months, this likely varies across years, especially during the time period in which these houses were sold, which coincides with the housing crash.
• New House (categorical): Yes or No if Year Sold is equal to Year Built; if a house was sold the same year it was built, we might expect that it was in high demand and might have a higher Sale Price.
• Total Area (continuous): Sum of all variables that describe the area of different sections of a house; many variables pertain to the square footage of different aspects of each house, and we might expect the total square footage to have a strong influence on Sale Price.

e) Modelling: Model selection is the process of combining data and prior information to select among a group of statistical models. In building a model, decisions to include or exclude covariates, as well as uncertainty in how to code the covariates in the design matrix for any given model, are based on both the prior hypotheses and the data. Lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

6. CONCLUSION
In this paper, the LASSO regression technique was implemented to predict the price of a house. The step-by-step procedure to analyze the dataset and find the correlation between the parameters was described. Thus we can select the parameters that are not correlated with each other and are independent in nature, and this feature set is then given as input. Lasso performs both variable selection and regularization in order to enhance the prediction accuracy.

7. REFERENCES
[1] R. M. A. van der Schaar, “Analysis of Indonesian Property Market; Overview and Foreign Ownership,” Investment Indonesian, 2015.
[2] Y. Feng and K. Jones, “Comparing multilevel modelling and artificial neural networks in house price prediction,” 2015 2nd IEEE Int. Conf. Spat. Data Min. Geogr. Knowl. Serv., pp. 108–114, 2015.
[3] Richard J. Cebula, “The Hedonic Pricing Model Applied to the Housing Market of the City of Savannah and Its Savannah Historic Landmark District,” The Review of Regional Studies 39.1 (2009), pp. 9–22.
[4] Gang-Zhi Fan, Seow Eng Ong, and Hian Chye Koh, “Determinants of House Price: A Decision Tree Approach,” Urban Studies 43.12 (2006).
[5] Gu Jirong, Zhu Mingcang, and Jiang Liuguangyan, “Housing price based on genetic algorithm and support vector machine,” Expert Systems with Applications 38 (2011), pp. 3383–3386.
[6] Eric Slone, Haitian Sun, and Po-Hsiang Wang, “Market Prices of Houses in Atlanta,” 2014, https://smartech.gatech.edu/bitstream/handle/1853/51632/Market%20Prices%20of%20Houses%20in%20Atlanta.pdf
[7] P. Linneman, “An empirical test of the efficiency of the housing market,” Journal of Urban Economics 20 (1986): 140–154.
[8] J. M. Quigley, “Real estate prices and economic cycles,” International Real Estate Reviews 2: 1–20, 1999.
[9] K. Tsatsaronis and H. Zhu, “What drives housing price dynamics: Cross-country evidence?” BIS Quarterly Review, March.
[10] Luis Torgo and Joao Gama, “Regression using classification algorithms,” Intelligent Data Analysis 1.4 (1997): 275-2.
[11] Ezgi Candas, Seda Bagdatli Kalkan, and Tahsin Yomralioglu, “Determining the Factors Affecting Housing Prices,” FIG Working Week 2015, Sofia, Bulgaria, 17–21 May 2015.
[12] Muhammad A. Razi and Kuriakose Athappilly, “A comparative predictive analysis of neural networks (NNs), nonlinear regression and classification and regression tree (CART) models,” Expert Systems with Applications 29.1 (2005): 65–74.
[13] M. M. Lenk, E. M. Worzala, and A. Silva, “High-tech Valuation: Should Artificial Neural Networks Bypass The Human Valuer?”, Journal of Property Valuation & Investment, 15(1): 8–26, 1997.
[14] Fabian Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research 12.Oct (2011): 2825–2830.
[15] R. A. Rahadi, S. K. Wiryono, D. P. Koesrindartotoor, and I. B. Syamwil, “Factors influencing the price of housing in Indonesia,” Int. J. Hous. Mark. Anal., vol. 8, no. 2, pp. 169–188, 2015.
[16] V. Limsombunchai, “House price prediction: Hedonic price model vs. artificial neural network,” Am. J. …, 2004.
[17] D. X. Zhu and K. L. Wei, “The Land Prices and Housing Prices Empirical Research Based on Panel Data of 11 Provinces and Municipalities in Eastern China,” Int. Conf. Manag. Sci. Eng., no. 2009, pp. 2118–2123, 2013.
[18] S. Kisilevich, D. Keim, and L. Rokach, “A GIS-based decision support system for hotel room rate estimation and temporal price prediction: The hotel brokers’ context,” Decis. Support Syst., vol. 54, no. 2, pp. 1119–1133, 2013.
[19] C. Y. Jim and W. Y. Chen, “Value of scenic views: Hedonic assessment of private housing in Hong Kong,” Landsc. Urban Plan., vol. 91, no. 4, pp. 226–234, 2009.
