ISSN: 2454-132X
Impact factor: 4.295
(Volume 5, Issue 1)
Available online at: www.ijariit.com
Predicting housing prices using advanced regression techniques
Bharathi A. N.
bharathinandhees1997@gmail.com
KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu

Dr. N. Yuvaraj
drnyuvaraj@gmail.com
KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu

Dhivya B.
dhivyakrishnan1998@gmail.com
KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu
Some features of the dataset, such as ‘date’, ‘long’ and ‘lat’ (the date the house was sold and its longitude and latitude, respectively), do not have a linear relationship with the house price. These features should either be removed or modified. First, using ‘date’ (the date the house was sold) and ‘yr built’ (the year the house was built), we calculate the age of the building. Using the feature ‘yr renovated’ (the year the house was renovated), we create a new binary feature that represents whether the house was renovated at all. Although zip-code does not have a linear relation with the price, it may carry useful information about the house price, so it is treated as a categorical feature. Next, the features ‘id’, ‘date’, ‘yr built’, ‘lat’, ‘long’, ‘date yr’ and ‘yr renovated’ are removed.

4.2. Lasso Regression
In machine learning and statistics, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

Lasso works by penalizing the magnitude of the feature coefficients while minimizing the error between predicted and actual observations. It is known as an L1 regularization technique. The algorithm can be implemented with Python’s scikit-learn library [15]. Lasso attempts to minimize the cost function

Cost(w) = RSS(w) + α · (sum of the absolute values of the weights)

Here RSS refers to the ‘Residual Sum of Squares’, meaning the sum of the squared errors between the predicted and actual values in the training data set, and α is a coefficient that can take various values. There are three cases for the value of α:
1. α = 0; same coefficients as simple linear regression
2. α = ∞; all coefficients zero
3. 0 < α < ∞; coefficients between 0 and those of simple linear regression

The Lasso function can be written as

$$\mathrm{Cost}(w) = \sum_{i=1}^{N} \Big\{ y_i - \sum_{j=0}^{M} w_j x_{ij} \Big\}^2 + \alpha \sum_{j=0}^{M} |w_j|$$

Lasso addresses several of the challenges faced with ordinary linear regression and is a useful tool for fitting linear models. By shrinking the coefficients, it helps to capture the relationships in the data while avoiding over-fitting.

4.3. House Price Affecting Factors
There are several factors that affect house prices. In [16], the factors affecting the house price are divided into three main groups: physical condition, concept and location. Physical conditions are properties of a house that can be observed by human senses, including the size of the house, the number of bedrooms, the availability of a kitchen and garage, the availability of a garden, the area of land and buildings, and the age of the house [17]. The concept is an idea offered by developers to attract potential buyers, for example a minimalist home, a healthy and green environment, or an elite environment. Location is an important factor in shaping the price of a house because it determines the prevailing land price [18]. In addition, the location also determines the ease of access to public

4.4. XGBoost
XGBoost has become a widely used tool among Kaggle competitors and data scientists in industry, as it has been battle-tested for production on large-scale problems. It is a highly flexible and versatile tool that can handle most regression, classification and ranking problems as well as user-defined objective functions. As open-source software, it is easy to access and can be used through different platforms and interfaces. The portability and compatibility of the system permit its use on Windows, Linux and OS X. It also supports training on distributed cloud platforms such as AWS, Azure and GCE, among others, and it is easily connected to large-scale cloud dataflow systems such as Flink and Spark. Although it was built and initially used in the Command Line Interface (CLI) by its creator, it can also be loaded and used in various languages and interfaces such as Python, C++, R, Julia, Scala and Java.

XGBoost is an accurate and scalable implementation of gradient boosting machines. Its name stands for eXtreme Gradient Boosting; it was developed by Tianqi Chen and is now part of a wider collection of open-source libraries developed by the Distributed Machine Learning Community (DMLC). It has proven to push the limits of computing power for boosted tree algorithms, as it was built for the sole purpose of computational speed and model performance. Specifically, it was engineered to exploit every bit of memory and hardware resources for tree boosting algorithms.

The implementation of XGBoost offers several advanced features for model tuning, computing environments and algorithm enhancement. It is capable of performing the three main forms of gradient boosting (Gradient Boosting (GB), Stochastic GB and Regularized GB), and it is robust enough to support fine-tuning and the addition of regularization parameters. According to Tianqi Chen, the latter is what makes it superior to and different from other libraries. System-wise, the library’s portability and flexibility allow the use of a wide variety of computing environments, such as parallelization of tree construction across several CPU cores; out-of-core computing; distributed computing for large models; and cache optimization to improve hardware usage and efficiency.

The algorithm was developed to reduce computing time and make optimal use of memory resources. Important features of the implementation include the handling of missing values (sparsity-aware), a block structure to support parallelization in tree construction, and the ability to fit and boost on new data added to a trained model. The prediction method involves several methodologies and steps, which are described in the next section.

5. WORKING MODEL

Fig. 2: Steps involved for prediction
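As an illustration of the modelling step, the Lasso cost function of Section 4.2 can be minimised by coordinate descent with soft-thresholding. The block below is only a minimal sketch under stated assumptions and is not the implementation used in this work, which relies on scikit-learn. The function names, the toy data (two hypothetical features and a price) and the decision to leave the intercept unpenalised are all assumptions made for the example. It is meant to show the three cases for α: no penalty, a moderate penalty and a very large penalty.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values inside [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise sum_i (y_i - b - sum_j w_j * x_ij)^2 + alpha * sum_j |w_j|.

    X is a list of rows and y a list of targets. The intercept b is not
    penalised (an assumption of this sketch).
    """
    n, m = len(X), len(X[0])
    # Centre the columns and the target so the intercept can be recovered at the end.
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]

    w = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            rho = 0.0  # feature j times the residual computed without feature j
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(w[k] * Xc[i][k] for k in range(m) if k != j)
                rho += Xc[i][j] * (yc[i] - pred_without_j)
                z += Xc[i][j] ** 2
            # Closed-form minimiser of the cost along coordinate j.
            w[j] = soft_threshold(rho, alpha / 2.0) / z if z > 0 else 0.0
    b = y_mean - sum(w[j] * x_mean[j] for j in range(m))
    return b, w


# Hypothetical toy data: two made-up features per house and a price
# generated exactly as price = 50 + 30 * x1 - 2 * x2.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 6]]
y = [50 + 30 * x1 - 2 * x2 for x1, x2 in X]

b_ols, w_ols = lasso_coordinate_descent(X, y, alpha=0.0)      # case 1: no penalty
b_mid, w_mid = lasso_coordinate_descent(X, y, alpha=200.0)    # case 3: moderate penalty
b_big, w_big = lasso_coordinate_descent(X, y, alpha=1000.0)   # large penalty, behaves like case 2

for name, b, w in [("alpha=0", b_ols, w_ols),
                   ("alpha=200", b_mid, w_mid),
                   ("alpha=1000", b_big, w_big)]:
    print(name, round(b, 3), [round(v, 3) for v in w])
```

On this toy data, α = 0 recovers the generating coefficients (30 and -2), the moderate penalty shrinks the first coefficient and sets the second to exactly zero, and the large penalty drives both coefficients to zero so that only the intercept remains. Note that scikit-learn scales the residual term differently, so its `alpha` values are not directly comparable to the α used here.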
© 2019, www.IJARIIT.com All Rights Reserved Page | 372
A. N. Bharathi et al.; International Journal of Advance Research, Ideas and Innovations in Technology
a) Reading data: At this stage, the data is read and the training data is concatenated with the test data. This is done mainly because of the text (categorical) variables, which will later be replaced by dummy variables. If the training and test sets were encoded separately, they could end up with different numbers of dummy variables, which would in turn damage the prediction.

b) Data Preprocessing: This is the process of transforming raw, complex data into a systematic, understandable form. It involves finding missing and redundant data in the dataset. The entire dataset is checked for NA values, and any observation that contains an NA is deleted, which brings uniformity to the dataset. Finally, the data is split into training and test data.

c) Data Analysis: Before applying any model to our dataset, we need to find out its characteristics. Thus, we analyze the dataset and study the different parameters and the relationships between them. We can also identify the outliers present in the dataset. Outliers occur due to some kind of experimental error and need to be excluded from the dataset.

d) Feature Engineering: Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed when these features are manipulated in some way. Below are some examples of the engineered features:
• Remodeled (categorical): Yes or No, depending on whether Year Built differs from Year Remodeled; if the year the house was remodeled is different from the year it was built, the remodeling likely increases property value.
• Seasonality (categorical): Month Sold combined with Year Sold; while more houses were sold during the summer months, this likely varies across years, especially because the period in which these houses were sold coincides with the housing crash.
• New House (categorical): Yes or No, depending on whether Year Sold equals Year Built; if a house was sold the same year it was built, we might expect that it was in high demand and might have a higher Sale Price.
• Total Area (continuous): Sum of all variables that describe the area of different sections of a house; many variables pertain to the square footage of different aspects of each house, and we might expect the total square footage to have a strong influence on Sale Price.

e) Modelling: Model selection is the process of combining data and prior information to select among a group of statistical models. In building a model, decisions to include or exclude covariates, as well as uncertainty in how to code the covariates in the design matrix for any given model, are based on both the prior hypotheses and the data. Lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

6. CONCLUSION
In this paper, the LASSO regression technique was implemented to predict the price of a house. The step-by-step procedure to analyze the dataset and find the correlation between the parameters was described. This allows us to select parameters that are not correlated with each other and are independent in nature; this feature set was then given as input to the model. LASSO performs both variable selection and regularization in order to enhance the prediction accuracy.

7. REFERENCES
[1] R. M. A. van der Schaar, “Analysis of Indonesian Property Market; Overview and Foreign Ownership,” Investment Indonesian. 2015.
[2] Y. Feng and K. Jones, “Comparing multilevel modelling and artificial neural networks in house price prediction,” 2015 2nd IEEE Int. Conf. Spat. Data Min. Geogr. Knowl. Serv., pp. 108–114, 2015.
[3] Richard J. Cebula. “The Hedonic Pricing Model Applied to the Housing Market of the City of Savannah and Its Savannah Historic Landmark District”. In: The Review of Regional Studies 39.1 (2009), pp. 9–22.
[4] Gang-Zhi Fan, Seow Eng Ong, and Hian Chye Koh. “Determinants of House Price: A Decision Tree Approach”. In: Urban Studies 43.12 (2006).
[5] Gu Jirong, Zhu Mingcang, and Jiang Liuguangyan. “Housing price based on genetic algorithm and support vector machine”. In: Expert Systems with Applications 38 (2011), pp. 3383–3386.
[6] Eric Slone, Haitian Sun, Po-Hsiang Wang, (2014), “Market Prices of Houses in Atlanta”, from https://smartech.gatech.edu/bitstream/handle/1853/51632/Market%20Prices%20of%20Houses%20in%20Atlanta.pdf
[7] P. Linneman, “An empirical test of the efficiency of the housing market”. Journal of Urban Economics 20(1986): 140-154, 1986.
[8] J. M. Quigley, “Real estate prices and economic cycles”. International Real Estate Reviews 2: 1-20. 1999.
[9] K. Tsatsaronis, & H. Zhu, “What drives housing price dynamics: Cross-country evidence?” BIS Quarterly Review of March.
[10] Torgo, Luis, and Joao Gama. “Regression using classification algorithms.” Intelligent Data Analysis 1.4 (1997): 275-2.
[11] Ezgi Candas, Seda Bagdatli Kalkan and Tahsin Yomralioglu, (2015), “Determining the Factors Affecting Housing Prices”, FIG Working Week 2015, Sofia, Bulgaria, 17 - 21 May 2015.
[12] Razi, Muhammad A., and Kuriakose Athappilly. “A comparative predictive analysis of neural networks (NNs), nonlinear regression and classification and regression tree (CART) models.” Expert Systems with Applications 29.1 (2005): 65-74.
[13] Lenk M. M., Worzala E. M. and A. Silva, 1997, “High-tech Valuation: Should Artificial Neural Networks Bypass The Human Valuer?”, Journal of Property Valuation & Investment, 15(1): 8 – 26.
[14] Pedregosa, Fabian, et al. “Scikit-learn: Machine learning in Python.” Journal of machine learning research 12.Oct (2011): 2825-2830.
[15] R. A. Rahadi, S. K. Wiryono, D. P. Koesrindartotoor, and I. B. Syamwil, “Factors influencing the price of housing in Indonesia,” Int. J. Hous. Mark. Anal., vol. 8, no. 2, pp. 169–188, 2015.
[16] V. Limsombunchai, “House price prediction: Hedonic price model vs. artificial neural network,” Am. J. …, 2004.
[17] D. X. Zhu and K. L. Wei, “The Land Prices and Housing Prices Empirical Research Based on Panel Data of 11 Provinces and Municipalities in Eastern China,” Int. Conf. Manag. Sci. Eng., no. 2009, pp. 2118–2123, 2013.
[18] S. Kisilevich, D. Keim, and L. Rokach, “A GIS-based decision support system for hotel room rate estimation and temporal price prediction: The hotel brokers’ context,” Decis. Support Syst., vol. 54, no. 2, pp. 1119–1133, 2013.
[19] C. Y. Jim and W. Y. Chen, “Value of scenic views: Hedonic assessment of private housing in Hong Kong,” Landsc. Urban Plan., vol. 91, no. 4, pp. 226–234, 2009.