Housing
Postgraduate Programme
Table of Contents
Abstract
List of Figures
List of Tables
Chapter 1. Introduction
    1.1. Background
Chapter 2. Literature Review
    2.1. History
    2.3. Summary
    2.4. Arguments
    2.5. Conclusion
Chapter 3. Methodology
    3.1. Tools
    3.2. Technology
    3.4. System Design
Chapter 4
Chapter 5
    5.3. Optimum Hyperparameters
Chapter 6
    6.1. Conclusion
    6.2. Limitations
References
Appendix
    Appendix A
    Appendix B
List of Figures
Figure 1: Project Constraints
Figure 2: Project Planning
Figure 3: System Architecture
Figure 4: Histogram plot of class variable "price"
Figure 5: Boxplot for the dataset
Figure 6: Correlation heat map for the dataset
Figure 7: Histogram plot of predictor variable "area"
Figure 8: Relationship between price and area
Figure 9: Frequency plot of predictor variable "bathrooms"
Figure 10: Relationship between price and bathrooms
Figure 11: Data transformation
Figure 12: Fit models without hyperparameters
Figure 13: Fit models with hyperparameters
Figure 14: Residual and Q-Q plot for Linear Regression
Figure 15: Residual and Q-Q plot for Ridge Regression
Figure 16: Residual and Q-Q plot for Lasso Regression
Figure 17: Residual and Q-Q plot for Random Forest Regression
Figure 18: Residual and Q-Q plot for XGBoost Regression
List of Tables
Table 1: Summary Table of Related Works
Table 2: Important hyperparameters for OLS regression
Table 3: Important hyperparameters for lasso regression
Table 4: Important hyperparameters for Random Forest regression
Table 5: Important hyperparameters for XGBoost
Table 6: Statistical description of numeric variables
Table 7: Statistical description of categorical variables
Table 8: Optimum parameter values for lasso and ridge regression
Table 9: Optimum parameter values for lasso and ridge regression
Table 10: Performance of Lasso Regression with and without hyperparameters
Table 11: Performance of Ridge Regression with and without hyperparameters
Table 12: Performance of Random Forest Regression with and without hyperparameters
Table 13: Performance of XGBoost Regression with and without hyperparameters
Table 14: Comparison between performances of the models
Chapter 1
Introduction
1.1. Background
Alongside other essential necessities such as food and water, a home is one of a person's most basic needs. The demand for homes has grown quickly as people's standard of living has risen. Although some people buy homes as assets or real estate investments, the majority purchase homes to live in or as a means of support.
The real estate market positively affects a country's currency and is claimed to be a key factor in its economic growth. To meet housing demand, owners buy items such as home furnishings and appliances, and construction firms and builders buy building materials, a sign of the economic ripple effect brought on by new housing stock. In addition, customers have the resources to make a sizable investment, and a country's abundant housing stock indicates that its construction sector is in good shape (Temur et al., 2019).
Multiple global agencies and humanitarian organizations have stressed the value of housing, which is deeply ingrained in each country's economic, legislative, and commercial systems (Ebekozien et al., 2019). However, Jafari and Akhavian (2019) claimed that fluctuations in housing costs have been a problem for householders, builders, and estate developers, and Choong (2018) stated that homes are becoming expensive due to significant rising inflation in the residential sector of various regions. A possible rise in home prices affects both economic development and the standard of living of homeowners. Ultimately, investors who use their homes as assets will also be affected by this problem.
Investments in real estate frequently seem beneficial since property valuations remain relatively stable. Changes in property prices may affect household owners, financiers, policy makers, and many others, yet the real estate sector still looks like a good opportunity to invest money. In the current era of globalization, most people are drawn to investment transactions; gold, shares, and real estate are just a few of the assets commonly employed as investments. In particular, both the demand for and the sale of residential housing have risen drastically worldwide since 2011 (Glaeser et al., 2017).
Machine learning is a branch of AI that uses algorithms and techniques to extract data-driven information. Machine learning techniques are especially applicable to big data, since it would be impractical to analyze such massive amounts of data manually. In computer science, machine learning seeks algorithmic rather than purely mathematical solutions to problems. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In supervised learning, the algorithm is trained on labelled examples so that it can make predictions when new data are provided. In unsupervised learning, the algorithm looks for relationships and discovers trends within the data (Simon & Singh, 2015).
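The distinction can be illustrated with a minimal sketch using synthetic data (scikit-learn, the library employed later in this thesis, is used here purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised learning: the model is trained on labelled pairs (X, y).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])              # known target values
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[5.0]]))           # prediction for unseen input

# Unsupervised learning: only X is given; the model finds structure itself.
points = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
```

Here the supervised model learns the mapping from inputs to known targets, while the clustering algorithm groups the unlabelled points on its own.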
Today, a variety of machine learning algorithms are employed to solve real-world problems, and some perform better under particular conditions. This thesis therefore assesses the performance of regression algorithms in predicting results from a given dataset. In this project, a regression model has been developed that predicts house prices more accurately using machine learning algorithms. Since prediction in many regression methods depends not on a single trait but on a number of factors that lead to the expected result, performance will be evaluated by estimating property prices. House prices vary depending on their specific features, and the same characteristics may not cost the same amount everywhere. For example, a large house may be priced higher if it is located in a prestigious, wealthy neighborhood rather than a poor one. The "Housing Price" dataset used in this project is taken from the publicly available Kaggle website (Housing Prices Dataset, 2022). Details of the dataset will be provided in the next chapter.
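A first look at such a CSV file can be taken with Pandas. The snippet below is a sketch only: the column names are assumptions based on the variables mentioned in this thesis (price, area, bathrooms), and a tiny in-memory stand-in is used in place of the real Kaggle file, which would be loaded with `pd.read_csv("Housing.csv")`:

```python
import io
import pandas as pd

# Tiny stand-in for the Kaggle "Housing Prices Dataset"; the values and
# column names here are illustrative assumptions, not the real data.
csv_data = io.StringIO(
    "price,area,bedrooms,bathrooms,furnishingstatus\n"
    "13300000,7420,4,2,furnished\n"
    "12250000,8960,4,4,furnished\n"
    "12250000,9960,3,2,semi-furnished\n"
)
df = pd.read_csv(csv_data)

print(df.shape)                 # number of rows and columns
print(df.dtypes)                # numeric vs categorical variables
print(df["price"].describe())   # summary statistics of the target variable
```

Inspecting the shape, dtypes, and target summary in this way is the usual first step before any preprocessing.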
Better methodologies for determining house prices are required because of the property market's volatility. As a result, the prediction accuracy of housing models has drawn considerable interest from academics and been the subject of in-depth research. We are particularly keen on techniques that can gauge a home's worth based on its features and on the going rate for comparable properties. Anyone purchasing or selling a property, as well as investors choosing an investment strategy, depends on this ability to predict housing prices. We therefore chose to study the house price prediction problem on Kaggle, which enables us to dig into the variables in depth and to provide a model that can more accurately estimate home prices.
Given the various problems, prior research has demonstrated that it is possible to determine, at least to some extent, the ultimate price of a property; some of this research is discussed in the next chapter. Our objectives in this project are to evaluate the predictive performance of various models and to provide insights into how specific features affect housing prices, given the limited data available. The model also promotes transparency for customers and makes price comparison easy. Investors, home buyers, and homeowners may find this information beneficial: if a customer finds the price of a house on some website higher than the price predicted by the model, the customer can reject that house. In this way, people can make better decisions about home investment.
The House Price Index (HPI) is commonly used to track changes in the price of residential real estate in several countries. Because it is a weighted repeat-sales index, the HPI examines average price changes in repeat sales or refinancings of the same properties. Using various analytical techniques, housing economists can forecast changes in mortgage default rates, prepayments, and housing affordability in specific geographic locations (Index, 2015). However, because the HPI is a broad indicator derived from all transactions, it is of little use for projecting the price of a specific dwelling. Since numerous factors, such as location, house type, and construction year, affect the price of a property, understanding the factors that most affect a home's price is essential in addition to obtaining accurate projections.
1. Conduct in-depth research into regression techniques for the investigation of the current situation.
2. Find a proper dataset that can be used to evaluate house price prediction.
3. Apply data preprocessing methods in order to obtain clean data.
4. Investigate the variety of machine learning (ML) regression algorithms used to predict house prices and choose the model with the highest accuracy score.
5. Assess the effectiveness of machine learning models for price prediction on the housing price dataset.
6. Create a user-friendly, labor-saving approach for anticipating home prices.
7. Help buyers and sellers in the real estate business decide when the best time to buy a home is.
8. Test the model for accuracy and write conclusive advice.
1. What are the important features that affect house prices?
2. Which machine learning model is best for predicting housing prices, and why?
3. Which performance metrics are useful for evaluating the ML models?
4. Which of the candidate models is the proper one for evaluating price prediction?
1.6. Project Constraints
The project has three major constraints, which are as follows:
1. Scope
To achieve the aim of the project, one of the major components is a proper dataset. The Housing Price dataset has been downloaded in CSV format from Kaggle. Python data visualization techniques are used to analyze and visualize the dataset. After extracting insights from the dataset, the relevant features having a significant impact on price changes are identified, and irrelevant values are excluded. The dataset is divided into two components: a training set and a testing set. The training set is used to train the different ML regression models, and the testing set is then used to evaluate the performance of each model. The accuracy scores and Root Mean Square Errors (RMSE) of all the models are evaluated. Finally, housing prices are predicted using the best model, i.e., the one with the lowest RMSE.
2. Cost
The constraints of this project are negligible because no hardware components are involved.
Therefore, the project's requirements are relatively inexpensive. Computers with high processing
power including RAM and Hard disks are needed for machine learning algorithms. The
Anaconda Prompt which is a Python machine learning platform is installed to ensure the efficient
11
operation of all machine learning models. Numpy, Pandas, Sklearn, Matplotlib as well
as seaborn are the Python libraries that need to be installed on the platform.
3. Time
The length of time needed to complete a project depends on its scope and the number of components it contains. According to the system's existing requirements, the components will need to be deployed on the relevant workstation within three to four months.
Chapter 2
Literature Review
This study contributes to the growing body of knowledge about machine learning in the housing market. Numerous studies have examined how to predict property prices, including studies using exponential smoothing and traditional regression models. This chapter presents only the literature relevant to machine-learning-based housing price prediction. Numerous aspects have been taken into consideration in this comprehensive study of house prices, and relevant works on the topic are examined.
2.1. History
Numerous studies have modeled home prices and real estate values. Hedonic regression, first developed in the 1960s, has been the most often used method, since it enables the breakdown of total housing spending into the values of its individual components. Hedonic modeling assumes that a commodity is a heterogeneous item that can be divided into features such as internal structures and local geographical elements. Hedonic regression is used to quantify the association between prices and house features, as well as nearby assets.
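The hedonic idea can be sketched as a linear model whose fitted coefficients are the implicit prices of the individual attributes. The example below uses synthetic data with invented attribute values and implicit prices, purely to illustrate the decomposition:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic house attributes (all values invented for illustration).
area = rng.uniform(50, 250, n)     # floor area in m^2
rooms = rng.integers(1, 6, n)      # number of rooms
dist = rng.uniform(0, 20, n)       # distance to city centre in km

# Hedonic assumption: total price decomposes into implicit attribute prices.
price = 1000 * area + 5000 * rooms - 2000 * dist + rng.normal(0, 5000, n)

X = np.column_stack([area, rooms, dist])
model = LinearRegression().fit(X, price)

# The fitted coefficients recover the implicit price of each characteristic.
print(model.coef_)   # approximately [1000, 5000, -2000]
```

The recovered coefficients are the estimated marginal contribution of each feature to the total price, which is exactly the breakdown hedonic regression aims for.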
Wang and Wu (2018) developed a house price estimation model using Random Forests. The Random Forest model provides better predictions overall than the standard linear regression model and can better capture latent non-linear relationships between house prices and attributes. The researchers carried out a quantitative experiment on North Virginia housing market records to support the findings of their study.
By carefully considering data processing, feature engineering, and combined prediction, Fan et al. (2018) offered a house price prediction model for Ames, Iowa, based on the dataset prepared by D. De Cock (De Cock, 2011) and the contest organized by kaggle.com. After taking the logarithm of the target over the testing dataset, the Root Mean Square Error (RMSE) was 0.12019, demonstrating strong performance and a low degree of over-fitting.
In the study of Varma et al. (2018), real factors are used to predict house prices. The researchers strived to base their assessments on each fundamental factor taken into account when calculating the price, employing a variety of regression techniques. The results are determined not by a single method but by the weighted mean of several approaches, which yields the most accurate results. The outcomes demonstrated that this strategy produces the least error and the highest accuracy compared to applying single algorithms. The researchers also suggested using Google Maps to achieve precise valuations by leveraging true location information.
In Australia, Phan (2018) applied various machine learning techniques to historical real estate transaction data in order to find the best models to help home buyers and sellers. The study made clear the significant price disparity between homes in Melbourne's most and least expensive areas. Additionally, the results showed that the combination of Stepwise selection and Support Vector Machine (SVM), assessed by mean squared error (MSE), is a competitive strategy.
To make predictions, Madhuri et al. (2019) employed a variety of regression approaches: multiple linear, ridge, LASSO, gradient boosting, and AdaBoost regression. The effectiveness of each of these methods was tested on a dataset in order to forecast home prices. The study's goal was to help sellers estimate a property's selling price precisely and to help readers determine the exact timeline for buying a home. Physical condition, concept, location, and other associated aspects that affect cost were also taken into account.
Sharma et al. (2020) developed a website where users could enter property information to predict housing prices, a date up to which to predict prices, and a price range for which to suggest the best locations. The project relies on two datasets: one containing details of housing sales in Mumbai, and the other the Mumbai house price index (HPI). Current house prices were predicted using a variety of feature selection and extraction techniques, along with Multiple Linear Regression and an ARIMA model for price prediction. A content-based recommendation system was also used to suggest the best location within the area of concern for a given budget.
Thamarai and Malarvizhi (2020) used characteristics of homes including the number of bedrooms, the age of the building, the convenience of transportation from the site, and the distance from schools and shopping areas. The proposed approach models housing supply based on desired house characteristics and projected house prices. The model was created for a small town in the West Godavari region of Andhra Pradesh. The work applied decision tree classification, decision tree regression, and multiple linear regression using machine learning techniques.
Sivasankar et al. (2020) discussed machine learning algorithms used to estimate future house prices. The researchers evaluated and investigated several prediction algorithms in order to choose the best one. Past market trends, price ranges, and upcoming developments were examined in order to anticipate future prices. A method for predicting housing values was required because they increase yearly. The researchers used a variety of machine learning regression algorithms, including Lasso Regression, Ridge Regression, AdaBoost Regression, XGBoost Regression, Decision Tree Regression, and Random Forest Regression, to construct a model for predicting housing costs. All of these strategies were applied to a dataset in order to predict house prices and determine which is the most effective.
Khanum et al. (2021) developed a prediction model for evaluating prices based on a variety of price-affecting features. Supervised learning methods, including Bayesian classifiers and KNN algorithms, were used in the study. The authors used these models to build a predictive model and chose the top-performing one through a comparative analysis of the prediction errors obtained from the various models. The authors developed this concept as a useful application that would benefit both buyers and sellers in the real estate market.
In order to estimate real estate prices, Ho et al. (2021) employed three ML algorithms, applying them to a dataset of 18 years of housing transactions in Hong Kong to evaluate model performance. Compared to the Support Vector Machine (SVM), the Random Forest (RF) and Gradient Boosting (GB) models demonstrated superior performance. Vignesh described the types of information that affect home prices and constructed an accurate regression model using tree-based algorithms, employing the model on the publicly available Ames House Price dataset on Kaggle.
2.3. Summary
| Sr No. | Dataset | Authors | Date | Models |
|--------|---------|---------|------|--------|
| 1 | NJOP houses data | Alfiyatin, Febrita, Taufiq, and Mahmudy | 2017 | Combination of regression and particle swarm optimization |
| 2 | North Virginia housing prices | Wang and Wu | 2018 | Random Forest |
| 3 | D. De Cock Ames dataset | Fan | 2018 | Logistic Regression |
| 4 | Melbourne Housing Market | Phan | 2018 | Combination of Stepwise and Support Vector Machine |
| 5 | Mumbai housing market | Varma, Sarma and Doshi | 2018 | Linear Regression, Forest Regression, Boosted Regression, Neural Network |
| 6 | USA public output data | Madhuri, Anuradha and Pujitha | 2019 | Multiple linear, ridge, LASSO, gradient boosting, and AdaBoost regression |
| 7 | Mumbai housing sales and Mumbai house price index | Sharma, Sonawale, Ghonasgi, and Patankar | 2020 | Multiple Linear Regression and ARIMA model |
| 8 | Andhra Pradesh's West Godavari area housing prices | Thamarai and Malarvizhi | 2020 | Decision tree classification, decision tree regression, and multiple linear regression |
| 9 | USA public output data | Sivasankar, Ashok and Madhu | 2020 | Lasso, Ridge, AdaBoost, XGBoost, Decision Tree, and Random Forest Regression |
| 10 | Real estate data | Khanum, Pawar & Anitha | 2021 | Bayesian classifiers and KNN algorithms |
| 11 | Ames House Price | Ho, Tang & Wong | 2021 | Support Vector Machine, Random Forest and Gradient Boosting Regression |

Table 1 Summary Table of Related Works
2.4. Arguments
There are numerous property sales advertising platforms, including Zameen.com, OLX, and many more, where properties are offered for sale, purchase, or rental. Unfortunately, each of these platforms contains many pricing inconsistencies, and there are instances where comparable properties are priced unevenly, resulting in a lack of clarity and reliability. There is no way to confirm the accuracy of the information, so buyers may occasionally feel that a specific listed house's valuation is unjustifiable. In the grand scheme of things, solving this problem will help both customers and the real estate business, since the majority of consumers consider the transaction fees to be extremely expensive. Proper assessments and valid property prices can restore a great deal of confidence and transparency to the real estate industry.
As a result of the growing trend toward big data, machine learning has lately emerged as a crucial prediction approach, since it can anticipate property prices more precisely based on their features, regardless of data from previous years. Many studies have examined this topic and shown how successful machine learning algorithms are, but most of them compared model performances without taking combinations of different machine learning algorithms into account.
2.5. Conclusion
The majority of the literature review is based on full-text articles accessible online: open-access articles, Google Scholar, and platforms such as ResearchGate. The goal of the literature review is to provide strong foundations for machine learning's regression techniques and to show how they can be used to predict house prices precisely. The review covers related studies and the feature engineering techniques applied in this study. In addition, the evaluation measures employed to assess the effectiveness of the algorithms are studied, as well as the variables that were applied to the local dataset.
Chapter 3
Methodology
House price prediction can be handled using a variety of tools and technologies; the ones used in this thesis were selected for their usability and accessibility. This chapter discusses the tools, technologies, and proposed methods. It also goes through the regression methods, relevant concepts and techniques, and how they can be used to predict future house prices. Additionally, the evaluation metrics used to assess the performance of the models are discussed.
3.1. Tools
Microsoft Word
Snipping Tool (Screenshot)
Anaconda Prompt
Jupyter Notebook
3.2. Technology
Python is a widely used programming language for machine learning due to its readability and accessibility. It has a robust ecosystem of open-source libraries, and the majority of academic problems of this kind are addressed using this standard programming language. The following Python libraries are used in this thesis to preprocess, visualize, and analyze the data: Pandas, NumPy, and Matplotlib. The Sklearn (scikit-learn) library provides the implementations of the prediction models. Anaconda, which includes the Jupyter Notebook software, is used; it bundles recent, up-to-date versions of Python's libraries, which is very helpful for putting a machine learning technique into practice.
3.3. Project Planning
1. Data Gathering
The first step is to gather housing price information from an online repository. The data contain several house features and one target variable, "Price." The information gathered must be accurately categorized and organized. Solving any machine learning problem must begin with the data: the dataset must be valid, or there will be no point in the data analysis.
2. Data Preprocessing
The data are cleaned at this stage. The dataset may contain missing data, and there are three options for handling missing values: remove the rows with missing information, remove the entire variable that has missing values, or replace the missing values with the mean or median.
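The three options can be sketched with Pandas (the frame and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Small example frame with missing values (values are illustrative only).
df = pd.DataFrame({
    "price": [100, 120, np.nan, 150],
    "area": [50, np.nan, 70, 80],
})

# Option 1: drop the rows that contain missing information.
dropped_rows = df.dropna()

# Option 2: drop an entire variable (column) that has missing values.
dropped_col = df.drop(columns=["area"])

# Option 3: replace missing values with the mean (or median) of the column.
imputed = df.fillna(df.mean())
```

Which option is appropriate depends on how much data would be lost: dropping rows is safest when few rows are affected, while imputation preserves the sample size at the cost of some bias.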
3. Feature extraction
In this phase, the least important and weakly correlated variables are dropped, and the features with a high correlation with the target variable are retained.
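A simple correlation-based selection can be sketched as follows (the threshold of 0.5 and the toy data are assumptions for illustration):

```python
import pandas as pd

# Illustrative frame: keep only predictors whose absolute correlation with
# the target exceeds a chosen threshold (the threshold is an assumption).
df = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "area":  [50, 75, 100, 125, 150],   # strongly correlated with price
    "noise": [3, 1, 4, 1, 5],           # weakly correlated with price
})

corr = df.corr()["price"].drop("price")
selected = corr[corr.abs() > 0.5].index.tolist()
print(selected)   # ['area']
```

In practice the threshold would be chosen by inspecting the correlation heat map of the full dataset rather than fixed in advance.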
4. Regression modeling
The data are divided into two parts, training and testing: 80% of the data are used to train the models, while the remaining 20% are used to test them. The training set includes the target variable. Different machine learning regression algorithms are used to train the models.
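The split and training step can be sketched with scikit-learn (synthetic data; the hyperparameter values shown are arbitrary placeholders, not tuned values):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(100, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

# 80% of the data train the models; the remaining 20% test them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Several regression algorithms are trained on the same training set.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```

Fixing `random_state` makes the split and the forest reproducible, which matters when comparing models against each other.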
5. Result
Finally, the test dataset is fed to the trained models; the models are evaluated, and house prices are predicted using the best model among them.
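The RMSE used to compare the models can be computed as follows (the prediction values are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Evaluating predictions against held-out test targets (values illustrative).
y_test = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 190.0])

# RMSE is the square root of the mean squared error; lower is better.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)   # 10.0
```

The model with the lowest RMSE on the test set is then chosen as the final predictor.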
3.5.1. Linear Regression
Linear regression is the simplest and most straightforward prediction method for modeling the linear relationship between a target variable and independent factors in statistics. Simple (OLS) linear regression is used when there is only one influential factor; multiple linear regression is the procedure used when there are multiple significant predictors. Multivariate linear regression is a method that predicts several associated target variables rather than a single dependent variable (Zhou, 2020).
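In symbols, the multiple linear regression model just described is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

where y is the target (here, the house price), x_1, ..., x_p are the predictors, beta_0, ..., beta_p are the unknown parameters inferred from the dataset, and epsilon is the error term.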
Linear models use linear predictor functions to model relationships, with the model's unknown parameters inferred from the dataset. The conditional mean of the output is typically assumed to be a linear function of the values of the independent variables; the conditional median or another statistical parameter is sometimes employed instead. Like all other forms of regression, linear regression focuses on the conditional probability distribution of the output given the values of the predictors, while multivariate models focus on the joint probability distribution of these components.
Linear regression was the first regression modeling approach to be studied in depth and applied in many real-world applications. This is because models that are linear in their unknown parameters are simpler to fit than models with a non-linear relation to their factors, and the statistical properties of the resulting parameter estimates are simpler to compute (Wikimedia Foundation, 2022).
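The distinction between simple and multiple linear regression can be illustrated with a short scikit-learn sketch; the predictor names and the simulated data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
area = rng.uniform(1000, 6000, 100)   # single influential factor
bathrooms = rng.integers(1, 4, 100)   # a second predictor
price = 300 * area + 50000 * bathrooms + rng.normal(0, 10000, 100)

# Simple linear regression: one predictor only.
simple = LinearRegression().fit(area.reshape(-1, 1), price)

# Multiple linear regression: several predictors at once.
X = np.column_stack([area, bathrooms])
multiple = LinearRegression().fit(X, price)

print(simple.coef_, multiple.coef_)  # slopes recovered from the data
```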
The lasso estimator was initially developed for least-squares modeling, and this special case reveals a lot about the estimator's behavior, including its relationship to ridge regression and best-subset feature selection, known as "soft thresholding". It also shows that if covariates are linearly related, the estimated coefficients need not be unique.
Although the lasso optimizer was initially developed for least squares, it applies directly to a broad range of predictive methods, such as generalized linear models, generalized estimating equations, and multivariable logistic regression models. The type of constraint affects the lasso's capacity to perform subset selection, and it can be interpreted in a number of ways, including geometrically, probabilistically, and via convex analysis. The lasso was developed to increase the predictability and interpretability of regression models by changing the model-fitting procedure to employ only a subset of the available covariates in the final model rather than all of them.
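The soft-thresholding / subset-selection behavior can be seen in a minimal scikit-learn sketch; the synthetic data and the alpha value are arbitrary assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
# Only the first two covariates actually drive the target.
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)

# The L1 penalty soft-thresholds small coefficients to exactly zero,
# leaving a sparse subset of the covariates in the final model.
lasso = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X), y)
print(np.round(lasso.coef_, 2))
```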
The generic method of bootstrap aggregating, also known as bagging, is applied to tree learners by the random forest training algorithm. Bootstrapping improves model performance because it reduces the model's variance while the bias remains constant. In other words, although the predictions of a single tree are very sensitive to noise in its training set, the average of many trees is not, provided the trees are not correlated. Bootstrap sampling decorrelates the trees by exposing them to different training sets; simply training many trees on a single training set would produce strongly correlated trees.
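The variance-reduction argument can be demonstrated by comparing a single deep tree with a bagged forest on noisy synthetic data (the sine-shaped target is purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)

# A single deep tree fits the noise in its training sample...
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
# ...while averaging many bootstrapped trees reduces variance.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

grid = np.linspace(-3, 3, 100).reshape(-1, 1)
true = np.sin(grid[:, 0])
err_tree = np.mean((tree.predict(grid) - true) ** 2)
err_forest = np.mean((forest.predict(grid) - true) ** 2)
print(err_forest < err_tree)  # the averaged ensemble has lower error
```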
RMSE = √( (1/n) · Σᵢ₌₁ⁿ (xᵢ − x̂ᵢ)² )

In the above formula, xᵢ are the observed values and x̂ᵢ are the estimated values.

R² = 1 − (Sum of squared residuals) / (Total sum of squares)
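The two metrics can be checked numerically with a small NumPy sketch; the observed and predicted values below are made up for illustration:

```python
import numpy as np

def rmse(x, x_hat):
    # Root mean square error: sqrt of the mean squared deviation.
    return np.sqrt(np.mean((x - x_hat) ** 2))

def r_squared(x, x_hat):
    # R^2 = 1 - (sum of squared residuals) / (total sum of squares).
    ss_res = np.sum((x - x_hat) ** 2)
    ss_tot = np.sum((x - np.mean(x)) ** 2)
    return 1 - ss_res / ss_tot

observed = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 6.5, 9.5])   # every error is +/- 0.5
print(rmse(observed, predicted))             # 0.5
print(round(r_squared(observed, predicted), 2))  # 0.95
```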
Chapter 4
Q1 (first quartile): price 3430000, area 3600, bedrooms 2, bathrooms 1, stories 1, parking 0
Q3 (third quartile): price 5740000, area 6360, bedrooms 3, bathrooms 2, stories 2, parking 1
basement: 2 unique values, most frequent "no" (354 occurrences)
prefarea: 2 unique values, most frequent "no" (417 occurrences)
Table 7 Statistical Description of Categorical Variables
The distribution of the price variable is positively skewed, as shown in Figure 2, which indicates that it is not normally distributed. This is sensible and logical, since few people can afford highly luxurious housing. The price variable therefore needs to be transformed before model fitting.
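As an illustration of why such a transformation helps, the sketch below measures skewness before and after a log transform on simulated lognormal prices; the data is hypothetical, and the thesis itself applies min-max scaling in its preprocessing:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(4)
# Hypothetical right-skewed prices; a lognormal mimics the observed pattern.
prices = rng.lognormal(mean=15, sigma=0.5, size=545)

print(round(skew(prices), 2))          # clearly positive before transforming
print(round(skew(np.log(prices)), 2))  # close to 0 afterwards
```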
Boxplots of the continuous variables show that price and area are right-skewed, with outliers in the tails of the distribution. This means that while the majority of residences are less expensive than the estimate, the sample also includes some expensive homes, which pull up the average home value. For this reason, the median price is a better indicator of typical values. A similar pattern is observed for the other numeric attributes. These outliers are handled in the data preprocessing phase.
Stories and air conditioning are two factors with a correlation of at least 0.4 with price. "hotwaterheating" and basement are the two predictor variables with correlation values below 0.2, representing a weak relationship with the response variable.
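Correlations like these are typically read off pandas' `corr()`. The sketch below uses simulated columns (the names are borrowed from the dataset, the values are synthetic) to show a strong and a weak correlation side by side:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 545  # same row count as the housing dataset
df = pd.DataFrame({"area": rng.uniform(1500, 8000, n)})
z = (df["area"] - df["area"].mean()) / df["area"].std()
df["price"] = 0.6 * z + rng.normal(0, 0.8, n)   # moderately correlated
df["hotwaterheating"] = rng.integers(0, 2, n)   # unrelated predictor

corr = df.corr(numeric_only=True)["price"].drop("price")
print(corr.round(2))  # area well above 0.4, hotwaterheating below 0.2
```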
4.3.4. Predictor Variable-Area
Area has the strongest correlation with price, at 0.54. This variable gives the area of the house in square feet. The frequency distribution of the area variable is shown in the graph below.
Only a small number of houses have a very large area, whereas the majority have a small area. The distribution of this variable is not uniform; it is positively skewed.
Some houses with large areas but modest prices stand out as clear anomalies in the scatter plot below. These extreme cases cannot simply be removed; the low prices may be due to a number of factors, such as fewer facilities, judging from the furnishing-status variable. The data therefore needs to be transformed.
Figure 8 Relationship between price and area
The boxplot in Figure 10 depicts the relationship between the predictor variable bathrooms and price.
Figure 10 Relationship between price and bathrooms
We found a positive correlation between bathrooms and price. Rather than being linear, the relationship appears to be quadratic or close to it, as shown in the plot above. This relationship is easy to understand: a buyer can spend a small amount of money and get fewer bathrooms, but adding bathrooms to an existing home is exceedingly difficult and expensive.
Chapter 5
There are no missing or duplicated values in our dataset, but some outliers were detected in the section above.
A normal distribution shows how the data deviates from the mean, with the mean at the center of the symmetric curve. The data here is not normally distributed, but it can be transformed to approximate a normal distribution. A side-by-side comparison of the distributions before and after the transformation is shown in Figure 11. The histograms of price and area display a positively skewed distribution before being min-max scaled.
Figure 11 Data Transformation
The best hyperparameter combinations were first identified using a random search. Then, using grid search, narrower ranges around the hyperparameter values found by the random search were examined exhaustively.
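This two-stage search might be sketched with scikit-learn's `RandomizedSearchCV` and `GridSearchCV`; the toy regression problem and the ridge alpha range below are assumptions for illustration, not the thesis's actual search spaces:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Stage 1: random search over a wide alpha range to find a promising region.
rand = RandomizedSearchCV(
    Ridge(), {"alpha": loguniform(1e-3, 1e3)},
    n_iter=20, cv=5, random_state=0,
).fit(X, y)
best = rand.best_params_["alpha"]

# Stage 2: exhaustive grid search over a narrower range around the winner.
grid = GridSearchCV(
    Ridge(), {"alpha": np.linspace(best / 3, best * 3, 10)}, cv=5,
).fit(X, y)
print(grid.best_params_)
```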
The final alpha values after grid-search cross-validation are:
The final hyperparameter values after grid-search cross-validation are:
5.5. Models Evaluation
Table 12 Performance of Random Regression with and without hyper parameters
The table above shows that linear regression (OLS) and ridge regression both have an R-square and explained variance score greater than 0.6, indicating that these models predict the data best. Comparing the two, ridge regression has the higher R-square and explained variance score, 0.671 for both. Looking at the RMSE, ridge regression's value of 0.167 is closer to 0 than linear regression's. Lasso regression performs the worst, with negative R-square and explained variance scores as well as the highest RMSE among all the models.
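R-square and the explained variance score are reported separately because they can disagree: explained variance ignores a constant prediction bias, while R-square penalizes it. A toy example with hypothetical values (not from the thesis):

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 4.5, 6.5, 8.5])  # every prediction biased by +0.5

# The residuals are constant, so their variance is zero: EVS is perfect,
# but R-square still charges for the systematic offset.
print(explained_variance_score(y_true, y_pred))  # 1.0
print(r2_score(y_true, y_pred))                  # 0.95
```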
5.5.7. Residual and Q-Q Plots
The difference between the target variable's actual value and expected value, also known as the
prediction error, is referred to as a residual in the context of regression models. By showing the
difference between residuals on the vertical axis and the dependent variable on the horizontal
axis, the residuals plot allows identifying target areas that may be more or less mistake-prone.
The variance of the repressor's error is commonly examined using the residuals plot. If the points
are skewed about the horizontal axis, a regression model is usually suitable for the data.
The data on a Q-Q plot that closely resembles the center line denotes a normal distribution. If the
points deviate significantly from the line, the regression model needs to be adjusted by including
or excluding additional variables.
The plots above show that the model fits the data well: the residual plot displays points distributed evenly along the horizontal axis, and the Q-Q plot shows a straight line, which indicates a normal distribution.
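Diagnostic plots like these might be produced with matplotlib and scipy. The sketch below uses synthetic data and plots residuals against predicted values (a common variant of the plot described above):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, 200)  # linear signal + normal noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Residual plot: points scattered evenly about the zero line suggest
# the linear model is appropriate for the data.
ax1.scatter(model.predict(X), residuals, s=10)
ax1.axhline(0, color="red")
ax1.set(xlabel="predicted value", ylabel="residual", title="Residual plot")
# Q-Q plot: points hugging the reference line indicate normal residuals.
stats.probplot(residuals, dist="norm", plot=ax2)
fig.savefig("diagnostics.png")
```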
The plots for ridge regression likewise show a good fit: the residual plot displays points distributed along the horizontal axis, and the Q-Q plot shows a straight line for both the training and test points, indicating a normal distribution.
In Figure 16, the residual and Q-Q plots for lasso regression show that the data is not properly dispersed along the horizontal line and the points do not form a straight line for the training and test sets, meaning the residuals are not normally distributed. These observations indicate that this model is not fit to predict house prices.
The residual and Q-Q plots for Random Forest regression are somewhat better, but the model is still not fit to predict house prices accurately.
Figure 18 Residual and Q-Q plot for XgBoost Regression
The residual and Q-Q plots for XgBoost regression are likewise somewhat better, but the model is still not fit to predict house prices accurately.
Chapter 6
6.1. Conclusion
The objective of this quantitative study was to create a machine learning model using data on changes in housing prices. The study used Housing.csv, a secondary dataset from the Kaggle repository, which contained no null values. It took advantage of modern tools, including machine learning techniques, to create a predictive price model of future housing price fluctuations. The data was gathered from the Kaggle website, an online database, for cleaning, preprocessing, and analysis.
In this work, a predictive model was built that predicts house selling prices using machine learning methods. This was achieved by selecting a limited number of attributes from publicly available online data. The effect of house area on sale price was also investigated. The study found that the important price factors include total house area, number of bathrooms, bedrooms, and air conditioning. The results also show that the size of a home impacts its price more than the number of bathrooms does. Family size is a major consideration for anyone purchasing a home to live in; buyers want a home that suits their family size. The thesis also demonstrated how crucial accurate input information is for the predictive system to make precise price predictions. The study used simple, readily available data to create a simple and accurate model.
The thesis used five machine learning methods, Random Forest, Linear Regression, Ridge Regression, Lasso Regression, and Extreme Gradient Boosting (XgBoost) Regression, to develop a model that can predict house prices. Root mean square error (RMSE), R-square, and explained variance score were used to evaluate model performance. To achieve the goal of the study, these five models were fitted to the house price dataset. After evaluating their performance, linear regression and ridge regression were found to fit the data best; they predict house prices more accurately than the other three algorithms. Between the two, ridge regression is better than linear regression, with an R-square value of 0.671 and an RMSE closer to 0. Among the five, lasso regression performed the worst, with a negative R-square value and explained variance score, and it also has the largest RMSE compared to the others.
The model helps customers and potential real estate investors decide on the best time to invest in the sector or to conduct business. According to the study, ridge regression was the best price-estimation model for the real estate industry.
6.2. Limitations
The thesis's limitations are primarily attributable to the data it uses and its methodological approach. The study aims to predict house price growth in the housing market using simplified machine learning models that are easy to apply.
Different linear regression models may perform differently depending on the dataset used to train them. It is important to evaluate each model's advantages and disadvantages and then apply it appropriately. Models prone to overfitting must be identified, using more effective data transformation techniques within the bias-variance balance.
Large datasets are preferred for statistical problems, as they improve the quality of the study. The consistency of the data is a major factor in determining the accuracy of the predictions. Finding a reliable repository with a large number of features and a normally distributed dataset is the key challenge.
Optimizing the model's parameters can drastically reduce the error rate. This requires extensive experience dealing with various datasets and utilizing a variety of model stacks. The behavior of the learning rate, the number of leaves, and other tunable settings can vary from model to model.
range of variables that affect house prices. This might also lead to more accurate results for research question 1.
References
Alfiyatin, A. N., Febrita, R. E., Taufiq, H., & Mahmudy, W. F. (2017). Modeling house price
prediction using regression analysis and particle swarm optimization case study: Malang,
East Java, Indonesia. International Journal of Advanced Computer Science and
Applications, 8(10).
De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression
project. Journal of Statistics Education, 19(3).
Ebekozien, A., Abdul-Aziz, A. R., & Jaafar, M. (2019). Housing finance inaccessibility for low-
income earners in Malaysia: Factors and solutions. Habitat International, 87, 27-35.
Fan, C., Cui, Z., & Zhong, X. (2018, February). House prices prediction with machine learning
algorithms. In Proceedings of the 2018 10th International Conference on Machine
Learning and Computing (pp. 6-10).
Fernando, J. (2022, February 8). What is R-squared? Investopedia. Retrieved August 16, 2022,
from https://www.investopedia.com/terms/r/r-squared.asp
Glaeser, E., Huang, W., Ma, Y., & Shleifer, A. (2017). A real estate boom with Chinese
characteristics. Journal of Economic Perspectives, 31(1), 93-116.
Ho, W. K., Tang, B. S., & Wong, S. W. (2021). Predicting property prices with machine learning
algorithms. Journal of Property Research, 38(1), 48-70.
Housing Prices Dataset. (2022, January 12). Kaggle. Retrieved September 6, 2022, from
https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
Jafari, A., & Akhavian, R. (2019). Driving forces for the US residential housing price: a
predictive analysis. Built Environment Project and Asset Management, 9(4), 515-529.
Khanum, F., P. M., N., Pawar, N., D., V., & Anitha, R. (2021). Real estate house price
prediction using machine learning. International Journal of Engineering Science and
Computing, 11(7), 28452–284524.
Madhuri, C. R., Anuradha, G., & Pujitha, M. V. (2019, March). House price prediction using
regression techniques: a comparative study. In 2019 International conference on smart
structures and systems (ICSSS) (pp. 1-5). IEEE.
Phan, T. D. (2018, December). Housing price prediction using machine learning algorithms: The
case of Melbourne city, Australia. In 2018 International conference on machine learning
and data engineering (iCMLDE) (pp. 35-42). IEEE.
Sharma, A., Sonawale, P., Ghonasgi, D., & Patankar, S. (2022, May). House price prediction
forecasting and recommendation system using machine learning. International Research
Journal of Engineering and Technology, 7(5), 1540-1550.
Simon, A., & Singh, M. (2015). An overview of machine learning and its applications.
International Journal of Electrical Sciences & Engineering (IJESE), 22.
Sivasankar, B., Ashok, P. A., N., Madhu, G., & S. F. (2020, July). House Price Prediction.
International Journal of Engineering Science and Computing, 8(7), 2347-2693
Temur, A. S., Akgün, M., & Temur, G. (2019). Predicting housing sales in Turkey using
ARIMA, LSTM and hybrid models.
Thamarai, M., & Malarvizhi, S. P. (2020). House Price Prediction Modeling Using Machine
Learning. International Journal of Information Engineering & Electronic Business, 12(2).
Varma, A., Sarma, A., Doshi, S., & Nair, R. (2018, April). House price prediction using machine
learning and neural networks. In 2018 second international conference on inventive
communication and computational technologies (ICICCT) (pp. 1936-1939). IEEE.
Vignesh, M., Vijay, V., Krishna, S., & Sathyamoorthy, K. House price prediction using
machine learning by random forest algorithm.
Wang, C., & Wu, H. (2018). A new machine learning approach to house price estimation. New
Trends in Mathematical Sciences, 6(4), 165-171.
Wikimedia Foundation. (2022, August 11). Linear regression. Wikipedia. Retrieved August 16,
2022, from https://en.wikipedia.org/wiki/Linear_regression
Zhou, Y. (2020). Housing sale price prediction using machine learning algorithms (thesis).
Appendix
Appendix A – Data
Figure 20 Sample of Last ten rows of Dataset
Appendix B – Project Specification