
School of Computing

Postgraduate Programme

MSc in Data Analytics


Final Year Project
Abstract
Reliable estimates of current house prices are important when making property decisions. Financial
institutions use value estimates to determine liability, while home buyers and sellers rely on
them to judge whether an asking price is reasonable. Researchers and many other related
stakeholders have long been concerned with the pattern of housing prices growing steadily or
dropping suddenly, and the question of how property values change has been the subject of
numerous studies using a variety of approaches and techniques. The main drawback of existing
approaches is that house prices are determined without adequately estimating future market
patterns and price growth. This thesis project examines the problem of fluctuating home
prices. Its ultimate aims are to compare several house price models and to visualize the results.
We analyze housing price data using Python, a programming language and software environment,
and Jupyter Notebook for quantitative analysis and to develop predictive models of prices.
Furthermore, a significant amount of time is devoted to analyzing the data because there are 13
variables. Before training a model, exploratory data analysis is performed to assess the influence
of the predictor variables on house price. A correlation heat map is used to examine how the
predictor variables and housing prices are related, and data transformations are applied to handle
the outliers present in the variables. Different regression models, namely linear regression, lasso
regression, ridge regression, random forest, and XGBoost, are investigated and compared to
establish a suitable fit in order to choose a predictive approach. By inspecting the prediction
error, we are able to compare and evaluate the effectiveness of these approaches. Explained
variance score, RMSE, and R2 are the evaluation metrics applied to evaluate and compare the
performance of the models. Among the five models, ridge regression proved to be the best-fitting
model for house price prediction with R2 = 0.671, while lasso regression performed worst, with a
negative R-squared score.

Table of Contents
Abstract.......................................................................................................................................................1

List of Figures.............................................................................................................................................5

List of Tables...............................................................................................................................................6

Chapter 1.....................................................................................................................................................7

Introduction.................................................................................................................................................7

1.1. Background..................................................................................................................................7

1.2. Needs and Motivations................................................................................................................8

1.3. Problem Statement.......................................................................................................................9

1.4. Aim and Objectives...................................................................................................................10

1.5. Research Questions....................................................................................................................10

1.6. Project Constraints.....................................................................................................................11

1.7. Outline of the Project.................................................................................................................12

Chapter 2...................................................................................................................................................13

Literature Review......................................................................................................................................13

2.1. History.......................................................................................................................................13

2.2. Related Works...........................................................................................................................13

2.3. Summary....................................................................................................................................16

2.4. Arguments.................................................................................................................................17

2.5. Conclusion.................................................................................................................................18

Chapter 3...................................................................................................................................................19

Methodology.............................................................................................................................................19

3.1. Tools..........................................................................................................................................19

3.2. Technology................................................................................................................................19

3.3. Project Planning.........................................................................................................................20

3.4. System Design...........................................................................................................................20

3.5. Regression Models.....................................................................................................................21

3.5.1. Linear Regression..............................................................................................................22

3.5.2. Lasso Regression...............................................................................................................22

3.5.3. Random Forest...................................................................................................................23

3.5.4. Extreme Gradient Booster (XGBoost)...............................................................................24

3.6. Performance Metrics..................................................................................................................24

3.6.1. Root Mean Square Error (RMSE)......................................................................................24

3.6.2. Coefficient of Determination (R2)......................................................................................25

3.6.3. Explained Variance Score..................................................................................................25

3.7. Residual Analysis......................................................................................................................25

Chapter 4...................................................................................................................................................26

Dataset Description and Analysis..............................................................................................................26

4.1. Dataset Collection......................................................................................................................26

4.2. Dataset Description....................................................................................................................26

4.3. Data Analysis.............................................................................................................................27

4.3.1. Target Variable-Price.........................................................................................................28

4.3.2. Outlier Detection................................................................................................................28

4.3.3. Correlation Analysis of Variables......................................................................................29

4.3.4. Predictor Variable-Area.....................................................................................................30

4.3.5. Predictor variable-bathrooms.............................................................................................32

Chapter 5...................................................................................................................................................33

Implementation and Results......................................................................................................................33

5.1. Data Preprocessing....................................................................................................................33

5.1.1. Data Encoding....................................................................................................................33

5.1.2. Data Transformation..........................................................................................................33

5.2. Splitting the Data.......................................................................................................................34

5.3. Optimum Hyper parameters.......................................................................................................35

5.3.1. Hyper parameters – Linear regression (lasso and ridge)....................................................35

5.3.2. Hyper parameters – Random Forest and xgboost...............................................................35

5.4. Models Building........................................................................................................................36

5.5. Models Evaluation.....................................................................................................................37

5.5.1. Performance metrics..........................................................................................................37

5.5.2. Performance of Lasso Regression......................................................................................37

5.5.3. Performance of Ridge Regression......................................................................................38

5.5.4. Performance of Random Forest Regression.......................................................................38

5.5.5. Performance of XgBoost Regression.................................................................................38

5.5.6. Comparison of performances of the models.......................................................................38

5.5.7. Residual and Q-Q Plots......................................................................................................39

Chapter 6...................................................................................................................................................42

Conclusion and Future Directions.............................................................................................................42

6.1. Conclusion.................................................................................................................................42

6.2. Limitations.................................................................................................................................43

6.3. Future Work...............................................................................................................................43

References.................................................................................................................................................45

Appendix...................................................................................................................................................48

Appendix A...........................................................................................................................................48

Appendix B............................................................................................................................................57

List of Figures
Figure 1 Project Constraints......................................................................................................................11
Figure 2 Project Planning..........................................................................................................................20
Figure 3 System Architecture....................................................................................................................21
Figure 4 Histogram plot of Class Variable “price”....................................................................................28
Figure 5 Boxplot for dataset......................................................................................................................29
Figure 6 Correlation heat map for dataset..................................................................................................30
Figure 7 Histogram plot of Predictor Variable “area”................................................................................31
Figure 8 Relationship between price and area...........................................................................................31
Figure 9 Frequency Plot of Predictor Variable “bathrooms”.....................................................................32
Figure 10 Relationship between price and bathrooms...............................................................................32
Figure 11 Data Transformation..................................................................................................................34
Figure 12 Fit models without hyper parameters.........................................................................................36
Figure 13 Fit models with hyper parameters..............................................................................................37
Figure 14 Residual and Q-Q plot for linear Regression.............................................................................40
Figure 15 Residual and Q-Q plot for ridge Regression..............................................................................40
Figure 16 Residual and Q-Q plot for lasso Regression..............................................................................41
Figure 17 Residual and Q-Q plot for Random Forest Regression..............................................................41
Figure 18 Residual and Q-Q plot for XgBoost Regression........................................................................41

List of Tables
Table 1 Summary Table of Related Works................................................................................................17
Table 2 Important Hyper parameters for OLS regression..........................................................................22
Table 3 Important Hyper parameters for lasso regression..........................................................................22
Table 4 Important Hyper parameters for Random Forest regression.........................................................23
Table 5 Important hyper parameters for XgBoost......................................................................................24
Table 6 Statistical Description of Numeric Variables................................................................................27
Table 7 Statistical Description of Categorical Variables...........................................................................27
Table 8 Optimum parameter values for lasso and ridge regression............................................................35
Table 9 Optimum parameter values for random forest and XgBoost regression...................................36
Table 10 Performance of Lasso Regression with and without hyper parameters.......................................37
Table 11 Performance of Ridge Regression with and without hyper parameters.......................................38
Table 12 Performance of Random Forest Regression with and without hyper parameters......................38
Table 13 Performance of XgBoost Regression with and without hyper parameters..................................38
Table 14 Comparison between performances of the models......................................................................39

Chapter 1

Introduction

1.1. Background
Apart from other essential necessities such as food and water, finding a home is one of a person's
most basic needs today. The demand for homes has increased quickly over time as people's
standard of living has risen. Although some people buy homes as assets or as real estate
investments, the majority of people purchase homes to live in or as a means of support.

A healthy real estate market positively affects a country's currency and is claimed to be a key
factor in a country's economic growth. In order to meet housing demand, owners will buy items
like home furnishings and appliances, and construction firms or builders will buy building
materials, which is a sign of the economic ripple effect brought on by new housing stock. In
addition, customers have the resources to make a sizable investment, and a country's abundant
housing stock indicates that the construction sector is in great shape (Temur et al., 2019).

Multiple global agencies and humanitarian organizations have stressed the value of housing.
Housing is deeply ingrained in each country's economic, legislative, and commercial systems
(Ebekozien et al., 2019). However, Jafari and Akhavian (2019) claimed that fluctuating housing
costs have been a problem for householders, construction firms, and estate developers, and
Choong (2018) stated that homes are becoming expensive due to significant rising inflation
in the residential sector in various regions. A possible rise in home prices affects both the
economic development and the standard of living of homeowners. In the end, investors who are
using their home as an asset will be impacted by this problem.

Investments in real estate frequently seem beneficial since property valuations remain fairly
stable. Household owners, financiers, policy makers, and many others may be impacted by
changes in property prices. The real estate sector seems like a good opportunity to invest money,
and in the current era of globalization the bulk of people are captivated by investment
transactions. Gold, shares, and real estate are just a few of the items that are usually employed
as investments. In particular, both the demand for and the sale of residential housing have risen
drastically since 2011 globally (Glaeser et al., 2017).

Machine learning is a branch of AI that utilizes algorithms and techniques to gain data-driven
insights. Big data is an area where machine learning techniques are particularly applicable,
since it would be hard to manually analyze such massive amounts of data. In computer science,
machine learning attempts to find algorithmic solutions to problems rather than solely
mathematical ones. Furthermore, machine learning can be divided into two broad categories:
supervised learning and unsupervised learning. In supervised learning, the algorithm is trained
on labelled examples so that it can make predictions when new data is provided. In unsupervised
learning, the algorithm looks for relationships and discovers trends among the data (Simon &
Singh, 2015).

Today, a variety of machine learning algorithms are employed to solve problems that arise in
the real world, and some of them perform better under particular conditions. This thesis
therefore undertakes to assess the performance of regression algorithms in terms of predicting
results from a given dataset. In this project, a regression model is developed that predicts house
prices accurately using machine learning algorithms. Since the prediction in many regression
methods depends not only on a single trait but also on a number of factors that influence the
expected result, performance will be evaluated by estimating property prices. Prices for houses
vary depending on their specific features, and some characteristics may not cost the same
amount depending on location. For example, the price of a large house may be higher if it is
located in a prestigious wealthy neighborhood rather than a poor one. The "Housing Price"
dataset used in this project is taken from the publicly available Kaggle website (Housing Prices
Dataset, 2022). Details of the dataset are provided in Chapter 4.

1.2. Needs and Motivations


Growing unaffordability of housing has become one of the major challenges for metropolitan
cities around the world. In order to gain a better understanding of the commercialized housing
market we are currently facing, we want to identify the most influential factors behind housing
prices. Apart from the more obvious driving forces such as inflation and the scarcity of land,
there are also a number of other variables worth looking into.

Better methodologies for determining house prices are required because of the property market's
volatility. As a result, the prediction accuracy of housing models has drawn considerable
interest from academics and has been the subject of in-depth research. We are especially keen on
techniques that can gauge a home's worth based on its features, in contrast to the going rate for
comparable properties. Anyone purchasing or selling a property, as well as investors choosing
an investment strategy, depends on this ability to predict housing prices. We therefore chose
to study the house price prediction problem on Kaggle, which enables us to dig into the
variables in depth and to provide a model that can more accurately estimate home prices.

Despite the various difficulties, prior research has demonstrated that it is possible to determine,
at least to some extent, the ultimate price of a property; some of this previous research is
discussed in the next chapter. Our objectives in this project are to evaluate the predictive
performance of various models and to provide insights into how specific features affect housing
prices, given the limited data available. Such a model also supports transparency for customers
and makes price comparison easy. Investors, home buyers, and homeowners might find this
information beneficial: if a customer finds that the price of a house listed on some website is
higher than the price predicted by the model, they can reject that house. In this way, people can
make better decisions when it comes to home investment.

1.3. Problem Statement


Sellers and prospective buyers are concerned with how accurately home prices can be estimated,
as well as what features make a particular home more desirable than other homes on the market,
especially in the context of declining home valuations. A house's price may vary depending on a
diverse range of aspects, including the house's features such as location and the demand for and
availability of properties on the market. The housing market is an essential component of a
country's economy, so predicting house prices is advantageous not just for buyers but also for
economists and real estate salespeople. Housing price forecasting has been a hot topic in
research for some time now. A precise house price prediction is crucial for those involved in the
real estate market, such as householders, clients, and real estate brokers, a market which is
growing rapidly in many nations.

The House Price Index (HPI) is commonly used to track changes in the price of residential real
estate in several countries. Because it is a weighted repeat-sales index, the HPI examines average
price changes in repeat sales or refinancings of the same properties. Using various analytical
techniques, housing economists can forecast changes in the rates of mortgage defaults,
prepayments, and housing affordability in specific geographic locations (Index, 2015). However,
because the HPI is a broad indicator derived from all transactions, it is of little use for projecting
the price of a specific dwelling. Since it is well known that numerous factors, such as location,
house type, and construction year, affect the price of a property, understanding the factors that
greatly affect a home's price is essential in addition to obtaining accurate projections.

1.4. Aim and Objectives


The aim of this thesis is to investigate, build, and implement a regression model that estimates
housing prices based on specific factors. The objectives of this research project are as follows:

1. Conduct in-depth research into regression techniques to investigate the current state of
the art.
2. Find a proper dataset that can be used to evaluate house price prediction.
3. Apply data preprocessing methods in order to obtain clean data.
4. Investigate the variety of machine learning (ML) regression algorithms used to predict
house prices and then choose the model that has the highest accuracy score.
5. Assess the effectiveness of machine learning models for price prediction on the housing
price dataset.
6. Create a user-friendly, labour-saving approach for anticipating home prices.
7. Help the real estate business, buyers, and sellers decide when the best time to buy a
home is.
8. Test the model for accuracy and write conclusive advice.

1.5. Research Questions


This project addresses the following research questions:

1. What are the important features that affect house prices?
2. Which machine learning model is best for predicting housing prices, and why?
3. Which performance metrics are useful for evaluating the ML models?
4. Among the candidate models, which is the most suitable for price prediction?

1.6. Project Constraints
The project has three major constraints, which are as follows:

Figure 1 Project Constraints

1. Scope

One of the major components needed to achieve the aim of the project is a proper dataset. The
Housing Price dataset has been downloaded in CSV format from Kaggle. Python data
visualization techniques are used to analyze and visualize the dataset. After insights are extracted
from the dataset, the relevant features having a significant impact on price changes are identified
and irrelevant values are excluded. The dataset has two components, a training set and a testing
set: the training set is used to train the different ML regression models, and the performance of
each ML regression model is then evaluated using the testing set. The accuracy scores and root
mean square errors of all the models are evaluated. Finally, housing prices are predicted using
the best model, i.e., the one with the lowest RMSE.

2. Cost

The cost constraints of this project are negligible because no specialized hardware components
are involved, so the project's requirements are relatively inexpensive. A computer with
reasonable processing power, RAM, and disk space is needed to run the machine learning
algorithms. Anaconda, a Python machine learning distribution, is installed to ensure the efficient
operation of all machine learning models. NumPy, Pandas, Sklearn, Matplotlib, and Seaborn are
the Python libraries that need to be installed on the platform.

3. Time

The length of time needed to complete a project depends on its scope and the number of
components it contains. According to the system's existing requirements, deploying the
components on the relevant workstation will take around three to four months.

1.7. Outline of the Project


Chapter 2 analyzes previously completed works that employed a variety of machine learning
techniques to predict house prices. Chapter 3 discusses the procedures followed in the selection
of tools, technologies, system design, and machine learning algorithms, as well as the evaluation
metrics used to assess the performance of the models. Chapter 4 presents the data description and
exploratory data analysis, including steps such as gathering, cleaning, pre-processing, and
transforming nominal attributes. Chapter 5 describes how the models are built, identifies the
best-fitting model, and analyzes the results. Finally, Chapter 6 presents the conclusion, the
limitations of the study, and future recommendations.

Chapter 2

Literature Review
This study contributes to the growing body of knowledge about machine learning in the
housing market. Numerous studies have been conducted on how to predict property prices,
including those using exponential smoothing and traditional regression models. This chapter
presents only the literature relevant to machine-learning-based housing price prediction.
Numerous aspects have been taken into consideration in this comprehensive study of house
prices, and relevant works on the topic are examined.

2.1. History
Numerous studies have been conducted for the purpose of modeling home prices and real estate
values. Hedonic regression, which was first developed in the 1960s, has been the most often
used method, since it enables the breakdown of total housing spending into the values of its
individual components. This widely used technique of hedonic modeling assumes that a
commodity is a heterogeneous good that may be divided into features such as internal structure
and local geographical elements. Hedonic regression is used to quantify the association
between prices and house features, as well as nearby assets.

2.2. Related Works


Regression analysis and particle swarm optimization were used by Alfiyatin et al. (2017) to
forecast home values in Malang city using "NJOP" house data. Regression analysis was used to
find the best coefficients for prediction, while particle swarm optimization was used to select the
affecting factors. The findings of this study demonstrated the suitability of combining regression
and particle swarm optimization, obtaining a minimum prediction error of IDR 14.186.

Wang and Wu (2018) developed a house price estimation model using random forests. The
random forest model provides better predictions overall than the standard linear regression
model and can better capture latent non-linear relationships between the prices and attributes of
houses. The researchers carried out a quantitative experiment on North Virginia housing market
records to firmly support the findings of this study.

By carefully considering data processing, feature engineering, and combined prediction, Fan et
al. (2018) offered a house price prediction model for Ames, Iowa, based on the data set prepared
by D. De Cock (De Cock, 2011) and the contest organized by kaggle.com. After taking the
logarithm of the testing data, the Root Mean Square Error (RMSE) was 0.12019, demonstrating
strong performance and a low degree of over-fitting.

In the study of Varma et al. (2018), real factors were used to predict house prices. The
researchers strove to base the assessments on each fundamental factor taken into account when
calculating the price, employing a variety of regression techniques. The results are not
determined by one method alone, but rather by the weighted mean of several approaches, which
provides the most accurate results. The outcomes demonstrated that this strategy produces the
least error and highest accuracy compared to applying single algorithms. The researchers also
suggested using Google Maps to achieve precise valuations by leveraging true location
information.

In Australia, Phan (2018) employed various machine learning techniques on past real estate
transaction data in order to find the best models to help home buyers and sellers. The study
highlighted the significant price disparity between homes in Melbourne's most and least
expensive areas. Additionally, the results showed that the combination of stepwise selection and
Support Vector Machine (SVM), assessed with mean squared error (MSE), is a competitive
strategy.

To make predictions, Madhuri et al. (2019) employed a variety of regression approaches:
multiple linear, ridge, LASSO, gradient boosting, and AdaBoost regression. The effectiveness of
each of these methods was tested on a data set in order to forecast home prices. The study's goal
was to help sellers estimate a property's selling price precisely and to help readers determine the
right time to buy a home. Physical condition, concept, location, and other associated aspects that
affect cost were also taken into account in this study.

Sharma et al. (2020) developed a website where users could enter property information to predict
housing prices, a date up to which to predict prices, and a price range for which to suggest the
best locations. The project relies on two datasets, one containing details of housing sales in
Mumbai and the other containing the Mumbai house price index (HPI). The current house price
was predicted using a variety of feature selection and extraction techniques, along with multiple
linear regression and an ARIMA model for price prediction. A content-based recommendation
system was also used to suggest the best location within the area of concern for a given budget.

In their study, Thamarai and Malarvizhi (2020) used characteristics of homes including the
number of bedrooms, the age of the building, the convenience of transportation from the site,
and the distance from schools and shopping areas. The proposed approach models housing
supply based on desired house characteristics and projected house prices. The model was created
for a small town in the West Godavari region of Andhra Pradesh. The work involved decision
tree classification, decision tree regression, and multiple linear regression using machine
learning techniques.

Sivasankar et al. (2020) discussed machine learning algorithms used to estimate future house
prices. The researchers evaluated and investigated several prediction algorithms in order to
choose the best one. Past market trends, price ranges, and upcoming developments were
examined in order to anticipate future prices. A method for predicting housing values was
required because they increase yearly. The researchers used a variety of machine learning
regression algorithms, including lasso regression, ridge regression, AdaBoost regression,
XGBoost regression, decision tree regression, and random forest regression, to construct a model
for predicting housing costs. All of these strategies were applied to a data set in order to predict
house prices and determine which is the most effective.

Khanum et al. (2021) developed a prediction model for evaluating price based on the price-
affecting variables. Supervised learning methods, including Bayesian classifiers and KNN
algorithms, were used in the study. The authors used these models to build a predictive model
and chose the top-performing model by comparing the prediction errors obtained across the
various models. The authors developed this concept as a practical application that would benefit
both buyers and sellers in the real estate market.

In order to estimate real estate prices, Ho et al. (2021) employed three ML algorithms. The
algorithms were applied to a data set of 18 years of housing transactions in Hong Kong to
evaluate the performance of the models. Compared to the Support Vector Machine (SVM), the
Random Forest (RF) and Gradient Boosting (GB) models demonstrated superior performance.
Vignesh described the types of information that affect home prices and constructed an accurate
regression model utilizing tree-based algorithms, employing the model on the publicly available
Ames House Price dataset on Kaggle.

2.3. Summary
Sr No. | Dataset | Authors | Date | Models
1 | NJOP houses data | Alfiyatin, Febrita, Taufiq, and Mahmudy | 2017 | Combination of regression and particle swarm optimization
2 | North Virginia housing prices | Wang and Wu | 2018 | Random forest
3 | De Cock's Ames housing dataset | Fan | 2018 | Logistic regression
4 | Melbourne housing market | Phan | 2018 | Combination of stepwise selection and Support Vector Machine
5 | Mumbai housing market | Varma, Sarma and Doshi | 2018 | Linear regression, forest regression, boosted regression, neural network
6 | USA public output data | Madhuri, Anuradha and Pujitha | 2019 | Multiple linear, ridge, LASSO, gradient boosting, and AdaBoost regression
7 | Mumbai housing sales and Mumbai house price index | Sharma, Sonawale, Ghonasgi, and Patankar | 2020 | Multiple linear regression and ARIMA model
8 | Andhra Pradesh's West Godavari area housing prices | Thamarai and Malarvizhi | 2020 | Decision tree classification, decision tree regression, and multiple linear regression
9 | USA public output data | Sivasankar, Ashok and Madhu | 2020 | Lasso, ridge, AdaBoost, XGBoost, decision tree, and random forest regression
10 | Real estate data | Khanum, Pawar & Anitha | 2021 | Bayesian classifiers and KNN algorithms
11 | Ames House Price | Ho, Tang & Wong | 2021 | Support Vector Machine, random forest and gradient boosting regression

Table 1 Summary Table of Related Works

2.4. Arguments
There are numerous property sales advertising platforms, including Zameen.com, OLX, and many
more, where properties are offered for sale, purchase, or rent. Unfortunately, there are many
pricing inconsistencies on each of these platforms, and there are instances where comparable
properties are priced unevenly, which results in a lack of clarity and reliability. There is no way
to confirm the accuracy of the information, so buyers may occasionally feel that a specific listed
house's valuation is unjustifiable. In the grand scheme of things, solving this problem will help
both customers and the real estate business, since the majority of consumers consider the
transaction fees extremely expensive. Proper assessments and valid property prices can restore a
great deal of confidence and transparency to the real estate industry.

As a result of the growing trend toward big data, machine learning has lately emerged as a
crucial prediction approach, since it can anticipate property prices more precisely based on their
features, regardless of the data from previous years. Many studies have examined this topic and
shown how successful machine learning algorithms are, but most of them compared model
performances without taking combinations of different machine learning algorithms into
account.

2.5. Conclusion
The majority of the literature review is based on full-text articles that are accessible online, open
access articles, Google Scholar, and platforms such as ResearchGate. The goal of the literature
review is to provide strong foundations for machine learning regression techniques, as well as
how they can be used to precisely predict house prices. The review of related studies and the
feature engineering techniques applied in this study are provided in the literature study. In
addition, the evaluation measures employed to assess the effectiveness of the algorithms are
studied, as well as the variables that were applied to the local dataset.

Chapter 3

Methodology
House price prediction can be handled using a variety of tools and technologies. The ones
used in this thesis were selected for their usability and accessibility. This chapter discusses the
tools, technologies, and proposed methods. It also goes through regression methods, together
with relevant concepts and techniques, and how they can be used to predict future house prices.
Additionally, the evaluation metrics used to assess the performance of the models are discussed.

3.1. Tools
• Microsoft Word
• Snipping Tool (Screenshot)
• Anaconda Prompt
• Jupyter Notebook

3.2. Technology
Python is a widely used programming language for machine learning due to its readability and
accessibility. It offers a robust development experience built on open-source libraries, and a
majority of academic problems are tackled using this standard programming language.

The following Python libraries are used in this thesis to preprocess, visualize, and analyze the
data: Pandas, NumPy, and Matplotlib. The Sklearn (scikit-learn) library provides the
implementation of the prediction models. Anaconda, which includes the Jupyter Notebook
software, is used; it bundles recent, updated versions of Python's libraries, which is very helpful
when putting a machine learning technique into practice.
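As a minimal sketch of this setup (the file name Housing.csv is an assumption, since the exact name of the downloaded Kaggle file is not fixed here):

    import pandas as pd

    # Load the Kaggle housing price dataset (assumed to be saved locally as Housing.csv)
    df = pd.read_csv("Housing.csv")

    # Quick sanity check: 545 rows and 13 columns are expected (see Chapter 4)
    print(df.shape)
    print(df.head())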

3.3. Project Planning

Figure 2 Project Planning

3.4. System Design


1. Data Collection

The first step is to gather housing price information from an online repository. The data contains
several features of houses and one target variable, "Price". The information gathered must be
accurately categorized and organized. Solving any machine learning problem must start with the
data it entails: the dataset must be valid, or there is no point in the data analysis.

2. Data Preprocessing

The data is cleaned up at this stage. There could be incomplete data in the dataset. There are
three options for handling missing values: remove the rows with missing information, remove
the entire variable containing missing values, or replace the missing values with the mean or
median.
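A minimal pandas sketch of the three options, assuming the DataFrame df loaded in Section 3.2 (purely illustrative here, since Chapter 5 notes that this particular dataset has no missing values):

    # Option 1: drop the rows that contain missing values
    df_rows_dropped = df.dropna()

    # Option 2: drop entire columns that contain missing values
    df_cols_dropped = df.dropna(axis=1)

    # Option 3: impute missing values with the column mean (median works the same way)
    df_imputed = df.fillna(df.mean(numeric_only=True))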

3. Feature extraction

In this phase, the least important, weakly correlated variables are dropped, and features with a
high correlation with the target variable are retained, as sketched below.
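A minimal sketch of such correlation-based screening, assuming the DataFrame df from Section 3.2 (the 0.2 cut-off is an illustrative choice, not a threshold fixed by this thesis):

    # Correlation of every numeric predictor with the target
    corr_with_price = df.corr(numeric_only=True)["price"].drop("price")

    # Keep predictors whose absolute correlation exceeds the illustrative cut-off
    selected = corr_with_price[corr_with_price.abs() > 0.2].index.tolist()
    print(selected)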

4. Regression modeling

The data is divided into two parts, training and testing. 80% of the data is used to train the
model, while the remaining 20% is used to test it. Both parts retain the target variable: the
models learn from it during training and are scored against it during testing. Different machine
learning regression algorithms are used to train the models.
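A minimal sketch of the 80/20 split with scikit-learn, assuming the DataFrame df with a price column:

    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["price"])   # predictor variables
    y = df["price"]                  # target variable

    # 80% for training, 20% for testing; random_state fixes the shuffle for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)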

5. Result

Finally, the test dataset is fed to the trained models, the models are evaluated, and house prices
are predicted using the best model among them.

Figure 3 System Architecture

3.5. Regression Models


Regression analysis is a statistical method used to ascertain the contribution of a group of
independent variables to a dependent variable. It is a suitable method for this thesis project since
it enables us to evaluate the unique contribution of each factor, and finding the factors that have
a greater impact on a house's overall price is one of the specific objectives. Additionally, we
apply regression techniques to develop and assess prediction models on the house price dataset.

3.5.1. Linear Regression
Linear regression is the simplest and most straightforward prediction method for modeling the
linear relationship between a target variable and independent factors in statistics. Simple (OLS)
regression is used when there is only one influential factor; multiple linear regression is the
procedure used when there are multiple significant predictors. Multivariate linear regression is a
method that predicts numerous associated target variables as opposed to a single dependent
variable (Zhou, 2020).

Linear models use linear predictor functions to model relationships, with the model's unknown
parameters inferred from the dataset. The conditional mean of the output is typically assumed to
be a linear function of the values of the independent variables; the conditional median or another
statistic is sometimes employed. While multivariate models focus on the joint probability
distribution of these components, linear regression, like all other forms of regression model,
focuses on the conditional probability distribution of the output given the values of the
predictors.

Linear regression was the first regression modeling approach to be researched in depth and
applied in many real-world applications. This is because models with a linear relation to their
parameters are simpler to fit than models with a non-linear relation, and it is also simpler to
derive the statistical properties of the resulting parameter estimates (Wikimedia Foundation,
2022).

Hyper parameters | Detail
alpha            | Controls regularization (L2) strength
max_iter         | Maximum number of iterations
tol              | Precision of the solution
solver           | Computation routine

Table 2 Important Hyper parameters for OLS regression
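A minimal scikit-learn sketch of fitting an L2-regularized (ridge) linear model with the hyper parameters of Table 2, assuming the X_train/y_train split from Section 3.4 with categorical columns already encoded (the alpha value is illustrative):

    from sklearn.linear_model import Ridge

    # alpha sets the L2 penalty strength; tol and max_iter govern the solver
    model = Ridge(alpha=1.0, max_iter=1000, tol=1e-4, solver="auto")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)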

3.5.2. Lasso Regression


The lasso regression modeling approach in machine learning combines feature selection
and regularization to improve the predictability and interpretability of the predictive method it
generates. The lasso estimator was initially developed for least squares modeling, and this
special case reveals a lot about the estimator's behavior, including its relationship to ridge
regression and optimal feature selection, and is therefore known as "soft thresholding". It also
demonstrates that if covariates are linearly related, the estimated coefficients do not necessarily
have to be unique.

Although the lasso optimizer was initially developed for least squares, it is directly applicable to
a broad range of predictive methods, such as generalized linear models, generalized estimating
equations, and multivariable logistic regression models. The form of the constraint affects
lasso's capacity to perform subset selection, and it can be interpreted in a number of ways,
including in terms of geometry, probabilistic analysis, and linear analysis. By changing the
model fitting procedure to employ only a selection of the available covariates in the final model
rather than all of them, lasso was developed to increase the predictability and applicability of
regression models.

Hyper parameters | Detail
alpha            | Controls regularization (L1) strength
max_iter         | Maximum number of iterations
tol              | Precision of the solution
selection        | Order in which features are updated (random, cyclic)

Table 3 Important Hyper parameters for lasso regression
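A minimal sketch of fitting lasso under the same assumptions as the ridge example (illustrative alpha; encoded, numeric features):

    from sklearn.linear_model import Lasso

    # alpha sets the L1 penalty; selection="cyclic" updates coefficients in order
    model = Lasso(alpha=1.0, max_iter=1000, tol=1e-4, selection="cyclic")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)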

3.5.3. Random Forest


A random forest, as the name suggests, is an ensemble of several independent decision trees.
Each tree in the random forest individually casts a prediction; for classification, the forecast
generated by the model is the class with the most votes, while for regression the individual
predictions are averaged. According to data science theory, a random forest performs better than
any of its component models when it combines many largely statistically independent models
(Louppe, 2014).

The random forest training algorithm applies the generic method of bootstrap aggregating, also
known as bagging, to tree learners. Because the model's variance is reduced while the bias
remains roughly constant, the bootstrapping method improves model performance. In other
words, although the predictions of a single tree are very sensitive to noise in its training set, the
average of multiple trees is not, provided the trees are not correlated. Bootstrap sampling is a
technique for de-correlating trees by exposing them to different training sets; merely training
many trees on a single training set would result in strongly correlated trees.

Hyper parameters | Detail
n_estimators     | Number of trees in the forest
max_depth        | Maximum depth of the trees
max_features     | Maximum number of features considered per split
criterion        | Metric measuring the quality of a split

Table 4 Important Hyper parameters for Random Forest regression
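A minimal sketch of fitting a random forest regressor with the hyper parameters of Table 4 (all values illustrative):

    from sklearn.ensemble import RandomForestRegressor

    # An ensemble of n_estimators trees, each grown to at most max_depth levels
    model = RandomForestRegressor(n_estimators=100, max_depth=10,
                                  max_features="sqrt", criterion="squared_error",
                                  random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)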

3.5.4. Extreme Gradient Booster (XGBoost)


XGBoost is a decision-tree-based ensemble machine learning approach that makes use of the
gradient boosting framework. For prediction problems involving unstructured data, artificial
neural networks frequently perform better than existing algorithms or frameworks. However,
decision-tree-based algorithms are currently considered the best for managing small to medium
volumes of structured data.

Hyper parameters | Detail
n_estimators     | Number of boosted trees
max_depth        | Maximum depth of the trees
gamma            | Minimum loss reduction required to make a split
learning_rate    | Learning rate (shrinkage of each boosting step)

Table 5 Important hyper parameters for XgBoost
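A minimal sketch using the xgboost package's scikit-learn interface with the hyper parameters of Table 5 (all values illustrative; the xgboost package must be installed separately):

    from xgboost import XGBRegressor

    # Gradient-boosted trees; learning_rate shrinks each tree's contribution
    model = XGBRegressor(n_estimators=200, max_depth=4,
                         gamma=0, learning_rate=0.1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)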

3.6. Performance Metrics

3.6.1. Root Mean Square Error (RMSE)


"The square root of the second sample moment of the differences between predicted values and
observed values, or the quadratic mean of these differences", is represented by the root mean
square error (RMSE). The square root of the mean squared errors is known as root mean square
error (RMSE). Greater errors have an excessively high impact on root mean square error
(RMSE) because the impact of every error on RMSE is proportional to the amount of the squared
error. Root mean square error (RMSE) thus becomes prone to outliers. The Root mean square
error is calculated as:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^2}{n}}

In the above formula, the x_i are the observed values and the \hat{x}_i are the estimated values.

3.6.2. Coefficient of Determination (R2)


The coefficient of determination, R2, is the percentage of the variation in the dependent variable
that can be predicted from the independent variables. It is not suitable to base the evaluation of
the goodness of fit of predicted vs. actual values on the R2 of a linear regression between them:
the goodness-of-fit analysis should only take into account one specific linear correlation, the
one-to-one line, while R2 measures the strength of any linear correlation between predicted and
actual values (Fernando, 2020).

R^2 = 1 - \frac{\text{sum of squared residuals}}{\text{total sum of squares}}

3.6.3. Explained Variance Score


The explained variance measures the degree to which a machine learning model accounts for the
variation in its predictions; in a nutshell, it is based on the discrepancy between the predicted and
expected values. "An Explained Variance Regression Score is a Linear Regression Score
Function based on Explained Variance."

\text{explained variance} = 1 - \frac{\operatorname{Var}(\text{expected value} - \text{predicted value})}{\operatorname{Var}(\text{expected value})}
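A minimal sketch computing the three metrics above with scikit-learn, assuming y_test (observed) and y_pred (predicted) from a fitted model:

    import numpy as np
    from sklearn.metrics import (mean_squared_error, r2_score,
                                 explained_variance_score)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # root mean square error
    r2 = r2_score(y_test, y_pred)                        # coefficient of determination
    evs = explained_variance_score(y_test, y_pred)       # explained variance
    print(rmse, r2, evs)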

3.7. Residual Analysis


Violations of the critical assumptions made by regression models are likely to distort the
estimated gradients or predicted values, so any inferences drawn from such results would be
incorrect. One of these assumptions is that there is a (multivariate) linear relationship between
the predictor factors and the output. When the points in a residuals-versus-fitted-values plot are
randomly dispersed above and below the horizontal line, linearity is likely satisfied. Likewise,
when a normal quantile-quantile (Q-Q) plot resembles a straight line, the normality assumption
holds. In addition, a random dispersion above and below the horizontal line in a scale plot
demonstrates homoscedasticity of the residual variance.
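A minimal sketch of the residual and Q-Q plots used in Chapter 5, assuming y_test and y_pred from any of the fitted models:

    import matplotlib.pyplot as plt
    from scipy import stats

    residuals = y_test - y_pred

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Residuals vs. fitted values: a random scatter around zero suggests linearity
    ax1.scatter(y_pred, residuals, alpha=0.5)
    ax1.axhline(0, color="red")
    ax1.set_xlabel("Fitted values")
    ax1.set_ylabel("Residuals")

    # Normal Q-Q plot: points close to the line suggest normally distributed residuals
    stats.probplot(residuals, dist="norm", plot=ax2)
    plt.show()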

Chapter 4

Dataset Description and Analysis


The purpose of this chapter is to present a thorough preliminary analysis of the selected housing
price dataset. In addition, we want to identify the outliers and study the distributions and
descriptive statistics of all the variables in the dataset. The relationships between pairs of
variables are also discussed in this section. Correlation is a technique frequently used to examine
how two variables are linearly related, and it is not unusual to discover that some house features
have a strong correlation with one another.

4.1. Dataset Collection


Data collection is one of the initial phases in developing any model. The data that is gathered
must be as accurate as possible because it directly affects how well the model predicts. Data
integrity may be affected by how appropriately data are selected for a research project. The
major objective of data selection is to identify the appropriate data types, sources, and methods
that allow researchers to properly address the problem of the study. The nature of the research,
the volume of preceding research, and the availability of access to pertinent data sources are the
primary determinants of this choice. The publicly available dataset of housing prices used in this
project is collected from the Kaggle repository and contains 13 variables and 545 records.

4.2. Dataset Description


There are 13 variables in the dataset, of which 6 are numeric and 7 are categorical. These include
dimensions such as the area, the number of bedrooms, the number of bathrooms, and other
information that a typical home buyer would want to know before buying a property. A
statistical description of all factors is given below in Table 6 and Table 7.

           Price         Area      Bedrooms  Bathrooms  Stories   Parking
Mean       4.766729e+06  5150.541  2.965138  1.286239   1.805505  0.693578
Std. dev.  1.870440e+06  2170.141  0.738064  0.502470   0.867492  0.861586
Minimum    1750000       1650      1         1          1         0
Q1         3430000       3600      2         1          1         0
Median     4340000       4600      3         1          2         0
Q3         5740000       6360      3         2          2         1
Maximum    13300000      16200     6         4          4         3

Table 6 Statistical Description of Numeric Variables

                   Unique  Top             Frequency
Main Road          2       Yes             468
Guest Room         2       No              448
Basement           2       No              354
Hot Water Heating  2       No              520
Air Conditioning   2       No              373
Prefarea           2       No              417
Furnishing Status  3       Semi-Furnished  227

Table 7 Statistical Description of Categorical Variables
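A minimal sketch of how Tables 6 and 7 can be produced with pandas, assuming the DataFrame df:

    # Summary statistics of the 6 numeric variables (Table 6)
    print(df.describe())

    # Unique values, mode, and mode frequency of the 7 categorical variables (Table 7)
    print(df.describe(include="object"))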

4.3. Data Analysis


Data exploration, the first phase of data analysis, entails exploring a data set's key qualities, such
as its size, accuracy, initial patterns in the data, and other factors. Although a more sophisticated
statistical tool such as Python can be used, it is often carried out by data analysts utilizing visual
analytics tools. Before implementing machine learning techniques, one must be aware of a
number of factors, including the number of instances, the attributes in the dataset, missing
values, and the general assumptions the data is expected to support. A preliminary analysis of
the data set can help resolve these questions by introducing analysts to the data they will be
working with. Data visualization is the process of presenting information and data graphically.
By leveraging visual features like charts, graphs, and maps, data visualization tools provide a
simple method to analyze and detect patterns, outliers, and correlations within data properties. In
the realm of big data, data visualization techniques and technologies are vital for analyzing
enormous amounts of data and making statistically grounded choices.

4.3.1. Target Variable-Price

Figure 4 Histogram plot of Class Variable “price”

The distribution of the price variable is positively skewed, as shown in Figure 4, which indicates
that it is not normally distributed. This is sensible and logical, since hardly anyone can afford
highly luxurious housing. For model fitting, the price variable therefore needs to be transformed.
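A minimal sketch of the histogram in Figure 4, assuming the DataFrame df:

    import matplotlib.pyplot as plt

    # Right-skewed distribution of the target variable
    df["price"].plot(kind="hist", bins=30, edgecolor="black")
    plt.xlabel("price")
    plt.show()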

4.3.2. Outlier Detection


Outliers are measurements that deviate significantly from the overwhelming bulk of the values in
the remaining data. Outliers affect regression results and can make models more difficult to
analyze. Data exploration and data visualization are effective methods for finding probable
anomalies. Some outliers fall outside the measurement range that matters for our analysis, while
others do not necessarily represent inaccurate data. The boxplot given below shows the outliers
present in our dataset; the graph shows that the price and area variables have a large number of
outliers.

Figure 5 Boxplot for dataset

Boxplots of the continuous variables demonstrate that price and area are right-skewed, with
outliers in the tails of the distribution. This means that while the majority of residences are less
expensive than the mean estimate, the sample also includes some expensive homes, which pull
up the apparent value of a typical home. The median price is therefore a better indicator of
typical values. A similar pattern is observed for the other numeric attributes. We will handle
these outliers in the data preprocessing phase.
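A minimal sketch of the boxplot in Figure 5 together with an IQR-based outlier count (the 1.5 x IQR rule is the standard boxplot convention, not a threshold specific to this thesis):

    import matplotlib.pyplot as plt

    # Boxplots of the numeric columns (Figure 5)
    df.select_dtypes("number").boxplot()
    plt.show()

    # Count the points lying beyond 1.5 * IQR for the price variable
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
    print(len(outliers))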

4.3.3. Correlation Analysis of Variables


All of the variables have a positive correlation with the response variable, price. Two numerical
variables, area and bathrooms, have correlations with price greater than 0.5, both positive; their
relationships with price are illustrated graphically below. The greater the area of the house, the
greater its price, and the same holds for the number of bathrooms.

Figure 6 Correlation heat map for dataset

Stories and air conditioning are two factors with correlations of at least 0.4 with price. The
predictors "hotwaterheating" and basement have correlation values below 0.2, which represents
a weak relationship with the response variable.
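A sketch of how such a correlation heat map can be produced is given below. It assumes the
DataFrame df and uses seaborn, one common choice for this plot; this is an illustrative sketch
rather than the exact plotting code of the thesis.

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric (or already encoded) attributes.
corr = df.select_dtypes("number").corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.tight_layout()
plt.show()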

4.3.4. Predictor Variable-Area
The area has the strongest correlation with price, 0.54. This variable gives the area of the house
in square feet. The frequency distribution of the area variable is shown in the graph below.

Figure 7 Histogram plot of Predictor Variable “area”

Only a small number of houses have a very large area, whereas the majority have a small area.
The distribution of this variable is not uniform; it is positively skewed.

Some houses with large areas but modest prices stand out as glaring anomalies in the scatter
plot below. We cannot simply remove these extreme cases; the low prices may be due to a
number of factors, such as fewer facilities, judging from the furnishing status variable. The data
therefore needs to be transformed.

Figure 8 Relationship between price and area

4.3.5. Predictor variable-bathrooms


Bathrooms and price have a correlation of 0.52, the second highest. This strong association is
easily explained: the greater the number of bathrooms, the higher the price of the house. The
frequency plot below shows the counts of bathrooms:

Figure 9 Frequency Plot of Predictor Variable “bathrooms”

The boxplot in Figure 10 depicts the relationship between the predictor variable bathrooms and price.

Figure 10 Relationship between price and bathrooms

We found a positive correlation between bathrooms and price. As the plot above shows, the
relationship appears to be quadratic, or close to it, rather than linear. This relationship is easy to
understand: a buyer can spend a small amount of money and accept fewer bathrooms, but it is
exceedingly difficult and expensive to add bathrooms to an existing home.

Chapter 5

Implementation and Results


In this section, the data preprocessing techniques will be implemented to clean and transform the
data. Different regression techniques will then be employed on the housing price data. The goal
is to answer the research questions by building a predictive model that accurately estimates
house prices. To choose the best model, the performance of each model will be measured and
analyzed in this chapter.

5.1. Data Preprocessing


The data must first be cleansed before it can be used for prediction. Data preprocessing seeks to
improve the quality of the data by locating and removing flaws and inconsistencies. Data quality
issues frequently arise from misspellings made during data entry, missing values, duplicated
records, outliers, and data redundancy in various forms. There are no missing or duplicated
values in our dataset, but some outliers were detected in the previous section.

5.1.1. Data Encoding


A model's performance depends not only on the model and its parameter settings, but also on
how the features under study are represented and fed into the model. Encoding categorical data
is important because most models only accept numeric values, so these categorical values must
be transformed into numbers before fitting. There are seven categorical variables in our dataset,
which we encoded using the ordinal encoding preprocessing method.
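A minimal sketch of this encoding step is shown below. The column names follow the Kaggle
housing dataset, and sklearn's OrdinalEncoder is assumed as the implementation of the ordinal
encoding method.

from sklearn.preprocessing import OrdinalEncoder

# The seven categorical attributes of the housing dataset.
categorical_cols = ["mainroad", "guestroom", "basement", "hotwaterheating",
                    "airconditioning", "prefarea", "furnishingstatus"]

# Replace each category label with an integer code so the models can use it.
encoder = OrdinalEncoder()
df[categorical_cols] = encoder.fit_transform(df[categorical_cols])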

5.1.2. Data Transformation


Data transformation is an approach used to lessen the effect of outliers. Multiple linear
regression is one of the main techniques used in this project, and it assumes that the relationship
between the independent variables and the target variable is roughly linear, that the independent
variables are not greatly skewed (which could give individual points high leverage and
influence), and that the residuals are independent of one another.

A normal distribution is symmetric about the mean and describes how the data deviates from it.
The data here is not normally distributed, but it can be brought closer to normality by
transformation. A side-by-side comparison of the distributions before and after the
transformation is shown in Figure 11. The histograms of price and area display a positively
skewed distribution before min-max scaling.

Figure 11 Data Transformation
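A sketch of the min-max scaling step is given below, assuming the DataFrame df has already
been encoded. MinMaxScaler rescales every column, including price, to the [0, 1] range; this is a
minimal sketch of the transformation described above.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Rescale all columns to [0, 1] to lessen the effect of outliers and skew.
scaler = MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)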

5.2. Splitting the Data


The data must be divided into a training set and a test set before fitting a model. Each model will
be trained on the same training dataset and tested on the same test dataset, and their
effectiveness will be compared using RMSE and R-squared scores. For modeling, the data is
divided into two sections: the training dataset contains 80% of the whole dataset, and the test
dataset contains the remaining 20%.
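The split itself is a single sklearn call, sketched below. The fixed random_state is an assumption
added here so that every model sees the same partition.

from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])  # predictor variables
y = df["price"]                 # target variable

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)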

5.3. Optimum Hyperparameters


The sklearn library's built-in “random search” and “grid search cross validation” routines were
used to discover suitable hyperparameters for each model. These functions accept a set of
candidate values for the various hyperparameters, run different combinations of those values
against the model, and determine which combination produces the best result.

The top hyperparameter combinations were first identified using the random search algorithm.
Then, narrower ranges around the values discovered in the random search were thoroughly
investigated using the grid search method.
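The two-stage search is sketched below for ridge regression. The alpha ranges here are
illustrative assumptions, not the exact grids used in this thesis.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Stage 1: broad randomized search over candidate alpha values.
random_search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": np.logspace(-3, 3, 100)},
    n_iter=20, cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Stage 2: exhaustive grid search over a narrower range around the best value.
best_alpha = random_search.best_params_["alpha"]
grid_search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.linspace(best_alpha / 2, best_alpha * 2, 20)},
    cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)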

5.3.1. Hyperparameters – Linear regression (lasso and ridge)


In the initial random search study of the hyperparameters, a set of candidate alpha values was
used. The final alpha values after grid search cross validation are:

ML model          Alpha value
Lasso regression  0.02
Ridge regression  200

Table 8 Optimum parameter values for lasso and ridge regression

5.3.2. Hyperparameters – Random Forest and XgBoost


In the initial random search study of the hyperparameters, candidate values for n_estimators and
max_depth were used. The final hyperparameter values after grid search cross validation are:

ML model                  n_estimators  max_depth
Random forest regression  7             200
XgBoost regression        7             200

Table 9 Optimum parameter values for random forest and XgBoost regression

5.4. Models Building


The five regression models that were created and trained on the house price data are the Linear
Regression model, Ridge Regression model, Lasso Regression model, Random Forest
Regression model, and XgBoost Regression model. Pre-built methods from Python's sk-learn
library are used for all of the models. All models follow a similar set of steps: defining a
variable to store the model, fitting the training-set variables into the model, and then producing
predictions on the test set using the best model.

Figure 12 Fit models without hyperparameters

Figure 13 Fit models with hyperparameters
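As a hedged sketch of the code shown in Figures 12 and 13, the snippet below fits the five
models with the tuned values from Tables 8 and 9. It assumes the train/test split above and that
XGBRegressor is imported from the separate xgboost package, which follows the same
fit/predict interface as sklearn.

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from xgboost import XGBRegressor

# Hyperparameter values taken from Tables 8 and 9.
models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=0.02),
    "Ridge Regression": Ridge(alpha=200),
    "Random Forest": RandomForestRegressor(n_estimators=7, max_depth=200),
    "XgBoost": XGBRegressor(n_estimators=7, max_depth=200),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                # train on the 80% split
    predictions[name] = model.predict(X_test)  # predict on the 20% split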

5.5. Models Evaluation

5.5.1. Performance metrics


The “explained variance score”, “root mean square error” (RMSE), and “r2 score” metric
functions provided by the sklearn package in Python are used to assess our models. The
“explained variance score” should be above 0.60; a score between 0.60 and 1.0 is ideal. The next
evaluation metric is the “r2 score” (R-squared): a well-fitted model should have an “r2 score” of
at least 0.60, and ideally more than 0.70. RMSE values closer to 0 indicate that the model
predicts the data well.

5.5.2. Performance of Lasso Regression


                          Without hyperparameter tuning  With hyperparameter tuning
R-Square (R2)             -0.010380                      0.539491
Root Mean Square (RMSE)   0.294062                       0.198525
Explained Variance Score  -2.220446e-16                  0.542562

Table 10 Performance of Lasso Regression with and without hyperparameter tuning

5.5.3. Performance of Ridge Regression


                          Without hyperparameter tuning  With hyperparameter tuning
R-Square (R2)             0.671013                       0.671013
Root Mean Square (RMSE)   0.167798                       0.167798
Explained Variance Score  0.6716383                      0.671638

Table 11 Performance of Ridge Regression with and without hyperparameter tuning

5.5.4. Performance of Random Forest Regression


                          Without hyperparameter tuning  With hyperparameter tuning
R-Square (R2)             0.602049                       0.596942
Root Mean Square (RMSE)   0.184549                       0.185729
Explained Variance Score  0.6080452                      0.603260

Table 12 Performance of Random Forest Regression with and without hyperparameter tuning

5.5.5. Performance of XgBoost Regression


                          Without hyperparameter tuning  With hyperparameter tuning
R-Square (R2)             0.589104                       0.658599
Root Mean Square (RMSE)   0.187526                       0.170934
Explained Variance Score  0.5969059                      0.659352

Table 13 Performance of XgBoost Regression with and without hyperparameter tuning

5.5.6. Comparison of performances of the models


Predictive Models         R-Square (R2)  Root Mean Square (RMSE)  Explained Variance Score
Linear Regression         0.668176       0.168520                 0.668642
Ridge Regression          0.671013       0.167798                 0.671638
Lasso Regression          0.539491       0.198525                 0.542562
Random Forest Regression  0.596942       0.185729                 0.603260
XgBoost                   0.658599       0.170934                 0.659352

Table 14 Comparison between performances of the models

The table above shows that linear regression (OLS) and ridge regression have R-square and
explained variance scores greater than 0.6, indicating that these models fit the data well.
Comparing the two, ridge regression has a higher R-square and explained variance score than
linear regression, both 0.671, and its RMSE of 0.167 is closer to 0 than that of linear regression.
Lasso regression performs the worst: without hyperparameter tuning its R-square and explained
variance score are negative, and it has the highest RMSE among all the models.

5.5.7. Residual and Q-Q Plots
The difference between the target variable's actual and predicted values, also known as the
prediction error, is referred to as a residual in the context of regression models. By plotting the
residuals on the vertical axis against the dependent variable on the horizontal axis, the residuals
plot helps identify regions that may be more or less error-prone. The residuals plot is commonly
used to examine the variance of the regressor's errors: if the points scatter randomly about the
horizontal axis, a regression model is usually suitable for the data.

On a Q-Q plot, data that closely follows the center line indicates a normal distribution. If the
points deviate significantly from the line, the regression model needs to be adjusted by including
or excluding variables.
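A sketch of how these two diagnostic plots can be drawn for one fitted model is given below. It
assumes the ridge predictions from the earlier snippet and uses scipy's probplot for the Q-Q plot
against a normal distribution; as one common variant, the predicted values rather than the
dependent variable itself are placed on the horizontal axis of the residual plot.

import matplotlib.pyplot as plt
from scipy import stats

y_pred = predictions["Ridge Regression"]
residuals = y_test - y_pred  # prediction errors

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: errors should scatter randomly about the zero line.
ax1.scatter(y_pred, residuals, s=10)
ax1.axhline(0, color="red")
ax1.set_xlabel("Predicted price")
ax1.set_ylabel("Residual")
ax1.set_title("Residual plot")

# Q-Q plot: points on the line indicate normally distributed residuals.
stats.probplot(residuals, dist="norm", plot=ax2)
ax2.set_title("Q-Q plot")

plt.tight_layout()
plt.show()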

Figure 14 Residual and Q-Q plot for linear Regression

The plots above suggest that the model fits the data well: the residual plot shows points scattered
along the horizontal axis, and the points in the Q-Q plot form a straight line, which indicates an
approximately normal distribution.

Figure 15 Residual and Q-Q plot for ridge Regression

The plots for ridge regression likewise suggest that the model fits the data well: the residual plot
again shows points scattered along the horizontal axis, and the Q-Q plot shows a straight line for
both the train and test points, indicating an approximately normal distribution.

Figure 16 Residual and Q-Q plot for lasso Regression

In Figure 16, the residuals are not properly dispersed along the horizontal line, and the train and
test points do not form a straight line in the Q-Q plot, meaning the residuals are not normally
distributed. These observations indicate that this model is not fit to predict the prices of the
houses.

Figure 17 Residual and Q-Q plot for Random Forest Regression

The residual and Q-Q plots for Random Forest regression are somewhat better, but the model is
still not fit to predict house prices accurately.

Figure 18 Residual and Q-Q plot for XgBoost Regression

The residual and Q-Q plots for XgBoost regression are likewise somewhat better, but the model
is still not fit to predict house prices accurately.

Chapter 6

Conclusion and Future Directions


This chapter presents the conclusions drawn from the housing price data, based on the research
questions, the objectives, and the study outcomes. After summarizing the implications of the
thesis's findings, the chapter concludes with the study's limitations and possibilities for further
research.

6.1. Conclusion
The objective of this quantitative study was to create a machine learning model using data on
changes in housing prices. The study used Housing.csv, secondary data from the Kaggle
Repository, which contained no null values. The study took advantage of machine learning
techniques to create a predictive price model for future housing price fluctuations. The data was
gathered from the Kaggle website, an online database, for cleaning, preprocessing, and analysis.

In this work, a predictive model was built that predicts house selling prices using machine
learning methods. This was achieved by selecting a limited number of attributes from publicly
available online data. The effect of house area on the sale price was also investigated. The study
found that the important price factors include total house area, number of bathrooms, number of
bedrooms, and air conditioning. The results also show that the size of a home impacts price more
than the number of bathrooms does. Family size is a major consideration for anyone purchasing
a home to live in; one would want to choose a home that suits their family size. The thesis also
demonstrated how crucial it is for the predictive system to receive informative input in order to
make precise price predictions. Simple, readily available data was used to create a simple and
accurate model.

The thesis used five machine learning methods, namely Linear Regression, Ridge Regression,
Lasso Regression, Random Forest, and Extreme Gradient Boosting (XgBoost) Regression, to
develop a model that can predict house prices. Root mean square error (RMSE), R-square, and
explained variance score metrics were used to evaluate model performance. To achieve the goal
of the study, these five models were fitted to the house price dataset. The evaluation showed that
linear regression and ridge regression fit the data best and can predict house prices more
accurately than the other three algorithms. Between these two, ridge regression is better than
linear regression, with an R-square value of 0.671 and an RMSE closer to 0. Among all five,
lasso regression performed the worst, with a negative R-squared value and explained variance
score without tuning, and also the largest RMSE compared to the others.

The model helps customers and potential real estate investors decide when is the best time to
invest in the sector or to conduct business. According to the study, ridge regression provided the
best price estimation model for the real estate industry.

6.2. Limitations
The thesis's limitations are primarily attributable to the data it uses and its methodological
approach. Our study aims to use simplified machine learning models that are easy to execute to
predict house price growth in the housing market.

Different linear regression models may perform in different ways depending on the dataset used
to train them. It is important to evaluate each model's advantages and disadvantages and then
apply it appropriately. Models with a propensity for overfitting must be identified, and more
effective data transformation techniques must be used to balance bias and variance.

Large datasets are preferred for statistical challenges, as they enhance the quality of the study.
The data's inconsistency is a major factor in determining the accuracy of the predictions. Finding
a reliable repository with a large number of features in a normally distributed dataset is the key
challenge.

Optimizing the model's parameters can drastically decrease the frequency of errors. This calls
for extensive experience with various datasets and a variety of model stacks. In each model, the
behavior of the learning rate, the number of leaves, and other optimizations may vary.

6.3. Future Work


To further enhance the results, future work on this study could be organized into the following
key areas:

 The pre-processing techniques employed significantly improve the precision of the
predictions. To improve predictive performance, several pre-processing techniques could
be combined in different ways.
 Utilize the existing features and see whether they can be combined as binned features,
as binning has been demonstrated to improve results.
 Use a variety of regression techniques for data training and model building, such as
elastic net regression, which combines the "L1" and "L2" norms, to broaden the
comparisons and evaluate their effectiveness.
 The correlation analysis revealed the relationships among the variables of the dataset.
A data enhancement effort is necessary to enrich the dataset with attributes that have a
strong correlation with price, since the variables examined in this thesis have only a
marginal relationship to the house price. Consequently, additional variables that
influence house prices, such as economic growth, income, and population, should be
added to the dataset to broaden the range of variables that affect house prices. This
might also result in more accurate answers to research question 1.

References

Alfiyatin, A. N., Febrita, R. E., Taufiq, H., & Mahmudy, W. F. (2017). Modeling house price
prediction using regression analysis and particle swarm optimization case study: Malang,
East Java, Indonesia. International Journal of Advanced Computer Science and
Applications, 8(10).

Choong, W. C. (2018). Statistical Analysis Of Housing Prices In Petaling District Using Linear


Functional Model (Doctoral dissertation, UTAR).

De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression
project. Journal of Statistics Education, 19(3).

Ebekozien, A., Abdul-Aziz, A. R., & Jaafar, M. (2019). Housing finance inaccessibility for low-
income earners in Malaysia: Factors and solutions. Habitat International, 87, 27-35.

Fan, C., Cui, Z., & Zhong, X. (2018, February). House prices prediction with machine learning
algorithms. In Proceedings of the 2018 10th International Conference on Machine
Learning and Computing (pp. 6-10).

Fernando, J. (2022, February 8). What is R-squared? Investopedia. Retrieved August 16, 2022,
from https://www.investopedia.com/terms/r/r-squared.asp

Glaeser, E., Huang, W., Ma, Y., & Shleifer, A. (2017). A real estate boom with Chinese
characteristics. Journal of Economic Perspectives, 31(1), 93-116.

Ho, W. K., Tang, B. S., & Wong, S. W. (2021). Predicting property prices with machine learning
algorithms. Journal of Property Research, 38(1), 48-70.

Housing Prices Dataset. (2022, January 12). Kaggle. Retrieved September 6, 2022, from
https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

Jafari, A., & Akhavian, R. (2019). Driving forces for the US residential housing price: a
predictive analysis. Built Environment Project and Asset Management, 9(4), 515-529.

Khanum, F., P. M., N., Pawar, N., D., V., & Anitha, R. (2021). Real Estate House Price
Prediction using Machine Learning. International Journal of Engineering Science and
Computing, 11(7), 28452–284524.

Louppe, G. (2014). Understanding random forests: From theory to practice. arXiv preprint


arXiv:1407.7502.

Madhuri, C. R., Anuradha, G., & Pujitha, M. V. (2019, March). House price prediction using
regression techniques: a comparative study. In 2019 International conference on smart
structures and systems (ICSSS) (pp. 1-5). IEEE.

Phan, T. D. (2018, December). Housing price prediction using machine learning algorithms: The
case of Melbourne city, Australia. In 2018 International conference on machine learning
and data engineering (iCMLDE) (pp. 35-42). IEEE.

Sharma, A., Sonawale, P., Ghonasgi, D., & Patankar, S. (2022, May). House price prediction
forecasting and recommendation system using machine learning. International Research
Journal of Engineering and Technology, 7(5), 1540-1550.

Simon, A., & Singh, M. (2015). An overview of machine learning and its applications.
International Journal of Electrical Sciences & Engineering (IJESE), 22.

Sivasankar, B., Ashok, P. A., N., Madhu, G., & S. F. (2020, July). House Price Prediction.
International Journal of Engineering Science and Computing, 8(7), 2347-2693

Temur, A. S., Akgün, M., & Temur, G. (2019). Predicting housing sales in Turkey using
ARIMA, LSTM and hybrid models.

Thamarai, M., & Malarvizhi, S. P. (2020). House Price Prediction Modeling Using Machine
Learning. International Journal of Information Engineering & Electronic Business, 12(2).

Varma, A., Sarma, A., Doshi, S., & Nair, R. (2018, April). House price prediction using machine
learning and neural networks. In 2018 second international conference on inventive
communication and computational technologies (ICICCT) (pp. 1936-1939). IEEE.

Vignesh, M., Vijay, V., Krishna, S., & Sathyamoorthy, K. House price prediction using machine
learning by random forest algorithm.

Wang, C., & Wu, H. (2018). A new machine learning approach to house price estimation. New
Trends in Mathematical Sciences, 6(4), 165-171.

Wikimedia Foundation. (2022, August 11). Linear regression. Wikipedia. Retrieved August 16,
2022, from https://en.wikipedia.org/wiki/Linear_regression

Zhou, Y. (2020). Housing sale price prediction using machine learning algorithms (thesis).

Appendix

Appendix A – Data

Figure 19 Sample of First ten rows of Dataset

Figure 20 Sample of Last ten rows of Dataset

Appendix B – Project Specification
