
University of Greenwich

Research paper
_________________________________________________________
HOUSE PRICE PREDICTION BASED ON LINEAR REGRESSION
MODEL AND MAP VISUALIZATION

Supervisors:
N. Xuan Sam (SamNX2@fe.edu.vn)

Student name: Tran Trung Tien


Student ID: GCS200400
Class: Business intelligence

ABSTRACT
This paper focuses on formulating a feasible method for house price prediction. A dataset
containing features and house prices for King County in the US is used. During data
preprocessing, extreme values are capped and highly correlated features are removed. It is
found that current mainstream models study only the trend of the housing price index itself
and are not sensitive to the characteristics of the house. Therefore, this paper uses a multiple
regression model to integrate the advantages of external factors and uses the improved housing
price composition to establish a multiple regression prediction model with particle swarm
optimization. This not only compensates for the poorly determined housing price regression
indicators and the lack of statistical data in multiple regression prediction, but also enables
the model to reflect the inflection point of housing prices in advance. Location, living space, and
condition of the house are the most important features influencing house price. After comparison
and contrast with other papers, the findings in this paper are shown to conform to real life. This
paper formulates a model that fits better than preceding studies for house price prediction and
makes a necessary supplement to the exploration of features that influence house price from a
microscopic perspective.
Key words: Linear regression model (LRM), Price prediction (PP), Map visualization, King
County, Washington house prices
Table of Contents
1. INTRODUCTION
2. RELATED WORKS
   2.1 Dataset
   2.2 Regression models
   2.3 Python and libraries
      2.3.1 Python
      2.3.2 Python libraries and packages
3. SIMULATION SETUPS AND RESULT EVALUATIONS
   3.1 Simulation setups
      3.1.1 Correlation between variables
      3.1.2 Distribution of target variable and pairplot
      3.1.3 Visualizing data on box plots
      3.1.4 Scatter plot between Price and other variables to view the relationship between them
      3.1.5 How seasonality can affect house prices
   3.2 Linear regression model
      3.2.1 Linear regression model
      3.2.2 Important features for determining the house price
   3.3 Result evaluations
4. MAP VISUALIZATION
5. CONCLUSIONS AND FUTURE RESEARCH
   5.1 Conclusions
   5.2 Further discussion and future research
6. EXPECTED OUTCOMES
REFERENCES
Table of Figures:
Figure 1: House price prediction
Figure 2: Check if the dataset contains null value or not
Figure 3: Check the datatype of all variables in dataset
Figure 4: The distribution of numerical feature values across the samples
Figure 5: The summary of the methodology
Figure 6: Graphical Display of Linear Regression Assumptions
Figure 7: Python for data science
Figure 8: Pearson heatmap correlation
Figure 9: Ranking for all other variables correlation with price
Figure 10: Distribution of target variable price
Figure 11: Histogram
Figure 12: Pairplot
Figure 13: Box plot definition
Figure 14: Bedrooms and floors box plots
Figure 15: Waterfront, view, grade box plots
Figure 16: Bathrooms and condition box plots
Figure 17: Scatter plot demonstrating the association between other variables and price
Figure 18: Years and months box plots
Figure 19: Trend line of the month and price
Figure 20: Average price by months
Figure 21: Average price by seasons
Figure 22: OLS regression result
Figure 23: Linear regression model
Figure 24: Features ranking
Figure 25: Scatter plot of King County shape
Figure 26: Scatter plot with longitude and latitude of King County
Figure 27: House price with map visualization
Figure 28: Highest and lowest house price with heatmap
Figure 29: Average price by each zipcode (area)
Figure 30: Average price by direction
Figure 31: Scatter plot house price by direction
Table of Tables
Table 1: Dataset description
House Price Prediction |1

1. INTRODUCTION
Owning a house is one of the biggest dreams of the majority of people and one of the largest, most
expensive purchases in a person’s life. However, it is often difficult to determine or predict housing
prices. In fact, the trend of house prices is always a controversial topic, as their fluctuation has
a huge effect on the entire economy. A rise in house prices means growth in non-financial assets,
which ultimately increases personal wealth, stimulating household consumption and boosting the
economy; a decrease in house prices, however, limits an individual's borrowing capacity, crowding
out investment due to the evaporation of the value of collateral (Cooper and Statistics, 2013).
The shock in the global economy caused by the 2008 housing bubble perfectly explains the
importance of a stable and measurable house price. The turmoil in house prices causes an
unexpected rise in real long-term interest rates, bankruptcy in financial institutions and global
economic depression (Sinai et al., 2005). Although it is hard to control the house price, it is possible
to predict it.

Figure 1: House price prediction

Many scholars have conducted research on this issue. For instance, Hirata et al. have used time-
series models to determine that house prices have become more synchronized over time and the
FAVAR model to find out that global interest rate shock has the most considerable influence on
global house price, especially in the US (Kose et al., 2013). Shishir Mathur has provided insight
from a micro perspective in his report, stating that quality and size are two factors contributing to
house price (Mathur and Society, 2019). In his opinion, this assumption can be explained through
the perceived value of the house. The property assessors will evaluate the size and quality of the
house during value assessment processes for house reselling, which will determine the value of
the house. Property developers also take houses’ quality and size into consideration while they
initially design the project and pricing for the property. A bigger size and better quality will bring
a higher perceived value to both assessors, developers, and buyers. In Shishir Mathur’s report, he
also mentioned another contributor – the level of maintenance. With the increased investment in
refurbishing before offering for sale, the house owners will expect a higher dealing price due to
their value addition through the maintenance. Although prior studies make valid analyses, they
fail to discuss the simultaneous effect of those factors.
Some factors may contribute more to the results than others. Besides, their conclusions are based
on theoretical knowledge and lack practical proof. This paper goes beyond previous economic
analysis and uses linear regression as well as map visualization to explore the country-wide house
price. This paper also assumes that the two factors, size and quality mentioned by Shishir Mathur,
will affect the house price, but this paper will use linear regression to prove the relationship. Other
than these two factors, some other determinants of house price, including location, size, overall
structure of the house, and grading from the agency, will also be evaluated. The most obvious
advantages of linear regression and map visualization are that they can automatically solve a wide
range of problems and efficiently handle big datasets (Few and Edge, 2009).
to prove the assumptions through analyzing a huge amount of historical data and taking multiple
factors into consideration to present a comprehensive model efficiently. The research studies the
house price in King County, US, during a 2-year period from 2014 to 2015. According to the data
gathered by Washington Government, King County has the highest estimated population of
2,052,800 in 2015 among all counties in Washington (Wang and Zhao, 2022). With a higher
population, King County has more potential house buyers and higher house demands; thus, house
price data in King County will be more complete and more precise. This paper employs linear
regression and map visualization with the folium package to identify the important influencers of the
house price. The linear regression model will be selected through training and testing, which will
allow us to have the most accurate result. The results will conclude important micro features in
determining the house price, including the location, size, and gradings.
Housing prices in King County, WA have been exploding over the past decade due to many factors
including the availability of great jobs, a rich culture, and a plethora of outdoor recreation
opportunities in the surrounding areas. It is now harder than ever for house hunters to find a bargain
in what seems like a perpetual seller's market. My goal in this project is to determine where and
what types of houses are still available at decent prices and to develop a multiple linear regression
model capable of giving the patient investor an indication of whether he may be looking at a
bargain. Meeting this goal will require careful data cleaning, exploratory data analysis capable of
answering questions relating to our goal, thoughtful feature engineering, and finally an iterative
approach to multiple regression modelling. The King County dataset contains information for more
than 21,000 houses sold during 2014 and 2015 in King County. For every house, there are over 20
attributes ranging from location, living square footage, and grade to renovation year and build year.
This will provide a guide for future house price prediction to not only consider the macro effects
but also think about the micro factors. The rest of the paper is organized as follows: section 2
introduces the source of the data and the regression methodology; section 3 presents the simulation
setup and evaluates the results; section 4 covers the map visualization; finally, the conclusions and
future research are given in section 5.
Objectives
I. Overview of Data
II. Data pre-processing
III. Data visualization and pattern discovery
IV. Predictive Modeling
V. Model implementation and evaluation results


VI. Conclusion and plan for future works

2. RELATED WORKS
2.1 Dataset:
We utilize house data from King County in the US state of Washington as an example to understand
the domain situation. We initially selected 21,613 records containing a total of 21 features from the
Kaggle dataset "House Sales in King County, USA". King County's population was approximately
2,117,125 in July 2015, making it the most populous county in Washington. There were 893,157
housing units as of July 2015, and the median income was $81,916 (Murray, 2016). From the table
shown below, we can see that the independent (explanatory) variables in the housing dataset are
date, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade,
sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, and
sqft_lot15. These include categorical, numerical, and time-series variables. The dependent variable
is the sale price of houses sold between May 2014 and May 2015 in King County, which includes
Seattle.

Table 1: Dataset description

Variable Name | Description | Sample Data
--- | --- | ---
id | Unique identifier for a house | 7129300520; ...
date | Date the house was sold (format yyyymmdd, with a time suffix) | 20141013T000000; 20141209T000000; ...
price | Sale price (the prediction target) | 221900; 538000; ...
bedrooms | Number of bedrooms per house | 3; 2; ...
bathrooms | Number of bathrooms per house (.5 accounts for a room with a toilet but no shower) | 1; 2.25; ...
sqft_living | Square footage of the home's interior living space (size of the living area in ft²) | 1180; 2570; ...
sqft_lot | Square footage of the land space (lot size in ft²) | 5650; 7242; ...
floors | Total floors (levels) in the house | 1; 2; ...
waterfront | Dummy variable for whether the property overlooks the waterfront ('1' if on the waterfront, '0' if not) | 0; 1; ...
view | Index from 0 to 4 of how good the view of the property is (imagine 0 for a view of a dirty alley and 4 for a view of a beautiful park) | 0; 2; ...
condition | How good the overall condition is, from 1 to 5 (1 = bad; 5 = perfect) | 1; 5; ...
grade | Index from 1 to 13, where 1-3 falls short of building construction and design, 7 is an average level, and 11-13 is a high-quality level (a classification by the quality of the house's materials; buildings with better materials usually cost more) | 7; 6; ...
sqft_above | Square footage of the interior housing space above ground level (ft²) | 1180; 2170; ...
sqft_basement | Square footage of the interior housing space below ground level (ft²) | 0; 400; ...
yr_built | Year the house was initially built | 1955; 1951; ...
yr_renovated | Year of the house's last renovation ('0' if never renovated) | 0; 1991; ...
zipcode | 5-digit zip code | 98178; 98125; ...
lat | Latitude coordinate | 47.5112; 47.721; ...
long | Longitude coordinate | -122.257; -122.319; ...
sqft_living15 | Square footage of interior housing living space for the nearest 15 neighbors | 1340; 1690; ...
sqft_lot15 | Square footage of the land lots of the nearest 15 neighbors | 5650; 7639; ...

Analyze by describing data

Pandas also helps describe the dataset, answering the following questions early in our project.
Which features are categorical?
These values classify the samples into sets of similar samples. Within categorical features, are the
values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the
appropriate plots for visualization.
• Nominal variables: a categorical variable which has no order.
• Ordinal variables: a categorical variable whose values can be logically ordered or ranked
(David, 2014).
Therefore, we classify the variables into the categories below:
• Nominal variables: ['lat', 'long', 'zipcode']
• Ordinal variables: ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade']
Which features are numerical?
These values change from sample to sample. Within numerical features, are the values discrete,
continuous, or time-series based? Among other things, this helps us select the appropriate plots for
visualization.
• Continuous variables: a numeric variable that takes any value within a certain range of real
numbers.
• Discrete variables: a numeric variable that can only take distinct and separate values
(David, 2014).
Thus, we classify the variables into the categories below:
• Continuous: ['price', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15',
'sqft_lot15']
• Discrete: ['yr_built', 'yr_renovated']
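For later reuse, the four groupings above can be collected into plain Python lists (a small convenience sketch; the names match the dataset's columns):

```python
# Column groupings following the classification above
# (names match the King County dataset columns).
nominal = ['lat', 'long', 'zipcode']
ordinal = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view',
           'condition', 'grade']
continuous = ['price', 'sqft_living', 'sqft_lot', 'sqft_above',
              'sqft_basement', 'sqft_living15', 'sqft_lot15']
discrete = ['yr_built', 'yr_renovated']

# Combined views used when choosing plots for visualization.
categorical = nominal + ordinal
numerical = continuous + discrete
print(len(categorical), len(numerical))  # 10 9
```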
Which features contain blank, null or empty values?
We can check for missing values with pandas isnull() (VanderPlas, 2016), which indicates whether
each value is missing. We can then sum the results to check every column. The DataFrame has
21613 rows and 21 columns, and there are no missing values in any column.
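A minimal sketch of this check, using a tiny synthetic DataFrame in place of the real dataset (with the actual data one would first read the Kaggle CSV file; the file name below is only an assumption):

```python
import pandas as pd

# A small synthetic DataFrame standing in for the King County data;
# with the real dataset this would start from pd.read_csv("kc_house_data.csv").
df = pd.DataFrame({
    "price": [221900, 538000, 180000],
    "bedrooms": [3, 2, 2],
    "sqft_living": [1180, 2570, 770],
})

# isnull() flags each cell as missing or not; the first sum() counts
# missing values per column, the second across the whole frame.
missing_per_column = df.isnull().sum()
total_missing = missing_per_column.sum()
print(total_missing)  # 0 for this complete frame
```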

Figure 2: Check if the dataset contains null value or not

What are the data types of the various features, and what is the distribution of numerical feature
values across the samples?
• Five features are floats, fifteen are integers, and one is an object.

Figure 3: Check the datatype of all variables in dataset


We can utilize the describe() method to return a description of the data in the DataFrame. For
numerical columns, the description contains the following information:
• count - the number of non-empty values
• mean - the average (mean) value
• std - the standard deviation
• min - the minimum value
• 25% - the 25th percentile*
• 50% - the 50th percentile*
• 75% - the 75th percentile*
• max - the maximum value
*A percentile gives the value below which the stated percentage of observations falls (VanderPlas, 2016).
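The describe() output can be illustrated on a small synthetic series (the values here are made up, chosen so the summary statistics are easy to read, not taken from the dataset):

```python
import pandas as pd

# Synthetic values standing in for the dataset's 'price' column.
prices = pd.Series([100, 200, 300, 400, 500])

summary = prices.describe()

# With these five values: count=5, mean=300, min=100,
# 25%=200, 50%=300, 75%=400, max=500.
print(summary)
```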

Figure 4: The distribution of numerical feature values across the samples

The block diagram given below is the summary of the methodology followed in the paper.

Figure 5: The summary of the methodology


Assumptions based on data analysis
We arrive at the following assumptions based on the data analysis done so far. We may validate
these assumptions further before taking appropriate actions.
• Correlating: we want to know how well each feature correlates with Price. We do this
early in the project and match these quick correlations with modelled correlations later.
• Completing: since there are no missing values, we do not need to complete any values.
• Correcting: the Id feature may be dropped from our analysis since it does not add value.
The Date feature may be dropped after feature engineering, once year and month columns
have been created.
• Creating: we may want to create new Year and Month features based on Date to analyze
the price change across years and across months.
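The Creating and Correcting steps can be sketched as follows; the DataFrame below holds two made-up rows in the dataset's date format:

```python
import pandas as pd

# Two made-up rows in the dataset's date format (yyyymmddT000000).
df = pd.DataFrame({
    "date": ["20141013T000000", "20150509T000000"],
    "price": [221900, 538000],
})

# Parse the raw date strings, derive Year and Month features (Creating),
# then drop the original date column (Correcting).
parsed = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")
df["year"] = parsed.dt.year
df["month"] = parsed.dt.month
df = df.drop(columns=["date"])
print(df)
```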

2.2 Regression models:


When the relationship between the dependent variable and the independent variable is linear, the
technique is simple linear regression. Simple linear regression is a statistical method you can use
to study relationships between two continuous (quantitative) variables:
1. independent variable (x) – also referred to as predictor or explanatory variable
2. dependent variable (y) – also referred to as response or outcome
The goal of any regression model is to predict the value of y (the dependent variable) based on the
value of x (the independent variable). In the case of linear regression, we use past relationships
between x and y (which we call our training data) to find a linear equation of the form y = A + Bx
and then use this equation to make predictions. A linear regression equation takes the same form as
the equation of a line, where 'x' is the independent variable and 'y' is the dependent variable (the
predicted price). The letters 'A' and 'B' represent constants that describe the y-axis intercept and the
slope of the line. To find the equation of the line, we use the formulas below to obtain A and B.
Using training data to learn the parameter values that produce the best-fitting model is called
ordinary least squares, or linear least squares (Massaron and Boschetti, 2016).

$$A = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \qquad (1)$$

$$B = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \qquad (2)$$
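Equations (1) and (2) translate directly into code. The sketch below uses made-up points that lie exactly on the line y = 2 + 3x, so the formulas should recover A = 2 and B = 3:

```python
# Intercept A and slope B from Equations (1) and (2), computed on
# illustrative points lying exactly on the line y = 2 + 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

denom = n * sum_x2 - sum_x ** 2
A = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # Equation (1)
B = (n * sum_xy - sum_x * sum_y) / denom       # Equation (2)
print(A, B)  # 2.0 3.0
```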

The objective of simple linear regression (which we shall call regression analysis) is to represent
the relationship between values of x and y with a model of the form shown in the equation below.

Simple Linear Regression Model (Population Model):

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (3)$$

where:

o y = Value of the dependent variable
o x = Value of the independent variable
o β₀ = Population's y intercept
o β₁ = Slope of the population regression line
o ε = Random error term

The simple linear regression population model described in Equation 3 has four assumptions:

1. Individual values of the error terms, ε, are statistically independent of one another, and
these values represent a random sample from the population of possible ε values at each
level of x.
2. For a given value of x, there can exist many values of y and therefore many values of ε.
Further, the distribution of possible ε values for any x value is normal.
3. The distributions of possible ε values have equal variances for all values of x.
4. The means of the dependent variable, y, for all specified values of the independent variable,
μ_y|x, can be connected by a straight line called the population regression model (David,
2014).

Figure 6: Graphical Display of Linear Regression Assumptions

The figure above illustrates assumptions 2, 3, and 4. The regression model (straight line) connects
the average of the y values for each level of the independent variable, x. The actual y values for
each level of x are normally distributed around the mean of y. Finally, observe that the spread of
possible y values is the same regardless of the level of x. The population regression line is
determined by two values, β₀ and β₁. These values are known as the population regression
coefficients. β₀ identifies the y intercept and β₁ the slope of the regression line. Under the
regression assumptions, the coefficients define the true population model. For each observation,
the actual value of the dependent variable, y, for any x is the sum of two components:
$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{Linear component}} + \underbrace{\varepsilon_i}_{\text{Random error component}}$$

The random error component, ε, may be positive, zero, or negative, depending on whether a single
value of y for a given x falls above, on, or below the population regression line (David, 2014).
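A small simulation (with made-up coefficients) illustrates the population model: when the error terms are independent, normal, and of equal variance, a least-squares fit recovers β₀ and β₁ closely.

```python
import numpy as np

# Simulate the population model with illustrative coefficients; the
# errors are independent, normal, equal-variance, matching the
# four regression assumptions.
rng = np.random.default_rng(42)
beta0_true, beta1_true = 50.0, 4.0

x = rng.uniform(0, 10, size=5000)
eps = rng.normal(0, 2.0, size=5000)    # random error component
y = beta0_true + beta1_true * x + eps  # linear component + error

# Least-squares fit; np.polyfit returns [slope, intercept] for degree 1.
beta1_hat, beta0_hat = np.polyfit(x, y, 1)
print(beta0_hat, beta1_hat)  # close to 50 and 4
```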

2.3 Python and libraries:


2.3.1 Python:
Given the availability of many useful packages for creating linear models, and given that it is a
programming language quite popular among developers, Python is my language of choice for all
the code presented in this research project. Created in 1991 as a general-purpose, interpreted,
object-oriented language, Python has slowly and steadily conquered the scientific community and
grown into a mature ecosystem of specialized packages for data processing and analysis. It allows
countless fast experiments, easy theory development, and prompt deployment of scientific
applications (Massaron and Boschetti, 2016).

Figure 7: Python for data science

As a developer, I have found Python interesting for various reasons:

• It offers a large, mature system of packages for data analysis and machine learning, which
guarantees that I will get everything I need in the course of a data analysis, and sometimes
even more.
• Although interpreted, it is undoubtedly fast compared to other mainstream data analysis
languages such as R and MATLAB (though not comparable to C, Java, or the newly
emerged Julia language).
• There are packages that allow me to call other platforms, such as R and Julia, outsourcing
some of the computations to them and improving script performance. Moreover, there are
static compilers such as Cython and just-in-time compilers such as PyPy that can transform
Python code into C for higher performance.
• It can work better than other platforms with in-memory data because of its minimal
memory footprint and excellent memory management. The garbage collector will often
save the day when I load, transform, dice, slice, save, or discard data during the various
iterations and reiterations of data wrangling.
2.3.2 Python libraries and packages:
Linear models diffuse in many different scientific and business applications and can be found,
under different functions, in quite a number of different Python packages. I have selected a few
for use in this research project including NumPy, Pandas, Seaborn, Matplotlib, Statsmodels,
Folium. Among them, Statsmodels is my choice for illustrating the statistical properties of models,
and Scikit-learn is instead the package we recommend for easily and seamlessly preparing data,
building models, and deploying them. We will present models built with Statsmodels exclusively
to illustrate the statistical properties of the linear models, resorting to Scikit-learn to demonstrate
how to approach modeling from a data science point of view. I also take full advantages of folium
package to implement the map visualization for this project.
NumPy: NumPy, Travis Oliphant's creation, is at the core of every analytical
solution in the Python language. It provides the user with multidimensional arrays, along
with a large set of functions for performing mathematical operations on these arrays.
Arrays are blocks of data arranged along multiple dimensions that implement
mathematical vectors and matrices. Arrays are useful not just for storing data, but also for
fast matrix operations (vectorization), which are indispensable when you wish to solve ad
hoc data science problems (Massaron and Boschetti, 2016).
Statsmodels: previously part of Scikit, Statsmodels was conceived as a complement
to SciPy's statistical functions. It features generalized linear models, discrete choice models,
time series analysis, and a series of descriptive statistics as well as parametric and
nonparametric tests. In Statsmodels, we will use the statsmodels.api and
statsmodels.formula.api modules, which provide functions for fitting linear models from
both input matrices and formula specifications (Massaron and Boschetti, 2016).
Matplotlib: Matplotlib is one of the most popular Python packages for data
visualization. It is a cross-platform library for making 2D plots from data in arrays. It
provides an object-oriented API that helps embed plots in applications using Python
GUI toolkits such as PyQt, WxPython, or Tkinter. It can also be used in Python and IPython
shells, Jupyter notebooks, and web application servers (Nelli, 2015).
Pandas: Pandas is an open-source Python library for highly specialized data analysis.
It is currently the reference point for all professionals who use the Python language to
study and analyze data sets for statistical analysis and decision making. The library was
designed and developed primarily by Wes McKinney starting in 2008; later, in 2012, Sien
Chang, one of his colleagues, joined the development. Pandas arises from the need for a
specific library that provides, in the simplest possible way, all the instruments for data
processing, extraction, and manipulation. The package is designed on the basis of the
NumPy library, and its main purpose is to provide the building blocks for anyone
approaching the world of data analysis (Nelli, 2015).
Seaborn: Seaborn is a library built on top of Matplotlib. It makes visualizations prettier
and covers some common data visualization needs (like mapping a color to a variable or
using faceting). Seaborn is also more tightly integrated with Pandas DataFrames
(Nelli, 2015).
H o u s e P r i c e P r e d i c t i o n | 12

Folium: Folium creates Leaflet web maps from Python. Saving a map generates a map.html
file; later, you can simply put that HTML file on a live server and have the map online
(Pandey et al., 2020).
Here is my source code to import all the packages and libraries needed for this project:

# Step 1: Import all packages and libraries in Python needed for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from scipy import stats

# These packages and libraries are for map visualization
import folium
from folium.plugins import HeatMap
from folium.plugins import FloatImage

3. SIMULATION SETUPS AND RESULT EVALUATIONS:


3.1 Simulation setups:
3.1.1 Correlation between variables:
In addition to analyzing the relationship between two variables graphically, we can measure the
strength of the linear relationship between two variables using a measure called the correlation
coefficient (David, 2014).

𝑟 = ∑(𝑥 − 𝑥̅)(𝑦 − 𝑦̅) / √{[∑(𝑥 − 𝑥̅)²][∑(𝑦 − 𝑦̅)²]}    (4)

or the algebraic equivalent:

𝑟 = [𝑛(∑𝑥𝑦) − (∑𝑥)(∑𝑦)] / √{[𝑛(∑𝑥²) − (∑𝑥)²][𝑛(∑𝑦²) − (∑𝑦)²]}    (5)
where:
o r = Sample correlation coefficient
o n = Sample size
o x = Value of the independent variable
o y = Value of the dependent variable
The sample correlation coefficient computed using Equations 4 and 5 is called the Pearson product
moment correlation. The sample correlation coefficient, r, can range from a perfect positive
correlation, +1.0, to a perfect negative correlation, -1.0. A perfect correlation occurs if all points
on the scatter plot fall on a straight line. If two variables have no linear relationship, the correlation
between them is 0. Consequently,
the more the correlation differs from 0.0, the stronger the linear relationship between the two
variables. The sign of the correlation coefficient indicates the direction of the relationship. Once
again, for the correlation coefficient to equal plus or minus 1.0, all the (x, y) points must form a
perfectly straight line. The more the points depart from a straight line, the weaker (closer to 0.0)
the correlation is between the two variables. (David, 2014)
The sign of the coefficient indicates the direction of the relationship. If both variables tend to
increase or decrease together, the coefficient is positive, and the line that represents the correlation
slopes upward. If one variable tends to increase as the other decreases, the coefficient is negative,
and the line that represents the correlation slopes downward. (Soffritti et al., 2011). We can see
there are no missing values in the data and now we will find correlation of various variables with
one another. We will use Pearson correlation coefficient. Correlation is a quantitative assessment
that measures both the direction and the strength of this tendency to vary together. A number closer
to 1 indicates strong correlation while a number of closer to 0 indicates weak correlation. A
positive coefficient means proportional relation, i.e. The variable 2 moves in the same direction as
variable 1, with which we are comparing. Apart from seeing it numerically, we can also see it
visually with the help of matplotlib and seaborn libraries in Python.
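The correlation heatmap in the next figure can be produced with a few lines of pandas and seaborn. The sketch below is a minimal, self-contained version using a tiny synthetic stand-in for a few King County columns (all values are hypothetical):

```python
import pandas as pd

# Synthetic stand-in for a few King County columns (hypothetical values)
df = pd.DataFrame({
    "price": [221900, 538000, 180000, 604000, 510000],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "bedrooms": [3, 3, 2, 4, 3],
})

# Pearson correlation matrix (pandas uses Pearson by default)
corr = df.corr(method="pearson")
print(corr.round(2))

# The same matrix can be rendered as an annotated heatmap:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```

On the full dataset, the same two calls produce the matrix behind Figure 8.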
Figure 8: Pearson heatmap correlation

Price correlation
 This allows us to explore which features are highly correlated with the price. According to the
correlation matrix (Figure 8), “sqft_living15” has a high correlation with “bathrooms”
(0.57), “sqft_living” (0.76), “grade” (0.71), and “sqft_above” (0.73).
Which features are more correlated to the price?
Figure 9: Ranking for all other variables correlation with price

Price feature
 Most of the house prices are between $0 and $1,500,000.
 The average house price is $540,000.
 Keep in mind that it may be a good idea to drop extreme values. For instance, we could
focus on houses from $0 to $3,000,000 and drop the other ones.
 It seems that there is a positive linear relationship between the price and sqft_living.
 An increase in living space generally corresponds to an increase in house price.
3.1.2 Distribution of target variable and pairplot:

Figure 10: Distribution of target variable price

 Skewness: 4.024069
 Kurtosis: 34.585540
Skewness is the degree of distortion from the symmetrical bell curve of the normal distribution.
The distribution above shows positive skewness, i.e., the peak of the distribution lies to the left
of the mean. This may be an indication that many houses are sold at less than the average price.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. The high kurtosis here (34.59) may be caused by outliers present in the data
(Groeneveld and Meeden, 1984; Bonato et al., 2022).
The distributions of the continuous columns are skewed, which means they need to be normalized.
To improve the linear regression model's performance, the distribution of each continuous column
is normalized by taking its logarithm.
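A minimal sketch of this log-normalization step, using np.log1p on a synthetic stand-in for the skewed columns (column names follow the King County dataset; the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the skewed continuous columns (hypothetical values,
# including one extreme observation to mimic the long right tail)
df = pd.DataFrame({
    "price": [221900, 538000, 180000, 604000, 7700000],
    "sqft_living": [1180, 2570, 770, 1960, 12050],
})

# np.log1p computes log(1 + x), which is safe even if a column contains zeros
for col in ["price", "sqft_living"]:
    df[col + "_log"] = np.log1p(df[col])

# The log transform compresses the right tail, reducing skewness
print(df[["price", "price_log"]])
```

The same loop, applied to all continuous columns of the full dataset, produces the normalized histograms in Figure 11.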
Figure 11: Histogram

Visualizing data on pairplot with sns.pairplot() in Python


When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This
is very useful for exploring correlations between multidimensional data, when you'd like to plot
all pairs of values against each other (VanderPlas, 2016). I will demo this with the well-known
King County dataset, relating the distributions of several variables to price:
Figure 12: Pairplot

 There is a lot of correlation between the living area ('sqft_living') and other variables
like the size of the lot ('sqft_lot'), the basement ('sqft_basement') and the living area
and lot size of the nearest 15 houses ('sqft_living15' and 'sqft_lot15' respectively).
 A large living area would definitely mean more room for a basement and a larger lot. As
evident from the correlation heatmap, a larger living area is also strongly correlated with
the number of bedrooms, bathrooms and floors.
 Houses that are close to each other are similar in area and similarly priced (refer to the
heatmap). This can be attributed to the clustering of similar houses into a
neighborhood; more affluent neighborhoods will also have similar prices. There aren't
many exceptions in this case.
3.1.3 Visualizing data on box plots


A box plot is a method for graphically depicting groups of numerical data through their quartiles.
Box plots may also have lines extending from the boxes (whiskers) indicating variability outside
the upper and lower quartiles, hence the term box-and-whisker plot. Outliers may be plotted as
individual points. The spacings between the different parts of the box indicate the degree of
dispersion (spread) (Massaron and Boschetti, 2016). Moreover, the box plot (or box-and-whisker
plot) shows the distribution of quantitative data in a way that facilitates comparisons between
variables or across levels of a categorical variable. The box shows the quartiles of the dataset
while the whiskers extend to show the rest of the distribution, except for points that are
determined to be “outliers” using a method that is a function of the inter-quartile range. We can
see outliers plotted as individual points; these are probably the more expensive houses.

Figure 13: Box plot definition

 Minimum: The minimum value in the given dataset


 First Quartile (Q1): The first quartile is the median of the lower half of the data set.
 Median: The median is the middle value of the dataset, which divides the given dataset
into two equal parts. The median is considered as the second quartile.
 Third Quartile (Q3): The third quartile is the median of the upper half of the data.
 Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
 Interquartile Range (IQR): The difference between the third quartile and the first
quartile, i.e., IQR = Q3 − Q1.
 Outlier: Data that fall on the far left or right side of the ordered data are tested to be
outliers. Generally, outliers fall more than a specified distance from the first and third
quartiles, i.e., they are greater than Q3 + (1.5 · IQR) or less than Q1 − (1.5 · IQR)
(Fitrianto et al., 2022).
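The IQR fence rule above can be sketched directly in pandas; the prices below are hypothetical and include one extreme value to show the rule flagging an outlier:

```python
import pandas as pd

# Synthetic price column (hypothetical values, including one extreme outlier)
prices = pd.Series([180000, 221900, 310000, 450000, 538000, 604000, 7700000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Tukey's fences: points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```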
Bedrooms and floors box plots:
Figure 14: Bedrooms and floors box plots

 Bedrooms: The median house price goes up with the number of bedrooms (up to 7) and
bathrooms (up to 5); thereafter it doesn't show a linear trend. Most of the houses have
3 to 4 bedrooms. The horizontal line across each box denotes the median price, and the
lower and upper edges of the box represent the 25th and 75th percentiles respectively.
The vertical lines show the typical range, and outliers above the range are represented
as individual points.
 Interestingly, there are a lot of outliers. A house with no bedrooms might be a studio
apartment, and there is one with 33 bedrooms; noticeably, it is modestly priced in
comparison to some other houses. For any given number of bedrooms, there are many
outliers that don't fall within the range. This can be because certain houses in the Greater
Seattle Area might naturally be more expensive than a suburban house with more
bedrooms. Thus, this is not the only variable that explains price; we have to take the
zipcode, size and other variables into account as well.
 As expected, although there is a positive correlation between these two variables, there isn't
an obvious trend. Penthouses and loft apartments in downtown Seattle might well be
more expensive than a three-story suburban colonial. The price of houses increases up to
around 2.5 floors and then subsequently decreases, so a greater number of floors does not
by itself guarantee a higher price.
Waterfront, view and grade box plots:
Figure 15: Waterfront, view, grade box plots

 Waterfront houses tend to have a better price value.
 The price of waterfront houses tends to be more dispersed, while the price of houses
without a waterfront tends to be more concentrated. It can be observed that the median
price of houses with a waterfront is greater than that of houses without one, but the
maximum house price is for a house without a waterfront.
 Grade and waterfront affect the price. View seems to have a smaller, but still real, effect
on price. It is easy to see an obvious upward trend in the graph: better construction
materials increase the cost of labor and raw materials, which is reflected in the price.
Although we can't deny some exceptions, they definitely aren't as deviant as some of the
variables that we have seen earlier.
Condition and bathrooms box plots:
Figure 16: Bathrooms and condition box plots

 We observe that a better condition doesn't necessarily imply a higher price. Again, it would
be too broad to generalize anything given the number of outliers. The median price is
about the same for all condition ratings, but the maximum price for each condition
increases up to the 4th rating and then decreases.
 It is interesting to note that houses in a mediocre condition (condition = 3) have a lot of
outliers. This is an interesting point for further exploration by the concerned users. Again,
as with the number of bedrooms, the price varies a lot for a given number of bathrooms;
some houses with 2 bathrooms are priced way above some with 4 or 5.
 We might be tempted to think that the number of bathrooms and the price show an
increasing trend. We have to keep in mind that this dataset primarily comes from urban
and suburban regions, so exceptions are bound to occur. The prices of houses with a
greater number of bathrooms are higher, but the trend plateaus near 7-8 bathrooms.
3.1.4 Scatter plot between Price and other variables to view the relationship between them:

Figure 17: Scatter plot demonstrating the association between other variables and price
 The price is low when the number of bedrooms is very low or very high; it seems to be
highest when the number of bedrooms is around 5.
 A similar conclusion can be drawn for the relationship between floors and price.
 The maximum price of houses (for each number of bathrooms) increases as the number
of bathrooms increases.
 The prices are almost similar irrespective of the number of times the house has been
viewed, but the highest prices are for those that have been viewed 2 or more times.
 There seems to be a positive relationship between price and the square footage apart
from the basement, the square footage of the basement, and the living area, i.e., the price
increases as the values of these variables increase.
 The maximum price of houses (for each lot size) decreases as the lot size increases.
 There seems to be no relationship between the age of the building and the price.
3.1.5 How seasonality can affect house prices:
 Looking at the box plots below, we notice that there is not a big difference between 2014 and
2015. The number of houses sold tends to be similar every month. The line plot
shows that around April there is an increase in house prices.

Figure 18: Years and months box plots


Figure 19: Trend line of the month and price

This question aims to answer which season and month are more affordable. To answer it, the
'month' and 'season' columns, which were previously created and added to the DataFrame, are
used. Furthermore, the groupby() function is used to group the data by month and season, and
the mean() function gives the mean price for each season and month.
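A minimal sketch of this groupby/mean step on a synthetic stand-in (in the paper, the 'month' and 'season' columns were derived earlier from the sale date; the values here are hypothetical):

```python
import pandas as pd

# Synthetic sales records standing in for the King County data (hypothetical values)
df = pd.DataFrame({
    "price": [300000, 420000, 350000, 500000, 310000, 470000],
    "month": [2, 4, 2, 4, 10, 4],
    "season": ["winter", "spring", "winter", "spring", "fall", "spring"],
})

# Average price per month and per season
print(df.groupby("month")["price"].mean())
print(df.groupby("season")["price"].mean())
```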

Figure 20: Average price by months


Figure 21: Average price by seasons

As predicted, the best season to buy a house is winter. Fall has the highest price range, followed
by summer and spring, while spring has the highest average price, followed by summer and fall.
The average prices for each season are very close, and the median values for each season are
close too, so season is not, as expected, a good predictor for the regression model. Among the
months, October has the highest variance and February the lowest; February has the lowest
average price and April the highest. There isn't a big difference between the median values and
average prices across months, so month is also not a good predictor for the regression model.

3.2 Linear regression model:


3.2.1 Linear regression model
When we model a linear relationship between a response and just one explanatory variable, this is
called simple linear regression. Since we want to predict house prices, our response variable is
price. However, for a simple model we also need to select a feature. When I look at the columns
of the dataset, living area (sqft) seemed the most important feature. When we examine the
correlation matrix, we may observe that price has the highest correlation coefficient with living
area (sqft), which also supports my choice. Thus, I decided to use living area (sqft) as the
feature, but if you want to examine the relationship between price and another feature, you may
prefer that one.

Figure 22: OLS regression result

I also printed the intercept and coefficient for the simple linear regression. By using these values
and the definition below, we can estimate house prices manually. The equation we use for our
estimations is called the hypothesis function and is defined as

𝑦 = 𝜃₀ + 𝜃₁𝑥    (6)
Since we have just two dimensions in the simple regression, it is easy to draw. The chart below
shows the result of the simple regression. It does not look like a perfect fit, but when we work
with real-world datasets, a perfect fit is not easy to achieve.

Figure 23: Linear regression model

With these graphs we can observe the following information:

 The value of residences tends to increase with the number of rooms, but from 6 rooms
onwards this value tends to fall, which may mean that the number of rooms beyond 6
does not have much influence on the price; this confirms the information from the
correlation matrix made earlier;
 The number of bathrooms has a direct positive influence on the price of homes, although
there is a lot of dispersion in the data from 5 bathrooms onwards. Note also that the
number of bathrooms takes fractional values; this may look like a data-collection error,
but fractional counts (half bathrooms) are common in US listings;
 sqft_living also has a direct influence, despite the large dispersion from 7,000 sq ft
onwards;
 As already seen, "grade" also has a positive influence;
 sqft_above and sqft_living15 have a positive influence, although both show dispersion
from 5,000 sq ft onwards.
3.2.2 Important features for determining the house price:
This section will explore what features bring the most influence to the outcome of the model. The
graph in the figure below shows feature importance generated through Linear regression model.

Figure 24: Features ranking

Some features that get a high score in feature importance are worth discussing. The first one is
location. Note that although latitude ranks first among all features, longitude also gets a high
score and is supposed to be taken into consideration. After all, the combination of latitude and
longitude represents the location of the house, thus influencing the house price. This highly
conforms to real life. Location always serves as a determining factor for house price. For
instance, a place with convenient public transportation is usually sold for a higher price.
Likewise, a place near parks or lakes is priced highly for its
surroundings. The second important factor is the area of living space. It is not surprising for
‘sqft_living’ to come to this place because when houses are sold, they are priced a certain amount
of money per square meter or square foot. Therefore, the larger the house, the greater amount of
money customers need to pay. The third important factor is the grade of the house. It is reasonable
that a house in good condition will be more attractive to consumers, resulting in a higher price for
sale. The feature importance outcome in this case, in general, is highly compatible with our
consensus. House sellers ought to pay more attention to these features to gain more revenue and
attract customers.
Homes with the following description tend to have better sales prices:
 Number of rooms: 4
 Number of bathrooms: 3
 Living area: up to 60,000 sq ft
 Quality of materials (grade): 7 or 8
 Area above ground: up to 60,000 sq ft
 Number of floors: 1 or 2
So, if an acquired home falls below this description, a renovation making these changes can
increase the selling price.

3.3 Result evaluations


After preprocessing the data, we fit the data into the models and acquired the outcome. With the
purpose of evaluating the models, we picked several statistical indicators. The first indicator is
the Root Mean Square Error (RMSE). While R Square is a relative measure of how well the
model fits the dependent variable, Mean Square Error (MSE) is an absolute measure of goodness
of fit. MSE is calculated as the sum of squares of the prediction errors (real output minus
predicted output) divided by the number of data points; it gives an absolute number for how
much the predicted results deviate from the actual values. Root Mean Square Error (RMSE) is
the square root of MSE. It is used more commonly than MSE because, firstly, the MSE value can
sometimes be too big to compare easily, and secondly, since MSE is calculated from squared
errors, taking the square root brings the measure back to the scale of the prediction error and
makes it easier to interpret (David, 2014). It can be utilized to measure the precision of a
regression model. RMSE is calculated as:

𝑅𝑀𝑆𝐸(𝑋, ℎ) = √[(1/𝑚) ∑ᵢ₌₁ᵐ (ℎ(𝑥⁽ⁱ⁾) − 𝑦⁽ⁱ⁾)²]    (7)

where 𝑚 is the number of instances in the dataset, 𝑥⁽ⁱ⁾ is the vector of feature values of the i-th
instance, 𝑦⁽ⁱ⁾ is the target value for that instance, 𝑋 is the matrix containing the feature values of
all instances, and ℎ is the system’s prediction function.
The second indicator that we picked is R-squared: the greater the value of R-squared, the better
the model fits. R Square measures how much of the variability in the dependent variable can be
explained by the model. It is the square of the correlation coefficient (R), which is why it is
called R Square. R Square is calculated as one minus the sum of squared prediction errors
divided by the total sum of squares, in which the prediction is replaced by the mean. The R
Square value is between 0 and 1, and a bigger value indicates a better fit between prediction and
actual value. R Square is a good measure of how well the model fits the dependent variable
(David, 2014). R-squared is calculated in the following way:
𝑅² = 1 − [∑ᵢ (𝑦ᵢ − 𝑦̂ᵢ)²] / [∑ᵢ (𝑦ᵢ − 𝑦̅)²]    (8)

where 𝑅² is the value of R-squared, 𝑦ᵢ is the true value of a target observation, 𝑦̂ᵢ is the predicted
value, and 𝑦̅ is the mean of the target vector.

Third, we selected the value of adjusted R-squared (adjusted 𝑅²) because there is a problem with
𝑅²: when the total number of features increases, 𝑅² will also increase, regardless of whether
the added variable is indeed closely related to the target variable (Massaron and Boschetti, 2016).
Adjusted 𝑅² can be denoted as:

𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑 𝑅² = 1 − [∑ᵢ (𝑦ᵢ − 𝑦̂ᵢ)² / (𝑛 − 𝑝 − 1)] / [∑ᵢ (𝑦ᵢ − 𝑦̅)² / (𝑛 − 1)]    (9)

where p is the number of variables and n is the number of instances.


Mean Absolute Error (MAE): MAE is similar to Mean Square Error (MSE). However, instead of
the sum of squared errors used in MSE, MAE takes the sum of the absolute values of the errors.
Compared to MSE or RMSE, MAE is a more direct representation of the error terms: MSE
penalizes big prediction errors more heavily by squaring them, while MAE treats all errors the
same (David, 2014).

𝑀𝐴𝐸 = (1/𝑛) ∑ᵢ |𝑦ᵢ − 𝑦̂ᵢ|    (10)
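All four indicators can be computed directly with NumPy. The sketch below uses hypothetical true and predicted prices, with p = 1 feature as in the simple model:

```python
import numpy as np

# Hypothetical true and predicted prices, for illustration only
y_true = np.array([221900., 538000., 180000., 604000., 510000.])
y_pred = np.array([250000., 500000., 210000., 580000., 530000.])

n = len(y_true)  # number of instances
p = 1            # number of features in the simple model

mse = np.mean((y_true - y_pred) ** 2)          # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))         # Mean Absolute Error

ss_res = np.sum((y_true - y_pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

print(f"RMSE={rmse:.1f}  MAE={mae:.1f}  R2={r2:.3f}  adj R2={adj_r2:.3f}")
```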

Model Evaluation Results:

Model: Simple Linear Regression
  R-squared (training): 0.658
  Adjusted R-squared (training): 0.657
  Root Mean Squared Error (RMSE): 254289.149
  Mean Squared Error (MSE): 28102062180.86872
  Mean Absolute Error (MAE): 104004.9563336193
4. MAP VISUALIZATION:
Location is known to be a main determinant in people's efforts to estimate the value of a house.
However, the type of location, and subsequently how the value of that location is estimated, has
not been investigated to the same degree (Heyman and Sommervoll, 2019). The location of a
residential property in a city directly affects its market price. Each location represents different
values in variables such as accessibility, neighborhood, traffic, socio-economic level and
proximity to green areas, among others. In addition, that location has an influence on the choice
and on the offer price of each residential property. Therefore, thanks to the folium package in
Python, I have implemented the map visualizations below to answer the question: how does
location affect house prices?

Figure 25: Scatter plot of King County shape

The above scatter plot roughly traces the shape of King County. It can be seen that higher-priced
houses are located in some specific regions, especially near the coasts. Specifically, the
high-priced houses are located between latitudes of 47.5° and 47.7° and longitudes of −122.4°
and −122.0°. This also indicates that geographical location (latitude, longitude) is a key factor
in deciding house price. To visualize the most expensive areas of King County, the longitude and
latitude data are used with a scatter plot. Here it can be observed that, in general, the northern
part of the county and the houses surrounding Lake Washington have higher prices.
Figure 26: Scatter plot with longitude and latitude of King County

Figure 27: House price with map visualization


Figure 28: Highest and lowest house price with heatmap

Figure 29: Average price by each zipcode (area)


As recognized from the above maps, average house prices in the northwest and northeast parts of
the county are higher than in the rest of the county. At this point, grouping the latitude and
longitude according to their cardinal directions can help prove the assumptions based on the
maps. In order to create a direction for each geographical data point, the 'lat' and 'long' columns
are zipped together; the resulting list is then iterated over with a for loop. Later, the 'direction'
column can be used as a predictor in the regression analysis. It can be observed that most of the
houses facing the ocean or a lake have high prices, which was evident from the waterfront
variable as well.
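A minimal sketch of deriving the 'direction' column by zipping 'lat' and 'long', assuming a hypothetical dividing point near the county's center (coordinates below are illustrative):

```python
import pandas as pd

# Synthetic stand-in for the 'lat' and 'long' columns (hypothetical values)
df = pd.DataFrame({
    "lat": [47.72, 47.31, 47.66, 47.40],
    "long": [-122.32, -122.21, -121.98, -121.90],
})

# A rough geographic center of King County, used as the dividing point (assumption)
CENTER_LAT, CENTER_LONG = 47.5, -122.1

def direction(lat, long):
    ns = "north" if lat >= CENTER_LAT else "south"
    ew = "west" if long <= CENTER_LONG else "east"
    return ns + ew

# Zip lat/long together and iterate, as described in the text
df["direction"] = [direction(lat, long) for lat, long in zip(df["lat"], df["long"])]
print(df)
```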

Figure 30: Average price by direction


Figure 31: Scatter plot house price by direction

As expected, the northwest of the county has the highest price and highest variance, with the
northeast second in both. The southwest and southeast parts of the county have the lowest prices;
the average prices for the northeast and northwest are almost the same, and the average price of
the southeast is higher than that of the southwest. Direction may not be as effective a predictor as
zip codes.

5. CONCLUSIONS AND FUTURE RESEARCHES:


5.1 Conclusions:
In this paper, the issue of house price prediction is explored using a case from King County in the
United States. In order to eliminate the problems that exist in the original dataset, this paper not
only trims the extreme values in numerical features like ‘price’, but also calculates the
correlation coefficients and removes the highly correlated features, including ‘sqft_living15’ and
‘sqft_lot15’, to assure a precise prediction. The linear regression model and corresponding
essential factors are subsequently derived through Python coding. Comparison among related
literature is also conducted to complete a further discussion of the topic. From the research, we
obtain the following conclusions. First, linear regression serves as one of the best models for our
house price prediction. Second, the most important micro-level factors that influence house
prices are location, living space and the condition of the house. Such a finding highly
conforms to our common sense.
The innovations of this essay are summarized as follows. First and foremost, this essay adopts a
Linear Regression Model to predict house prices. This approach achieves better prediction
precision compared to extant research papers on the same issue. In addition, this essay focuses
on house price prediction from a micro perspective rather than the macro perspective used by
more scholars. This brings about an essential
supplement to research on the house price prediction. Moreover, all the steps required for the
successful completion of the house price prediction system have been completed. It is seen that
the multiple linear regression model is suitable for the purpose of predicting house prices.
Despite the merits above, this essay still bears some slight drawbacks. First, this paper does not
cover the macroeconomic factors. If they were taken into consideration, the results might be closer
to real-life situations. Besides, this paper conducts a case study of King County of the US.
However, for other areas that are not similar to King County, additional study is probably needed.

5.2 Further discussion and future researches:


This paper mainly focuses on the prediction of house prices from a comparatively micro scope.
Properties of the houses are utilized to determine how much the houses are priced and what the
important factors are in influencing house prices. A similar approach to searching for the
determinant factors of house prices has also been used in a good number of papers. A
comparison and contrast of the findings of these papers is conducted in this section.
According to Mathur, who also conducted a survey of house prices in King County of the United
States, the size and quality of a certain house matter the most for determining the house price.
Bigger size and better quality will bring about a higher estimated value for assessors. Such a
finding highly conforms to our outcome, where living space and grade rank among the top three.
He also finds that a higher level of maintenance will make the house appreciate. Scholars studying
other areas also contribute to this topic. Zakaria and Fatine conducted research on determinants of
real estate’s price in Morocco. They find out that two factors most significantly determine the
house price, which is a surface area as well as the location of the real estate (Wang and Zhao,
2022). These two factors rank second and first, respectively, in our findings. Selim did research
figuring out the determinants of house prices in Turkey. Taking even more properties into
account,
he concludes that the condition of the water system, whether the house has a swimming pool, and
the type of the house (what material the house is made of) are the most important factors (Selim,
2009). These factors seem to deviate from the previous findings. However, if inspected carefully,
they are, to some extent, related to the grade of a house. Besides, he mentions that the number of
rooms and the locational characteristics are also important. These factors are compatible with
our findings in this paper.
All the literature mentioned above approaches house price prediction and important-factor
determination from a micro perspective. Extant literature effectively attests to the validity of our
paper’s findings. Though there exist some slight differences, the general outcome is quite
similar: house location, living space, and the condition of the house are indeed among the most
essential micro-level features for determining how a certain accommodation will be priced.
There are some improvements and additions that can be made. The first is the ability to increase
and update the dataset on a regular basis. This will make the prediction system more correct and
accurate. Furthermore, in the future, I will apply more regression models to achieve more
accurate prediction results, such as CatBoost, LightGBM, XGBoost, Random Forest and
polynomial regression.
6. EXPECTED OUTCOMES
The results of this research can be used in the development of a contextual suggestion system,
which collects the user’s current context, such as location, personal preferences, income, and
affordability, to automatically recommend suitable houses without explicit user input. However,
this is not the main purpose of the research. As mentioned in Chapter 5, “Progress so far”, the
main goal is to write a thesis for submission to the University of Greenwich. While writing this
thesis, I hope to produce another paper to submit to ADCS. Furthermore, if a good result is
achieved, one more paper will be written for submission to SIGIR. The last paper is the most
expected outcome, since it can be used to apply for a Master’s scholarship.
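The contextual suggestion idea can be sketched as a simple filter-and-rank step over predicted prices. The field names, the sample records, and the scoring rule below are hypothetical assumptions for illustration, not part of the system described in this paper.

```python
# Hypothetical candidate houses with model-predicted prices (illustrative data).
houses = [
    {"id": 1, "predicted_price": 450_000, "zipcode": "98103"},
    {"id": 2, "predicted_price": 700_000, "zipcode": "98103"},
    {"id": 3, "predicted_price": 420_000, "zipcode": "98052"},
]

def suggest(houses, budget, preferred_zip):
    """Return affordable houses, preferred area first, then cheapest first."""
    affordable = [h for h in houses if h["predicted_price"] <= budget]
    return sorted(
        affordable,
        key=lambda h: (h["zipcode"] != preferred_zip, h["predicted_price"]),
    )

suggestions = suggest(houses, budget=500_000, preferred_zip="98103")
for h in suggestions:
    print(h["id"], h["predicted_price"])
```

A full system would derive the budget from the user’s stated income and affordability, and rank by a richer preference score rather than this two-key sort.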

ACKNOWLEDGMENTS:
Copyright © 2022 by the authors. This is an open-access article distributed under the Creative
Commons Attribution License which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited (CC BY 4.0).
REFERENCES
BONATO, M., CEPNI, O., GUPTA, R. & PIERDZIOCH, C. 2022. Forecasting realized volatility
of international REITs: The role of realized skewness and realized kurtosis. Journal of
Forecasting, 41, 303-315.
COOPER, D. 2013. House price fluctuations: the role of housing wealth as borrowing collateral.
Review of Economics and Statistics, 95, 1183-1197.
DAVID, F. G. 2014. Business Statistics: A Decision-Making Approach. Pearson.
FEW, S. 2009. Introduction to geographical data visualization. Visual Business Intelligence
Newsletter (Perceptual Edge), 2.
FITRIANTO, A., MUHAMAD, W. Z. A. W., KRISWAN, S. & SUSETYO, B. 2022. Comparing
outlier detection methods using boxplot, generalized extreme studentized deviate, and
sequential fences. Aceh International Journal of Science and Technology, 11.
GROENEVELD, R. A. & MEEDEN, G. 1984. Measuring skewness and kurtosis. Journal of the
Royal Statistical Society: Series D (The Statistician), 33, 391-399.
HEYMAN, A. V. & SOMMERVOLL, D. E. 2019. House prices and relative location. Cities, 95,
102373.
KOSE, M. A., OTROK, C., HIRATA, H. & TERRONES, M. 2013. Global House Price
Fluctuations: Synchronization and Determinants. International Monetary Fund.
MASSARON, L. & BOSCHETTI, A. 2016. Regression Analysis with Python. Packt Publishing
Ltd.
MATHUR, S. 2019. House price impacts of construction quality and level of maintenance on a
regional housing market: Evidence from King County, Washington. Housing and Society,
46, 57-80.
MURRAY, P. 2016. Data USA: King County, WA.
NELLI, F. 2015. Python Data Analytics: Data Analysis and Science Using pandas, matplotlib
and the Python Programming Language. Apress.
PANDEY, K. & PANCHAL, R. 2020. A study of real world data visualization of COVID-19
dataset using Python. International Journal of Management and Humanities, 4, 1-4.
SELIM, H. 2009. Determinants of house prices in Turkey: Hedonic regression versus artificial
neural network. Expert Systems with Applications, 36, 2843-2852.
SINAI, T., HIMMELBERG, C. & MAYER, C. 2005. Assessing High House Prices: Bubbles,
Fundamentals, and Misperceptions. National Bureau of Economic Research.
SOFFRITTI, G. & GALIMBERTI, G. 2011. Multivariate linear regression with non-normal
errors: a solution based on mixture models. Statistics and Computing, 21, 523-536.
VANDERPLAS, J. 2016. Python Data Science Handbook: Essential Tools for Working with
Data. O'Reilly Media, Inc.
WANG, Y. & ZHAO, Q. 2022. House price prediction based on machine learning: A case of
King County. 2022 7th International Conference on Financial Innovation and Economic
Development (ICFIED 2022). Atlantis Press, 1547-1555.
