
House Price Prediction Using Machine Learning Techniques

Ashray Kakadiya, Khushal Shingala, Shivraj Sharma


California State University, Sacramento

Abstract

Using the Ames Housing dataset, we predict the sale prices of homes in Ames, Iowa, applying machine learning steps such as data cleaning, data visualization, and feature engineering, together with advanced regression techniques such as random forests and gradient boosting. We used Python as the scripting language, along with machine learning libraries such as pandas, NumPy, and scikit-learn, to implement our prediction model.

1. Introduction

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset shows that much more influences price negotiations than the number of bedrooms or a white-picket fence.

'The most difficult thing in life is to know yourself.' This quote belongs to Thales of Miletus, a Greek philosopher, mathematician and astronomer. We would not say that knowing your data is the most difficult thing in data science, but it is time-consuming, so it is easy to overlook this initial step and jump into modeling too soon. We spent most of our time understanding the data.

Decision trees leave you with a difficult decision. A deep tree with many leaves will overfit, because each prediction comes from historical data on only the few houses at its leaf. A shallow tree with few leaves will perform poorly, because it fails to capture enough distinctions in the raw data. Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting, but many models have clever ideas that lead to better performance. We therefore first implemented a random forest to see its results, and then moved to gradient boosting, a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

The dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. Our aim is to do comprehensive data exploration and stacked regression to increase the accuracy of the model. This project also lets us apply the techniques we learned in class, such as data cleaning, regression analysis, ensemble learning and feature engineering.

2. Related Work

There is much related work on house price prediction, using different prediction models and different approaches such as XGBoost, gradient boosting and random forests. The reference we followed built its prediction model with linear regression, after data cleaning, data visualization, and an analysis of how common factors affect house prices. The approach was as follows (a minimal sketch of this pipeline is given below):

• Import the dependencies; for linear regression the reference used sklearn (a Python library) and imported its linear regression class.
• Initialize the linear regression model and assign it to a variable reg.
• Set the labels (the output) to the price column, and convert the dates to 1's and 0's so that they do not overly influence the model.
• Import another dependency to split the data into training and test sets, 90% and 10% respectively, randomizing the split with random_state.
• With the training data, test data and labels prepared, fit the training data to the linear regression model.
• After fitting the model, check its score, i.e. the quality of the prediction; in this case the score is 73%.
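This pipeline can be written compactly with scikit-learn. The snippet below is only a sketch of the steps just listed, not the reference author's actual code: the file name houses.csv, the column names, and the exact date encoding are placeholders.

    # Sketch of the reference's linear regression pipeline (placeholder file and
    # column names; the reference's exact date encoding is not reproduced here).
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("houses.csv")              # hypothetical input file
    labels = data["price"]                        # output: the sale price column
    features = data.drop(["price"], axis=1)

    # one plausible way to turn the date column into 1's and 0's so that raw
    # dates do not dominate the fit
    features["date"] = (pd.to_datetime(features["date"]).dt.year == 2014).astype(int)

    # 90% train / 10% test, randomized via random_state
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.10, random_state=2)

    reg = LinearRegression()
    reg.fit(x_train, y_train)
    print(reg.score(x_test, y_test))              # about 0.73 in the reference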
The accuracy of that model is too low to be useful. To achieve a satisfactory accuracy, we decided to implement random forest and gradient boosting, with better data cleaning, data exploration and data visualization, on a different dataset that covers almost every aspect relevant to house price prediction.

3. Data Set

Finding a dataset that contains most of the variables needed to predict house prices was the hard part of this task. The expectations for the dataset were:

It should have essential and adequate variables from which a composite decision parameter can be formed and results can be obtained.

It should not have a high frequency of conflicting information, and it should be in an accessible and clean format on which data preprocessing can be performed.

3.1 Data Issues

3.1.1 Data Quality
Certain fields lack values for the attributes of interest and contain only aggregate information. Some of the field values for the basic decision parameters were absent, so part of the information had to be filled in. Incomplete, aggregate-only information was another issue.

3.1.2 Performance
The data contained mistakes and outliers. Since the information was not clean, it was not possible to obtain the expected output without removing these errors and outliers, which was one of the aspects we had to address to get the expected results.

3.1.3 Unformatted Text
Some data can be in a different format than expected, so the machine cannot determine the output. For example, data that should be in integer format may instead be stored as characters, which creates issues during training and calculation.

3.1.4 Noisy Data
Particular fields contain information that cannot be understood and interpreted correctly by machines, such as unstructured content. For example, some fields in the dataset contained numerous unstructured values, including strange symbols that a machine cannot read.

3.1.5 Inconsistent Data
Some fields contain inconsistencies (an absence of similarity or comparability between at least two facts). The frequency of this kind of data was high in all of the fields where one fact was represented in several different ways: code names, symbols, abbreviations and so on.

3.2 Project Data

The labeled dataset consists of 1460 rows with 79 different variables describing each house, such as the number of bedrooms, specifically selected for house price analysis. The target is not binary: the final sale price depends on attributes such as the height of the ceilings, the proximity of the nearest school or transportation center, and the number of bedrooms and bathrooms, across all 79 prediction variables. The 1460-row labeled training set does not include any of the same houses as the 1460-row test set.

train - the labeled training set. This file has a header row containing all the variable names, followed by 1460 rows containing the 79 explanatory variables and the sale price for each house.

test - the test set. This file has a header row followed by 1460 rows containing the same variables, for testing. The task is to predict the sale price for each of these houses.
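As an illustration (not part of the original submission), the two files described above can be loaded and inspected with pandas; the file names train.csv and test.csv are the names used by the Kaggle competition.

    # Sketch: load and inspect the Ames housing files described above
    # (assumes the Kaggle files were downloaded as 'train.csv' and 'test.csv').
    import pandas as pd

    train = pd.read_csv("train.csv")   # header row + 1460 labeled houses
    test = pd.read_csv("test.csv")     # same variables, without the sale price

    print(train.shape, test.shape)                       # rows x columns of each file
    print(train["SalePrice"].describe())                 # summary of the target variable
    print(train.isnull().sum().sort_values(ascending=False).head(10))  # most-missing columns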
4. Methods for data processing

The following methods can be used to process the data and improve the quality of the predictions:

4.1 Data cleaning

Cleaning can be applied to the raw data through normalization, statistical methods and duplicate elimination. The data issues described above have to be resolved before an optimized house price prediction is possible. For example, to improve data quality we can eliminate unneeded variables, or variables that would degrade the model's performance. Noisy or unformatted data can be handled by converting the unformatted values into the proper format.

4.2 Data visualization

Let us look at this problem from a builder's perspective: it is sometimes important for a builder to see which house type sells best, so that new houses can be built accordingly. From the visualization, 3-bedroom houses are sold most often, followed by 4-bedroom houses. How is this useful? A builder who has this information can plan a new development with more 3- and 4-bedroom houses to attract more buyers.

A second visualization looks at the location of the houses based on latitude and longitude. The dataset has latitude and longitude for each house, so we can look at the most common locations and at how the houses are placed. We used seaborn, another Python library for data representation; its jointplot function shows the concentration and placement of the data and can be really useful. From this visualization, there are many houses at latitudes between 47.7 and 48.8, which suggests that this is a popular location, and the concentration of longitudes is high between -122.2 and -122.4, which suggests that most of the purchases were in this particular area.

4.3 Gradient boosting

Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

Gradient boosting involves three elements: a loss function to be optimized, a weak learner to make predictions, and an additive model that adds weak learners to minimize the loss function.

We use the sklearn library for it. We create a variable holding our gradient boosting regressor and set its parameters (a minimal sketch is given below):

n_estimators - the number of boosting stages to perform; it should not be set too high, which would overfit the model.
max_depth - the maximum depth of each tree.
learning_rate - the rate at which the model learns from the data.
loss - the loss function to be optimized; 'ls' refers to least-squares regression.
min_samples_split - the number of samples required to split a node.

We then fit our training data to the gradient boosting model and check the accuracy.
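A minimal version of this setup is sketched below. The hyper-parameter values are illustrative placeholders rather than our tuned values, and x_train, x_test, y_train and y_test are assumed to come from an earlier train/test split of the cleaned data.

    # Sketch of the gradient boosting setup described above (placeholder
    # hyper-parameter values; x_train/y_train come from an earlier split).
    from sklearn.ensemble import GradientBoostingRegressor

    gbr = GradientBoostingRegressor(
        n_estimators=400,        # number of boosting stages; too many can overfit
        max_depth=5,             # depth of each individual tree
        learning_rate=0.1,       # how strongly each new tree corrects the model
        min_samples_split=2,     # samples required to split an internal node
        loss="squared_error",    # least-squares loss (called 'ls' in older scikit-learn)
    )

    gbr.fit(x_train, y_train)               # fit the model on the training data
    print(gbr.score(x_test, y_test))        # R^2 score on the held-out data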
5. What we have done

We went through the following steps to get to know our data:

1. Understand the problem. We look at each variable and do a philosophical analysis of its meaning and importance for this problem.

In order to have some discipline in our analysis, we created an Excel sheet with the following columns:
1) Variable,
2) Type (numerical, categorical),
3) Segment (e.g. building, space, location),
4) Expectation (our expectation of the variable's influence: high, medium, low),
5) Conclusion (the importance of the variable),
6) Comments (any general comment).

We went through this process and concluded that the following variables can play an important role in this problem: OverallQual (a variable we do not like, because we do not know how it was computed; a funny exercise would be to predict 'OverallQual' using all the other variables available), YearBuilt, TotalBsmtSF, and GrLivArea. We thus ended up with two 'building' variables ('OverallQual' and 'YearBuilt') and two 'space' variables ('TotalBsmtSF' and 'GrLivArea'). This might be a little unexpected, as it goes against the real estate mantra that all that matters is 'location, location and location'. It is possible that this quick data examination process was a bit harsh on the categorical variables.

2. Univariable study. We focus on the dependent variable ('SalePrice') and try to learn a little more about it. We analyzed the 'SalePrice' variable and its relationship to the categorical and numerical features, and concluded that:

1) 'GrLivArea' and 'TotalBsmtSF' seem to be linearly related to 'SalePrice'. Both relationships are positive, meaning that as one variable increases, the other also increases. In the case of 'TotalBsmtSF', the slope of the linear relationship is particularly high.
2) 'OverallQual' and 'YearBuilt' also seem to be related to 'SalePrice'. The relationship seems to be stronger for 'OverallQual', where the box plot shows how sale prices increase with the overall quality.

3. Multivariate study. We try to understand how the dependent variable and the independent variables relate. To explore this universe, we started with some practical recipes to make sense of our 'plasma soup' (source: http://umich.edu/~gs265/bigbang.htm):
• Correlation matrix (heatmap style).
• 'SalePrice' correlation matrix.
• Scatter plots between the most correlated variables.

At first sight, two red squares in the heatmap get our attention. The first one refers to the 'TotalBsmtSF' and '1stFlrSF' variables, and the second one refers to the 'GarageX' variables. Both cases show how significant the correlation between these variables is. Another thing that caught our attention was the 'SalePrice' correlations: our well-known 'GrLivArea', 'TotalBsmtSF' and 'OverallQual' are highly correlated with it, but many other variables should also be taken into account. The mega scatter plot gives us a reasonable idea of the relationships between variables.

4. Basic cleaning. We clean the dataset and handle the missing data, outliers and categorical variables (a short sketch of these steps is given after this list).

Variables such as 'PoolQC', 'MiscFeature' and 'FireplaceQu' have a large share of missing values, so we deleted them. Regarding 'MasVnrArea' and 'MasVnrType', we consider these variables not essential; furthermore, they have a strong correlation with 'YearBuilt' and 'OverallQual', which are already considered. We have one missing observation in 'Electrical'; since it is just one observation, we delete that observation and keep the variable.

Univariate analysis: the primary concern here is to establish a threshold that defines an observation as an outlier. To do so, we standardized the data; in this context, we converted the data values to have a mean of 0 and a standard deviation of 1.

5. Test assumptions. We check whether our data meets the assumptions required by most multivariate techniques. According to Hair et al. (2013), four assumptions should be tested: normality, homoscedasticity, linearity, and absence of correlated errors. We are still in the process of testing these assumptions.
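The basic cleaning and the univariate outlier step described in item 4 above can be sketched as follows. The DataFrame name train comes from Section 3.2; the code is an illustration of the procedure described, not a verbatim copy of our notebook.

    # Sketch of the basic cleaning and standardization steps described above
    # (assumes `train` is the labeled DataFrame loaded in Section 3.2).
    from sklearn.preprocessing import StandardScaler

    # drop the variables we decided not to keep (mostly missing or redundant)
    train = train.drop(columns=["PoolQC", "MiscFeature", "FireplaceQu",
                                "MasVnrArea", "MasVnrType"])

    # 'Electrical' has a single missing observation: drop that row, keep the variable
    train = train.dropna(subset=["Electrical"])

    # standardize SalePrice to mean 0 and standard deviation 1, then inspect the
    # extremes to decide on an outlier threshold
    saleprice_scaled = StandardScaler().fit_transform(train[["SalePrice"]]).ravel()
    print(sorted(saleprice_scaled)[:10])      # lowest standardized sale prices
    print(sorted(saleprice_scaled)[-10:])     # highest standardized sale prices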
6. References

[1] House Prices: Advanced Regression Techniques (Dataset). https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

[2] A Beginner's Guide to Recurrent Networks and LSTMs - Deeplearning4j: Open-source, Distributed Deep Learning for the JVM. https://deeplearning4j.org/lstm.html

[3] Using Recurrent Neural Networks in DL4J - Deeplearning4j: Open-source, Distributed Deep Learning for the JVM. https://deeplearning4j.org/usingrnns

[4] Pythonic Data Cleaning With NumPy and Pandas. https://realpython.com/python-data-cleaning-numpy-pandas/

[5] Create a model to predict house prices using Python. https://towardsdatascience.com/create-a-model-to-predict-house-prices-using-python-d34fe8fad88f

[6] A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning. https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

[7] How to Implement Random Forest From Scratch in Python. https://machinelearningmastery.com/implement-random-forest-scratch-python/

[8] Bagging and Random Forest Ensemble Algorithms for Machine Learning. https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/
