
Modeling House Price Prediction using Linear Regression Analysis

Abstract—House prices increase every year, so there is a need for a system that predicts future house prices. House price prediction can help a developer determine the selling price of a house and can help a customer choose the right time to purchase one. Three factors influence the price of a house: its physical condition, its concept, and its location. This research aims to predict house prices in Iowa City using regression analysis, and the results are evaluated with the root mean square error.

Index Terms—regression, lasso, ridge, elastic net, prediction, svm, random forest

I. INTRODUCTION
Investment is a business activity that attracts a great deal of interest in this era of globalization. Several kinds of assets are often used for investment, for example gold, stocks, and property. Property investment in particular has increased significantly since 2011, in both demand and sales. One of the reasons for the increasing demand for property is population growth.
There are several approaches that can be used to determine the price of a house; one of them is prediction analysis. The first approach is quantitative prediction. A quantitative approach utilizes time-series data and looks for the relationship between current prices and past prices.

The second approach we use is classification, in order to compare whether linear regression or classification techniques, namely the support vector machine (SVM) and the random forest, perform better. In linear regression, the coefficients are generally determined with the least squares method, but it takes a long time to arrive at the best formula.
RMSE is used to select the variables that affect house prices, and regression is used to determine the optimal coefficients for prediction. House price predictions are expected to help people who plan to buy a house, so that they know the expected price range and can plan their finances well. In addition, house price predictions are also beneficial for property investors who want to know the trend of housing prices in a certain location. This research focuses on houses in Iowa City.
II. RELATED WORK

A. House Price Affecting Factors

There are several factors that affect house prices. In this research we divide them into three main groups: physical condition, concept, and location. Physical conditions are properties possessed by a house that can be observed by human senses, including the size of the house, the number of bedrooms, the availability of a kitchen and garage, the availability of a garden, the area of the land and buildings, and the age of the house.
B. Preprocessing

Before feeding the data to the algorithms, the data is preprocessed. We first import the dataset and explore it by looking for possible correlations between the dependent and independent variables, since linear regression models are sensitive to non-linear data and outliers. The outliers are removed from TotalBsmtSF and GrLivArea. After this, the missing values are dealt with during encoding. Non-linear data present in the dataset is transformed to linear data by using a log transformation. Some of the features are integrated to form a new feature. In feature selection, we select the smallest number of features that are highly correlated with the dependent variable. In the final step of preprocessing, the data is standardised. Because we use cross-validation, the scaling has to be done independently for the training and testing data.
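A minimal sketch of these steps, assuming the Kaggle training file and its column names (TotalBsmtSF, GrLivArea, 1stFlrSF, 2ndFlrSF, SalePrice); the outlier thresholds are illustrative assumptions, not the paper's exact values:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("train.csv")  # Kaggle "House Prices" training data (assumed file name)

    # Drop the extreme outliers in the two area features named above;
    # the cut-offs here are illustrative assumptions.
    df = df[(df["TotalBsmtSF"] < 5000) & (df["GrLivArea"] < 4500)]

    # Log-transform the skewed target so a linear model fits better.
    df["SalePrice"] = np.log1p(df["SalePrice"])

    # One way to integrate features into a new one (the conclusion mentions TotalSF).
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

    # Encode categoricals; missing values become their own indicator column.
    df = pd.get_dummies(df, dummy_na=True)
    df = df.fillna(df.median(numeric_only=True))

Standardisation is deliberately left to a pipeline fitted inside cross-validation (see the comparison sketch in Section IV) so that scaling parameters are learned from the training folds only.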
III. DATASET

The datasets used in some of the previous works on housing price prediction are not large. For example, Bahia studies only 506 samples to develop the model (Bahia, 2013). There is a limitation in the paper by Pow saying that the research does not have enough historical transaction data; Pow points out that using enough data might increase the performance of the predictions (Pow, Janulewicz, and Liu, 2014).

This project uses a dataset from the Kaggle open-source datasets. The dataset consists of 80 explanatory features and 1460 entries of housing sales in Iowa, USA. It describes different aspects of housing sales from 2006 to 2010 (Kaggle Inc.).

The independent variables are date, price, bedrooms, bathrooms, sqft living, sqft lot, floors, waterfront, view, condition, grade, sqft above, sqft basement, yr built, yr renovated, zipcode, latitude, longitude, sqft living15, and sqft lot15. We can see that these variables include categorical variables, numerical variables, and time-series variables. The dependent variable is the sale price of houses from 2006 to 2010 in Iowa, USA.
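A quick way to confirm the reported dimensions after downloading the Kaggle file (the file name is an assumption):

    import pandas as pd

    df = pd.read_csv("train.csv")  # assumed download location

    # The paper reports 1460 entries and 80 explanatory features
    # (81 columns including the sale price).
    print(df.shape)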
IV. RESEARCH METHODOLOGY

Fig. 1. Diagram of the research flow.

A. Regression Analysis

The regression models are implemented using scikit-learn. We aim to measure the performance of each model and compare it with the other models. For validation purposes we use cross-validation; a comparison sketch follows the list below. The different linear regression models used are:

• Ordinary least squares
• Lasso
• Ridge
• Elastic Net
• Random Forest
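A minimal comparison harness for this setup, assuming X and y are the preprocessed feature matrix and log-transformed target from Section II-B; the fixed regularization strengths are placeholder assumptions, since the CV variants discussed below choose them automatically:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    models = {
        "ols": LinearRegression(),
        "ridge": Ridge(alpha=1.0),
        "lasso": Lasso(alpha=0.001),
        "elastic net": ElasticNet(alpha=0.001, l1_ratio=0.5),
        "random forest": RandomForestRegressor(n_estimators=500, random_state=0),
    }

    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for name, model in models.items():
        # Scaling inside the pipeline keeps train/test scaling independent,
        # as required by the preprocessing section.
        pipe = make_pipeline(StandardScaler(), model)
        scores = cross_val_score(pipe, X, y, cv=cv,
                                 scoring="neg_root_mean_squared_error")
        print(f"{name}: RMSE = {-scores.mean():.4f}")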
1) Ordinary Least Squares: Least squares error is a well-known mathematical measure of the performance of a linear regression model. It works by adjusting the coefficients of the model so as to minimize the sum of squares between the true values and the predicted values. It solves a problem of the form (1):

    \min_w \|Xw - y\|_2^2    (1)

which can be solved analytically by equation (2):

    \hat{\beta} = (X^T X)^{-1} X^T y    (2)

where X is the matrix of independent features, y is the actual response, and \hat{\beta} holds the estimated weights w.
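A toy check of equation (2) on synthetic data (entirely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    # Equation (2) verbatim: beta_hat = (X^T X)^{-1} X^T y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)  # close to [2.0, -1.0, 0.5]

    # In practice np.linalg.lstsq (or sklearn's LinearRegression) is the
    # numerically safer route to the same minimizer.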
2) Ridge: Ridge regression addresses some of the problems of ordinary least squares by imposing a penalty on the size of the coefficients. The new objective, a penalized residual sum of squares, is (3):

    \min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2    (3)

Here, \alpha \ge 0 is called the regularization parameter (L2) and controls the amount of shrinkage: the higher the value of alpha, the bigger the penalty, and therefore the more the magnitudes of the coefficients are reduced. It is worth noting that we used RidgeCV, which performs an implicit leave-one-out cross-validation to choose the best alpha.
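A sketch of that selection step, with an assumed alpha grid and assumed X_train, y_train from the earlier split:

    import numpy as np
    from sklearn.linear_model import RidgeCV

    # RidgeCV's default mode runs an efficient leave-one-out cross-validation
    # over the candidate alphas; the grid below is an assumed choice.
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
    ridge.fit(X_train, y_train)
    print("chosen alpha:", ridge.alpha_)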
3) Lasso: The mathematics behind lasso regression is quite similar to that of ridge; the only difference is that instead of adding the squares of the weights, we add their absolute values: (4)

    \min_w \frac{1}{2 n_{\mathrm{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1    (4)

Lasso is a linear model that estimates sparse coefficients, i.e., it reduces the number of variables upon which the given solution depends. It performs a kind of feature selection and thus leads to a less complex final model. For instance, when there are correlated features it will choose one and set the coefficient of the other to zero. The regularization parameter alpha (L1) controls the degree of sparsity of the estimated coefficients, and we once more employed the version of the algorithm that automatically chooses the best value.

4) Elastic Net: Elastic Net is a hybrid method that trains with both L1 and L2 as regularizers. This combination allows it to learn a sparse model in which few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge. It is useful when there are multiple features that are correlated with one another: Lasso is likely to pick one of these at random, while Elastic Net is likely to pick both. The objective function to minimize is in this case (5):

    \min_w \frac{1}{2 n_{\mathrm{samples}}} \|Xw - y\|_2^2 + \alpha\rho \|w\|_1 + \frac{\alpha(1-\rho)}{2} \|w\|_2^2    (5)
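The automatic alpha selection mentioned for both models, sketched with assumed training data; the cv and l1_ratio settings are assumptions:

    import numpy as np
    from sklearn.linear_model import ElasticNetCV, LassoCV

    lasso = LassoCV(cv=5).fit(X_train, y_train)
    print("lasso alpha:", lasso.alpha_,
          "| zeroed coefficients:", int(np.sum(lasso.coef_ == 0)))

    enet = ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.9]).fit(X_train, y_train)
    print("enet alpha:", enet.alpha_, "| l1_ratio:", enet.l1_ratio_)

Counting the zeroed coefficients makes the sparsity claim above directly observable.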

5) Random Forest: Random Forest is one of the most versatile and precise classifiers. It does not need the data to be scaled and can deal with any number of features. In order to test the random forest algorithm a little further than the others, we decided to train it on different training sets: first on a dataset containing all the features with no scaling, and second on the reduced training set with scaled features.

As the results below show, it did well in both cases. This test was also a useful validation of our set of selected features: they appear to be good choices, since the performance using the full feature set is not significantly better than using the restricted set.

After some experimentation, the configuration we chose was: 500 trees, as any greater value did not enhance the accuracy; MSE as the function to measure the quality of a split; max depth set to None, which means the nodes are expanded until all leaves are pure or until all leaves contain fewer than min samples split samples; max features = auto, to consider all the features when looking for the best split; and bootstrap = True, for sampling with replacement, as it gave better results during the tests. All the other parameters were kept at their default values.
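The reported configuration in scikit-learn terms; note that the parameter names have shifted across releases (the paper's era used criterion="mse" and max_features="auto"), so the spelling below follows recent versions:

    from sklearn.ensemble import RandomForestRegressor

    # 500 trees, MSE split criterion, unlimited depth, all features
    # considered per split, bootstrap sampling with replacement.
    forest = RandomForestRegressor(
        n_estimators=500,
        criterion="squared_error",
        max_depth=None,
        max_features=None,   # consider all features at each split
        bootstrap=True,
        random_state=0,
    )
    forest.fit(X_train, y_train)  # works on unscaled features as well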

B. Classification

1) SVM: SVM is a large-margin classifier. The rationale behind having decision boundaries with large margins is that they tend to have a lower generalization error, whereas models with small margins are more prone to overfitting. We can control the width of the margin using the regularization parameter C: larger values of C give smaller margins, and vice versa.

We used the linear kernel, as it gave by far the best result when compared with the rbf or sigmoid kernels. This is a sign that a linear fit suits the true data well, and it can also explain why no further regularization was necessary. The best value for C was 1. We kept the other parameters at their default values.
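Since the text scores this model with RMSE (Section V), the regression variant of the SVM is the closest scikit-learn analogue; a sketch with assumed scaled training data:

    from sklearn.svm import SVR

    # Linear kernel and C=1 as reported; rbf and sigmoid did worse per the text.
    svm = SVR(kernel="linear", C=1.0)
    svm.fit(X_train_scaled, y_train)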
C. Testing Methods

The models developed in this research are tested using the Root Mean Square Error (RMSE). RMSE measures predictive performance by considering the prediction error on each data point. The RMSE formula is given as (6):

    \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - p_i)^2}    (6)
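Equation (6) as a small helper function, with d for the actual values and p for the predictions, following the notation above:

    import numpy as np

    def rmse(d, p):
        """Equation (6): root mean square error between actual d and predicted p."""
        d, p = np.asarray(d), np.asarray(p)
        return np.sqrt(np.mean((d - p) ** 2))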
V. EXPERIMENT AND RESULT

The experiments examine the parameters of each model described in Section IV, such as the regularization strength of the linear models and the number of trees of the random forest.

The results generated by the models are judged on the training data, which makes up 80% of the total dataset, and we compare all the models on the basis of RMSE, computed on the normalised (log-transformed) target. The minimum RMSE value recorded is given by elastic net, where the value is:

Average RMSE: 0.12379041525319807

Fig. 2. Comparison of RMSE across the models.

All the other regression methods give fairly similar performance. Hence, it can be seen that regularisation is not a big issue for this dataset.

When the classification model SVM is introduced, the RMSE increases, which indicates that classification is less suitable than regression for house price prediction.
VI. CONCLUSION

Data preprocessing has proven to be a crucial part of our work; for instance, addressing the non-linearity problem with a log transformation improved the performance dramatically. Moreover, removing the outliers also yielded better results. Encoding the features according to their type (nominal, ordinal, and numerical) was also critical to our work. One way of improving our results would be to create an ordinal version of the location, because, as we know, location is quite an important factor in most housing prices. We could also improve our model by doing more feature engineering, as we did when we created TotalSF, or by applying further useful transformations, such as the Box-Cox or log transformation, to other variables to reduce their variability. There is no single model for house price prediction, and accuracy may vary with the dataset; for this given Iowa dataset, the minimum RMSE is recorded for elastic net and the maximum RMSE for random forest regression.
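A minimal sketch of the Box-Cox idea floated above (the dataframe and column name are assumed; Box-Cox requires strictly positive values):

    from scipy.stats import boxcox

    # Returns the transformed values and the fitted lambda parameter.
    transformed, lam = boxcox(df["GrLivArea"])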
