
A

Mini Project Report


on

“House Price Prediction using area of houses in Monroe, New Jersey”
submitted in partial fulfilment of the requirements

of the degree of

Bachelor of Technology – B.Tech ITDS

by
Abhishek Awhale
URN NO: 2019-B-24072001B

Under the Guidance of


Siddharth Nanda

December 2021
School of Engineering
Ajeenkya D Y Patil University, Pune

Declaration of Originality

I, Abhishek Awhale, URN 2019-B-24072001B, hereby declare that this project entitled “House
Price Prediction using Area of House in Monroe, New Jersey” presents my original work carried
out as a bachelor student of School of Engineering, Ajeenkya D Y Patil University, Pune,
Maharashtra. To the best of my knowledge, this project report contains no material previously
published or written by another person, nor any material presented by me for the award of any
degree or diploma of Ajeenkya D Y Patil University, Pune or any other institution. Any
contribution made to this research by others, with whom I have worked at Ajeenkya D Y Patil
University, Pune or elsewhere, is explicitly acknowledged in the project report. Works of other
authors cited in this project report have been duly acknowledged under the sections “Reference”
or “Bibliography”. I also declare that I have adhered to all principles of academic honesty and
integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my
submission.

I am fully aware that in case of any non-compliance detected in future, the Academic Council of
Ajeenkya D Y Patil University, Pune may withdraw the degree awarded to me on the basis of the
present project report.

Date: 09th December 2020


Place: Lohegaon, Pune
Abhishek Awhale

Acknowledgement

I remain immensely obliged to Prof. Siddharth Nanda, Project Supervisor, for providing
me with the idea for this topic, for his invaluable support in gathering resources, whether
information or computers, and for his guidance and supervision, which made this
Internal Project happen.

I would like to thank Prof. Siddharth Nanda, Program Coordinator B.Tech, and Dr.
Biswajeet Champaty, Head of Department for their invaluable support.

I would like to say that working on this Internal Project has indeed been a fulfilling
experience.

Abhishek Awhale

Dec 2021

CERTIFICATE

This is to certify that the project entitled “House Price Prediction using Area of
House in Monroe, New Jersey” is a bonafide work of “Abhishek Awhale” (Roll
No.01) submitted to the Ajeenkya D Y Patil University, Pune in partial fulfillment
of the requirement for the award of the degree of “Bachelor of Technology (B.Tech)
in Information Technology in Data Science”.

Prof. Siddharth Nanda


Project Supervisor

Dec 2021

Supervisor’s Certificate

This is to certify that the project entitled “House Price Prediction using Area of
House in Monroe, New Jersey” submitted by Abhishek Awhale, URN: 2019-B-24072001B,
is a record of original work carried out by him/her under my supervision
and guidance in partial fulfillment of the requirements of the degree of Bachelor of
Technology (B.Tech) at School of Engineering, Ajeenkya D Y Patil University,
Pune, Maharashtra 412105. Neither this project report nor any part of it has been
submitted earlier for any degree or diploma to any institute or university in India or
abroad.

Prof. Siddharth Nanda


Project Supervisor

Abstract

The House Price Index (HPI) is commonly used to estimate changes in housing prices. Since housing
price is strongly correlated with other factors such as location, area, and population, predicting an
individual house price requires information beyond the HPI.
A considerable number of papers have adopted traditional machine learning approaches to predict
housing prices accurately, but they rarely examine the performance of individual models and
neglect the less popular yet complex models.
As a result, to explore the impact of features on prediction methods, this paper applies both
traditional and advanced machine learning approaches and investigates the differences among several
advanced models. It also validates multiple techniques for implementing regression models and
provides an optimistic result for housing price prediction.

Keywords: Housing Price Prediction; Linear Regression; Machine Learning; Stacked Generalization

TABLE OF CONTENTS

DECLARATION OF ORIGINALITY
ACKNOWLEDGEMENT
CERTIFICATE
SUPERVISOR’S CERTIFICATE
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

CHAPTER 1: INTRODUCTION
1.1 Introduction
CHAPTER 2: CODE IMPLEMENTATION
CHAPTER 3: RESULTS AND VISUALIZATION
CHAPTER 4: CONCLUSION
CHAPTER 5: REFERENCES

❖ List of Figures

• Fig1: Importing packages
• Fig2: Data reading
• Fig3: Describe data
• Fig4: Prices vs area
• Fig5: Regression Model
• Fig6: Data Training
• Fig7: Predicting Accuracy

❖ List of Tables

• Table 1: Abbreviation Table
• Table 2: Library and their uses

List of Abbreviations

MSE    Mean Squared Error
MAE    Mean Absolute Error
LR     Linear Regression

Table 1: Abbreviation Table

1. INTRODUCTION

1.1 INTRODUCTION

In this notebook, we learn how to use scikit-learn to implement simple linear regression. We
download a dataset of house prices and their areas. Then we split the data into training and
test sets, create a model using the training set, evaluate the model using the test set, and
finally use the model to predict unknown values.

We have downloaded a home prices dataset, homeprices.csv, which contains the prices of houses
located in Monroe, New Jersey, together with their areas.

Train/test split involves splitting the dataset into training and testing sets, which are
mutually exclusive. You then train with the training set and test with the testing set. This
provides a more accurate evaluation of out-of-sample accuracy, because the testing data is not
part of the data used to train the model. It is more realistic for real-world problems.

This means that we know the outcome of each data point in the testing set, making it great to test
with! And since this data has not been used to train the model, the model has no knowledge of the
outcome of these data points. So, in essence, it is truly out-of-sample testing.
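
The split described above can be sketched as follows. This is a minimal illustration and not the project's own code: the areas and prices are randomly generated stand-ins for homeprices.csv, and the 80/20 split and the `random_state` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical areas and prices, generated for illustration only
# (these are NOT the values in homeprices.csv).
rng = np.random.default_rng(0)
area = rng.uniform(1000, 4000, size=40).reshape(-1, 1)
price = 150.0 * area.ravel() + 100000.0 + rng.normal(0, 20000, size=40)

# Hold out 20% of the rows; the two sets are mutually exclusive.
X_train, X_test, y_train, y_test = train_test_split(
    area, price, test_size=0.2, random_state=42
)

# Train on the training set only, then measure the error on unseen rows.
model = LinearRegression().fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(len(X_train), len(X_test), round(test_mae, 2))
```

Because the test rows never reach `fit`, `test_mae` estimates how the model behaves on houses it has not seen.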

Dataset source

• AREA e.g. 2600
• PRICE e.g. 550000

Linear regression is the most basic regression algorithm: it makes predictions by computing a
weighted sum of the input features and adding a bias term.

A simple linear regression is just the equation of a line,

$y = \theta_0 + \theta_1 x$

$y$: is the dependent variable, what the model will predict, in this case the house price.
$x$: is the independent variable, what the model will use to predict $y$, in this case the area of the house.

$\theta_0$: is the intercept, the point where the line crosses the price axis.

$\theta_1$: is the slope of the model.

The library scikit-learn provides a linear model which calculates the values of $\theta_0$ and $\theta_1$
(available as reg.intercept_ and reg.coef_ in the code of Chapter 2).
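
To make the intercept and slope concrete, the sketch below computes them by hand with the ordinary least-squares formulas for a single feature. The four (area, price) pairs are hypothetical and chosen to lie exactly on a line so the result is easy to verify; the variable and function names are illustrative only.

```python
# Hypothetical (area, price) pairs that lie exactly on a line,
# chosen so the result can be checked by hand.
areas = [1000.0, 2000.0, 3000.0, 4000.0]
prices = [200000.0, 300000.0, 400000.0, 500000.0]

n = len(areas)
mean_area = sum(areas) / n
mean_price = sum(prices) / n

# slope = sum of co-deviations / sum of squared area deviations
num = sum((a - mean_area) * (p - mean_price) for a, p in zip(areas, prices))
den = sum((a - mean_area) ** 2 for a in areas)
slope = num / den
intercept = mean_price - slope * mean_area


def predict(area):
    """Evaluate the fitted line at a given area."""
    return intercept + slope * area


print(slope, intercept, predict(3300.0))
```

For this toy data the slope is 100 per unit of area and the intercept is 100000, so an area of 3300 maps to a price of 430000. A fitted scikit-learn `LinearRegression` evaluates the same kind of line using its `coef_` and `intercept_` attributes.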

S. No.   Used For                  Tools
1        For Data Visualization    matplotlib
2        For Data Analysis         pandas
3        For Numerical Operation   NumPy

Table 2: Library and their uses

2. CODE IMPLEMENTATION

#Importing Libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

#using scikit
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error as mae

# Data Prep
df = pd.read_csv('C:/Users/Abhishek/OneDrive/Desktop/csv/homeprices.csv')
df

#Looking at the data (summarization)

# plot price vs Area
%matplotlib inline
plt.xlabel('area')
plt.ylabel('price')
plt.scatter(df.area,df.price,color='red',marker='+')


new_df = df.drop('price',axis='columns')
new_df

price = df.price
price

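#Linear Regression
reg = linear_model.LinearRegression()
reg.fit(new_df,price)   # new_df holds the feature (area), price is the target

reg.predict([[3300]])   # predicted price for a house with area 3300
reg.coef_               # slope of the fitted line
reg.intercept_          # intercept of the fitted line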
#Generate CSV file with list of home price predictions
area_df = pd.read_csv("C:/Users/Abhishek/OneDrive/Desktop/csv/areas.csv")
area_df

p = reg.predict(area_df)
p

area_df['prices']=p
area_df

area_df.to_csv("C:/Users/Abhishek/OneDrive/Desktop/prediction.csv")

#to show how my linear equation line looks


%matplotlib inline
plt.xlabel('area',fontsize=20)
plt.ylabel('price',fontsize=20)
plt.scatter(df.area,df.price,color="red",marker='+')
plt.plot(df.area,reg.predict(df[['area']]),color='blue')

#check MAE and MSE
from sklearn.metrics import mean_squared_error as mse

predicted = reg.predict(new_df)     # model predictions for the data it was fitted on
mae(price, predicted)               # mean absolute error (actual vs predicted price)
mse(price, predicted)               # mean squared error
np.sqrt(mse(price, predicted))      # root mean squared error
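
As a hand check on what these metrics measure, the sketch below computes MAE, MSE and RMSE directly from their definitions. The actual and predicted prices are hypothetical values for illustration, not output of the model above. Both metrics compare actual prices with predicted prices, never the input areas with the prices.

```python
import math

# Hypothetical actual and predicted prices, for illustration only.
actual = [550000.0, 565000.0, 610000.0, 680000.0]
predicted = [540000.0, 575000.0, 600000.0, 700000.0]

errors = [a - p for a, p in zip(actual, predicted)]      # residuals
mae_value = sum(abs(e) for e in errors) / len(errors)    # mean absolute error
mse_value = sum(e ** 2 for e in errors) / len(errors)    # mean squared error
rmse_value = math.sqrt(mse_value)                        # root MSE, in price units
print(mae_value, mse_value, rmse_value)
```

Here the MAE is 12500 and the MSE is 175000000. The RMSE, about 13228.76, is back in the same units as the price, which makes it easier to interpret than the MSE.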

3. RESULTS AND VISUALIZATION

Importing Needed packages

Fig1: Importing packages

Reading the data in

Fig2: Data reading

Plot prices with respect to Areas

Fig4: Prices vs area

Linear Regression Model

Fig5: Regression Model

Train data distribution

Fig6: Data Training

Computing MAE and MSE

Fig7: Predicting Accuracy

4. CONCLUSION

It is always good practice to visualize the data to see whether there is a linear tendency,
because doing so can reveal that the data is not linear after all. Here the data was
linear, so using linear regression was a good choice. The differences found between the three
methods show that using more data to predict a value is usually wise, but using a lot of data can
also lead to overfitting of the model, so one must be careful. In this case the model does not show
any sign of overfitting, because the MAE values kept decreasing with the new data while the value
was increasing, so there was no discrepancy. With the last model we can say
that the fit is quite representative and the model fits the data well.
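
The visual check for a linear tendency can be complemented with a simple numeric one. The sketch below is an illustration under assumptions: the five (area, price) pairs are hypothetical, and the Pearson correlation and residual inspection are common checks rather than steps taken in this project.

```python
import numpy as np

# Hypothetical areas and prices, for illustration only.
area = np.array([2600.0, 3000.0, 3200.0, 3600.0, 4000.0])
price = np.array([550000.0, 565000.0, 610000.0, 680000.0, 725000.0])

# Pearson correlation: values near +1 or -1 suggest a strong linear trend.
r = np.corrcoef(area, price)[0, 1]

# Residuals of a straight-line fit; a clear pattern in them
# (for example a curve) would hint that the relation is not linear.
slope, intercept = np.polyfit(area, price, 1)
residuals = price - (slope * area + intercept)
print(round(r, 3), residuals.round(0))
```

A correlation close to 1 together with residuals that scatter around zero without a visible pattern supports the choice of a linear model.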

We can then move to a different dataset, since home prices depend mostly linearly on their
independent parameters. We explore different kinds of curves, viz. exponential etc., and
try to find the best-fitting curve to determine home prices.
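
One way to compare candidate curves is to fit each of them to the same data and compare their errors. The sketch below is a hedged illustration, not part of the project code. The data values are hypothetical, and the exponential curve is fitted as a straight line in log(price), which is a common shortcut that minimizes the error in log space rather than in price.

```python
import numpy as np

# Hypothetical areas and prices, for illustration only.
area = np.array([2600.0, 3000.0, 3200.0, 3600.0, 4000.0])
price = np.array([550000.0, 565000.0, 610000.0, 680000.0, 725000.0])

# Straight line: price = intercept + slope * area
slope, intercept = np.polyfit(area, price, 1)
linear_pred = intercept + slope * area

# Exponential curve: price = a * exp(b * area), fitted as a line in log(price)
b, log_a = np.polyfit(area, np.log(price), 1)
exp_pred = np.exp(log_a) * np.exp(b * area)

# Compare the two candidates by mean squared error on the same data
linear_mse = np.mean((price - linear_pred) ** 2)
exp_mse = np.mean((price - exp_pred) ** 2)
best = "linear" if linear_mse <= exp_mse else "exponential"
print(best, linear_mse, exp_mse)
```

With only a handful of points, such a comparison is indicative at best. Evaluating both curves on held-out data would give a fairer picture.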

The analysis is done using the Python scikit-learn library in Jupyter notebooks. The accuracy of
each model is verified using the residual MSE and the mean absolute error (MAE).

• First we import the essential packages needed to perform linear regression.
• Then we train the model on the dataset using scikit-learn and assess the accuracy of its
predictions, presenting the fit in the form of a regression line.

5. REFERENCES

1. YouTube

2. GitHub

