S2 PDF

ABSTRACT
This project provides us an overview on how

to predict house prices using various
machine learning models with the help of
different Python libraries. This proposed
model considers as the most accurate
HOUSE PRICE model used for calculating the house price

and provides the most accurate prediction.
Project Synopsis
PREDICTION MCSP-232 (MCA_NEW)
Using Machine Learning

IGNOU Version1.1 Project Synopsis
Table of Contents
1. Project Title --------------------------------------------------------------------------------------------- 2
2. Introduction ------------------------------------------------------------------------------------------- 2
2.1. Objectives ------------------------------------------------------------------------------------ 3
3. Project category ------------------------------------------------------------------------------------ 4
4. Tools/Platform ------------------------------------------------------------------------------------ 4
4.1. Software platform --------------------------------------------------------------------------- 4
4.2. Hardware platform --------------------------------------------------------------------------- 4
4.3. Tools ---------------------------------------------------------------------------------------------- 4
5. Problem Statement ---------------------------------------------------------------------------------- 5
5.1. Project Description -------------------------------------------------------------------------- 5
6. Scope of Work ----------------------------------------------------------------------------------------- 7
6.1. Advantages of project ----------------------------------------------------------------------- 7
7. Analysis --------------------------------------------------------------------------------------------------- 7
7.1. DFD ----------------------------------------------------------------------------------------------- 8
7.2. Architectural Diagram ----------------------------------------------------------------------- 9
8. Overall Architecture -------------------------------------------------------------------------------- 10
i. Project Pipeline -------------------------------------------------------------------------- 10
ii. Data Cleaning ----------------------------------------------------------------------------- 10
iii. EDA ------------------------------------------------------------------------------------------ 11
iv. Feature Engineering -------------------------------------------------------------------- 12
v. Machine Learning ------------------------------------------------------------------- 12
vi. Model Selection ---------------------------------------------------------------------------- 13
vii. Training -------------------------------------------------------------------------------- 15
viii. Evaluation ------------------------------------------------------------------------------- 16
9. Limitations of the project --------------------------------------------------------------------------- 18
10. Further Enhancement -------------------------------------------------------------------------------- 19
10.1. Conclusion -------------------------------------------------------------------------------------- 19
11. Bibliography --------------------------------------------------------------------------------------------- 20
Page 1 of 20
1. PROJECT TITLE
This project titled as “HOUSE PRICE PREDICTION” is a machine learning
based project that can be used to predict the most accurate price of a
property in a very efficient, secured and fastest manner.
2. INTRODUCTION
One of the basic requirements of livelihood in the recent world is to buy a
house of your own.
The price of house may depend on various sectors. Real estate agents and
manhole are involved in selling a house want a tag price on the house
which would be the real worth of buying the house. The prediction of the
price of the house is often very hard for the inexperienced.
House is a fundamental need to someone and them costs range from place
to place primarily based totally at the facilities to be had like parking area,
location, etc. The residence price is a factor that issues a lot of citizens
either wealthy or white-collar magnificence as possible by no means decide
or gauge the value of a residence primarily based totally on place or places
of work accessible. Purchasing a residence is the best and unique desire of
a own circle of relatives because this expends the whole thing of their
funding budget and every now and then cover them under loan. It is a hard
challenge for expecting the correct value of residence price. This proposed
version could make viable who are expecting the precise costs of house.
This model of Linear regression in machine learning takes the internal
factor of house valuation dependencies like area, no of bedrooms, locality
etc. and external factors like air pollution and crime rates.
Here in this project, we are going to machine learning to build a predictive
model for estimation of the house price for real estate customers. In this
project we are going to build the Machine learning model by using python
programming and other python tools like NumPy, pandas, matplotlib etc.
We are also using scikit-learn library in our approach of this project. For
going further into this project, we prepare dataset which consists of features
like no of bedrooms, area of house, locality, etc. After preparing the dataset
we will use 80 percentage of data for training the ml model and 20
percentage of data for testing the ml model.
Page 2 of 20
Buying a house is one of the most important decisions for an individual. The
price of a house is dependent on wide factors which are its features such
as no of bedrooms, area of construction, locality of the house etc. and
many external factors such as air pollution and crime rate. These all factors
make the prediction of price of the house more complex. This house price
prediction is beneficial for many real estates. So, an easier and accurate
method is needed for the prediction of house price. This project is proposed
to predict the price of a house for real estate customers based on their
preferences like no of rooms, no of bedrooms, area of the house, locality
etc. This machine learning model to predict the house price will be very
helpful for the customers, because it is being the major issue for the people,
as they can never estimates the price of the house on the basis of locality
or other features available.
2.1. OBJECTIVES
This project provides us an overview on how to predict house prices using
various machine learning models with the help of different Python libraries.
This proposed model considers as the most accurate model used for
calculating the house price and provides the most accurate prediction. This
provides a brief introduction which will be needed to predict the house
price. This project consists of what and how the house price model works
with the assistance of machine learning technique using scikit - learn and
which data sets will be using in our proposed model.
Predicting the price of a house helps to determine the selling price of house
in a particular region and it help people to find correct time to buy a house.
In this task on house price prediction using machine learning, our task is to
use data to create a machine learning model to predict houses in the given
region. By using real world data entities, we are going to predict the price of
the house in that area.
Page 3 of 20
3. PROJECT CATEGORY
Machine Learning (ML), is a type of artificial intelligence (AI) that allows
software applications to become more accurate at predicting outcomes
without being explicitly programmed to do so.
ML project is journey. This journey typically involves an agile process of
data discovery, feasibility study, building a minimum viable model (MVM)
and finally deploying that model into
prediction.
4. TOOLS AND PLATFORM

a) Software Platforms:-
The software platforms used in this project are Anaconda Navigator,
Jupyter Notebook and Spyder IDE.
b) Hardware Platforms:-
i) Minimum 7th Generation Intel® Core i7 Processors.
ii) 1CB RAM.
iii) 512GB disk space.
iv) A minimum of 256 GB storage.
v) High Definition or FHD display.
c) Tools: -
(i) Language : - I used Python for the project. Jupyter notebooks
are popular among data scientist because they are easy to follow
and show your working steps.
Page 4 of 20
(ii) Libraries: - There are frameworks in python to handle commonly
required tasks. I implore any budding data scientists to familiarise
themselves with these libraries :
a) Pandas : - for handling structured data
b) Scikit Learn : - For machine learning
c) Numpy :- For linear algebra and mathematics
d) Seaborn and Matplotlib :- For data visualisation
5. PROBLEM STATEMENT
 The prices of real estate properties are sophisticatedly linked with our
economy. Despite this, we do not have accurate measures of house
prices based on the vast amount of data available.
 Proper and justified prices of properties can bring in a lot of
transparency and trust back to real estate industry which is very
important for most customers especially in India.
5.1. PROJECT DESCRIPTION
Domain Problem to ML Problem
Data Collection and Preparation
Training
Deploying
Page 5 of 20
 Domain problem to ML problem: -
1. Domain Problem: In this project we are going to predict the price of
a house for real estate customers based on their preferences like
no of bedrooms, area of the house, locality etc.
2. ML problem: This problem can be solved using supervised
learning in ML. It is treated as a Regression problem as the target
is continuously valued.
 Data collection and preparation: - This phase consists of three parts

they are: -
1. Data collection
2. Data pre-processing
3.EDA and feature Engineering
 Data collection:-In order to proceed further into this project, first of all
we should collect the data, after collecting a dataset, then we should
pre-process the data and then we do the exploratory data analysis on
the given data set.
 Data pre-processing:- This pre-processing of data consists of

Handling missing values which means some of values in the dataset
may be missing. We should handle those missing values. Handling
duplicates which mean some of the values in the dataset may be their
multiple times we should handle those duplicate values that means
we should delete those duplicates. Handling outliers, it consists of
identifying the outliers and hence removing those outliers.
 EDA and feature Engineering: - By using matplotlib package in

python we will do EDA on a given dataset. Normally we will do EDA
on a given data set to establish a relation between given feature of
the data set.
 Training: - This phase consists of 4 parts they are:-

1. Choosing a machine learning algorithm
2.Training
3. Evaluation 4.Hyperparameter tuning
 Deploying: - Deploying a machine learning model involves:-

1.Interpreting the results
2.Packaging and shipping
3.Maintenance and monitoring
Page 6 of 20
6. SCOPE OF WORK
The scope of Machine Learning is not limited to the investment sector. Rather,
it is expanding across all fields such as banking and finance, information
technology, media & entertainment, gaming, and the automotive industry. As
the Machine Learning scope is very high, there are some areas where
researchers are working toward revolutionizing the world for the future.
6.1. ADVANTAGES OF PROJECT

i) As earlier, House prices were determined by calculating the acquiring
and selling price in a locality. Therefore, the House Price prediction
model helps in filling the information gap and improve Real Estate
efficiency. With this model, we would be able to better predict the
prices.
ii) House price prediction can help the developer to determine the
selling price of a house and can help the customer to arrange the
right time to purchase a house.
iii) In the contemporary digital era, users will be able to retrieve the
useful information of the society from a wide variety of sources and
stored in the form of structured, unstructured and semi-structured
formats.
7. ANALYSIS
The demand for commercial housing is generally divided into self-occupied
demand, investment demand, and speculative demand. Self-occupation
demand is self-occupation; investment demand is to buy commercial
housing and rent it out to obtain rental income; speculative demand buys
commercial housing in anticipation of rising house prices and sells it after
the price rises; and the purpose is to earn the price difference. The demand
for self-occupation and investment is the supporting force of the commercial
housing market. In particular, the demand for self-occupation has been
encouraged and supported by policies. In addition to driving up housing
prices, speculative demand squeezes out part of the demand for owner-
occupiers and blows up the housing market bubble, which is harmful to the
commercial housing market. This article combines the generalized linear
regression model to build a real estate price prediction model. Through the
simulation and comparison experiments, it can be seen that the housing
price forecasting system based on the generalized regression model
proposed in this article has a high housing price forecasting accuracy.
Page 7 of 20
7.1. DATA FLOW DIAGRAM (DFD)
START
DATASET
TRAIN DATASET TEST DATASET
MODEL SELECTION MODEL BUILDING
MODEL DEPLOYMENT
Page 8 of 20
7.2. ARCHITECTURAL DIAGRAM
Phase I:
Data in real time processes Data from Batch Processes
Phase II:
Data
Store
Phase III:
Data Feature
Transformation Model Engine
Generation
Phase IV:
Performance
Model Output
Tuning
Page 9 of 20
8. OVERALL ARCHITECTURE:-
i. Project Pipeline :- Machine learning project follows the same
process : -
 Data ingestion
 Data cleaning
 Exploratory data analysis
 Feature engineering
 Machine Learning
The pipeline is not linear and we have to jump back and forth
between stages.
ii. Data Cleaning: -
The data we collect are from different sources such as internet,
industry papers, etc. It is quite possible that we get the data in a
raw form. This raw data consists of several errors, unnecessary
data etc. and we cannot give this kind of data directly to our ML
Model as this will affect its accuracy. So we have to clean the data
to make it accessible to ML model. There are many kinds of
mistakes/errors in a data. Some of these are: -
 Duplicates
 Nan’s etc.
Page 10 of 20
iii. Exploratory Data Analysis (EDA) : -

In this step we explore the quality of our clean data. This step come
under the Data Visualization. We can plot different types of graphs to
check the quality of our data such as: -
Box Plot, Histogram, Pie plot, Scatter plot etc.
Page 11 of 20
iv. Feature Engineering: -

Machine Learning Models cannot understand categorical data. So, we
have to apply transformations to convert the categories into
numbers. The best practice for doing this is via one hot encoding.
One Hot Encoder solves this problem with the options that can be
set for categories and handling unknowns. It’s slightly harder to
use but definitely necessary for machine learning.
v. Machine Learning: -
We follow a standard development cycle for machine
learning. Basically, we have to go through many iterations of
the cycle before we are able to get my model working to a
high standard.
Page 12 of 20
Select Machine
Evaluate Results
Learning Model
Training : Tune
Hyperparameters Set Hyper
with grid search parameter ranges
and cross for grid search
validation
Set number of
cross folds for
cross validations
vi. Model Selection: -

There are many types of models to reach the accuracy. As, in this
project we are doing supervised learning so I took basically four
models in action: -
 Decision Tree: - A tree algorithm used in machine
learning to find patterns in data by learning decision tools.
 Random Forest: - It is a type of bagging method that
plays upon ‘wisdom of crowd effect’. It uses multiple
independent decision trees in parallel to learn from data and
aggregates their predictions for an outcome.
 Gradient boosting machines: - A type of boosting method
that uses a combination of trees in a series. Each tree is used
to predict and correct the errors by preceding the tree
additively.
 Linear regression: - is a linear approach for modelling
the relationship between a scalar response and one or more
explanatory variables. Linear regression analysis is used to
predict the value of a variable based on the value of another
variable. The variable you want to predict is called the
dependent variable. The variable you are using to predict
the other variable's value is called the independent variable.
Page 13 of 20
Page 14 of 20
vii. Training: -
In machine learning, “training” refers to the process of teaching our
model using examples from our training data set. In the training stage,
we’ll tune our model hyperparameters.
 Hyperparameters: - Hyper parameters helps us adjust the
complexity of our model. There are some best practices on
what hyperparameters one should tune for each of the
models. Some of hyperparameters are: -
o max_depth — The maximum number of nodes for a
given decision tree.
o max_features — The size of the subset of features to

consider for splitting at a node.
o n_estimators — The number of trees used for boosting
or aggregation. This hyperparameter only applies to the
random forest and gradient boosting machines.
o learning_rate — The learning rate acts to reduce the
contribution of each tree. This only applies for gradient
boosting machines.
 Decision Tree — Hyperparameters tuned are the max_depth and

the max_features
 Random Forest — The most important hyperparameters to tune
are n_estimators and max_features .
 Gradient boosting machines — The most important
hyperparameters to tune are n_estimators, max_depth and
learning_rate .
Page 15 of 20
 Grid search: Choosing the range of

hyperparameters is an iterative process. When we
chose our possible hyperparameter ranges, grid
search allows us to test the model at every
combination of those ranges.
 Cross validation: It is a technique that takes the
entirety of your training data, randomly splits it into
train and validation data sets over 5 iterations.
Models are trained with a 5-fold cross validation.
You end up with 5 different training and validation
data sets to build and test your models. It’s a good
way to counter overfitting.
 Implementation: SciKit Learn helps us bring
together hyperparameter tuning and cross
validation with ease in using GridSearchCv. It
gives us options to view the results of each of
your training runs.
viii. Evaluation: -
This is the last step in the process. This is the decisive step of the
whole process. We can use data visualisation to see the results of
each of our candidate models. If we are not happy with our end
results, we have to revisit our process at any of the stages from
data cleaning to machine learning. Our performance metric will
be the negative root mean squared error (NRMSE).
Page 16 of 20
 Decision Tree: - Basically this was our worst

performing method. In best decision tree our
best score is -0.205 NRMSE.
Hyperparameters tuning in such case does not
make much difference to the models.
However these are trained under few seconds.
Definitely there are some scope to assess a
wider range of hyperparameters.
 Random Forest: - Our random forest model
marked an improvement on our decision tree
with a NRMSE of -0.144. The model took around
75 seconds to train.
 Gradient boosting: - This was our best
performer by a significant amount with a
NRMSE of -0.126. The hyperparameters
significantly impact the results illustrating that
we must be incredibly careful in how we tune
these more complex models. The model took
around 196 seconds to train.
Page 17 of 20
9. LIMITATIONS OF THE PROJECT :

i. This project cannot be trusted in real life scenario.
iii. This project doesn’t affect in accordance to external impacts
such as government policies, recession etc.
iv. Due to very high competition, it is very hard to make the
project profitable.
v. This project also does not show the impact on price in
accordance to different payment methods.
vi. As the accuracy of project is not 100%, a small miscalculation
can result to a major loss.
Page 18 of 20
10. FURTHER ENHANCEMENT

The supplementary feature that can be added to our proposed
system is to avail users of a full-fledged user interface so there can
be multiple functionalities for users to use with the ML model for
numerous locations. Lastly, developing a well-integrated web
application that can predict prices whenever users want it to will
complete the project.
10.1. CONCLUSION
Buying your own house is what every human wish for. Using this
proposed model, we want people to buy houses and real estate at
their rightful prices and want to ensure that they don't get tricked
by sketchy agents who just are after their money. Additionally, this
model will also help Big companies by giving accurate predictions
for them to set the pricing and save them from a lot of hassle and
save a lot of precious time and money. Correct real estate prices
are the essence of the market and we want to ensure that by using
this model. The system is apt enough in training itself and in
predicting the prices from the raw data provided to it. After going
through several research papers and numerous blogs and articles,
a set of algorithms were selected which were suitable in applying
on both the datasets of the model. Predicting the prices of different
houses with various features and was able to handle large sums of
data. The system is quite user-friendly and time-saving.
Page 19 of 20
11. BIBLIOGRAPHY
a. Python for Data Analysis book by O’Rielly
b. Hands on ML with scikit learn,Keras & Tensorflow by
O’Rielly
c. Machine Learning Yearning book by Andrew NG
d. Jain, N.., Kalra, P., &Mehrotra, D. (2020). Analysis of
Factoring Affecting Infant Mortality Rates Using
Decisions Tree . In Soft Computing: Theories and
Applications (pp. 639-656).Springers, Singapore.
e. Malhotra, R., & Sharma, A. (2018). Analyze Machine
Learning Techniques for Fault Prediction Using Web
Applications. Journal of Information Processing Systems
Page 20 of 20

S2 PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

S2 PDF

Uploaded by

Copyright:

Available Formats

ABSTRACT

This project provides us an overview on how

HOUSE PRICE model used for calculating the house price

Using Machine Learning

4. TOOLS AND PLATFORM

5.1. PROJECT DESCRIPTION

Domain Problem to ML Problem

Data Collection and Preparation

 Data collection and preparation: - This phase consists of three parts

 Data pre-processing:- This pre-processing of data consists of

 EDA and feature Engineering: - By using matplotlib package in

 Training: - This phase consists of 4 parts they are:-

 Deploying: - Deploying a machine learning model involves:-

6.1. ADVANTAGES OF PROJECT

7.1. DATA FLOW DIAGRAM (DFD)

TRAIN DATASET TEST DATASET

MODEL SELECTION MODEL BUILDING

7.2. ARCHITECTURAL DIAGRAM

iii. Exploratory Data Analysis (EDA) : -

Box Plot, Histogram, Pie plot, Scatter plot etc.

iv. Feature Engineering: -

vi. Model Selection: -

o max_features — The size of the subset of features to

 Decision Tree — Hyperparameters tuned are the max_depth and

 Grid search: Choosing the range of

 Decision Tree: - Basically this was our worst

9. LIMITATIONS OF THE PROJECT :

10. FURTHER ENHANCEMENT

You might also like