E20020-Machine Learning-Report
Chapter 1-Introduction
Problem Statement:
Problems faced when buying a house:
1. Buying a house is stressful.
2. Buyers are generally not aware of the factors that influence the
house price.
3. Buyers face many problems during the purchase.
4. Hence real estate agents are trusted with communication between
seller and buyer as well as with drawing up the legal contract. This
adds another person in between and increases the cost.
Problem Objective:
1. Our project is a machine learning model that estimates the price
of a home as accurately as possible from certain specifications of
the home.
2. The model uses information such as location, number of bedrooms,
etc.
3. The goal is to predict the sale price of each house.
Reference link:
https://www.ijitee.org/wpcontent/uploads/papers/v8i9/I7849078919.pdf
You can access the code here:
https://github.com/phani452/Machine-learning-project
Motivation
To predict home prices, I chose the housing price dataset sourced
from Kaggle. This dataset contains house sale prices for King County,
which includes Seattle, for homes sold between May 2014 and May 2015.
It contains many features that a model can learn from, and the
dataset can be downloaded from Kaggle.
Chapter – 2
Exploratory Data Analysis & Anomaly detection
We divided our data into three sets: train, validation and test.
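A minimal sketch of such a three-way split, assuming pandas. The 70/15/15 proportions, the random seed, the helper name three_way_split and the toy data are illustrative assumptions; the report does not state the split sizes it used.

```python
import pandas as pd

def three_way_split(df, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle df and cut it into train, validation and test parts.

    The fractions and seed are assumptions, not values from the report.
    """
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    return train, val, test

# Toy stand-in for the housing data (values are made up)
houses = pd.DataFrame({
    "id": range(100),
    "price": [100000 + 1000 * i for i in range(100)],
})
train, val, test = three_way_split(houses)
print(len(train), len(val), len(test))
```

Keeping the test part untouched until the end is what lets the final RMSE be read as an estimate on unseen data.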
Data Dictionary:
id - Unique ID for each home sold
lat - Latitude
long - Longitude
Variable Datatypes:
id – Nominal scale with numerical value
Null values:
There are no null values in our Data.
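One way such a check can be done with pandas is sketched below. The toy frame is an assumption and deliberately contains one missing value so that the check has something to report; on the report's data every count would be zero.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing latitude (made-up values)
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "lat": [47.5, 47.6, np.nan],
    "long": [-122.2, -122.3, -122.1],
})

# Number of missing values per column
null_counts = sample.isnull().sum()
# True if any column has at least one missing value
has_nulls = bool(null_counts.any())
print(null_counts)
print("any nulls:", has_nulls)
```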
EDA on Train Data:
Univariate analysis:
sqft_living:
From the above graphs we can infer that the distribution is right
skewed and there are outliers.
sqft_lot:
From the above graph we can infer that the distribution is highly
right skewed and there are outliers.
sqft_above:
From the above graph we can infer that sqft_above is right skewed
and there are outliers.
Price:
From the above graph we can infer that the price follows an
approximately exponential distribution, and we can see that some
houses have a low price and some houses have the maximum price.
Bedrooms:
From the above graph we can infer that most houses have three
bedrooms, followed by four bedrooms.
Bathrooms:
From the above graph we can infer that most houses have 2.5
bathrooms, followed by one bathroom.
Grade:
From the above graph we can infer that the largest number of houses
have grade 7, followed by houses with grade 8, and we can also infer
that as the grade decreases the house price also decreases.
Waterfront:
From the above graph we can infer that more than 95% of houses
do not have a waterfront.
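The skewness and outlier readings above come from plots; they can also be checked numerically. The sketch below assumes pandas and uses the common 1.5 x IQR rule, which is a swapped-in convention and not necessarily what the report used. The helper name iqr_outliers and the toy values are illustrative.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Toy living-area values with one very large house (made-up numbers)
sqft_living = pd.Series([900, 1100, 1200, 1400, 1500,
                         1700, 1900, 2100, 2400, 9000])

skewness = sqft_living.skew()      # positive value means right skew
outliers = iqr_outliers(sqft_living)
print("skewness:", skewness)
print("outliers:", outliers.tolist())
```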
Bivariate Analysis:
Sqft_living vs price:
From the above graph we can infer that sqft_living and price have a
linear relationship.
Log transformation of sqft_living vs price:
Sqft_above vs price:
From the above graph we can infer that sqft_above and price have a
linear relationship.
Zipcode vs price:
From the above graph we can infer that the sale price of a house also
depends on its location: as the location changes, the price varies.
The maximum sale price is at the location with zip code 98102.
Yrbuilt vs price:
From the above graph we can infer that the sale price depends on the
year the house was built. One might expect a newer house to have a
higher sale price, but surprisingly this is not the case: houses
built in 1910 have the maximum sale price compared to all other
years. We can also infer that old houses have higher sale prices
than new houses.
Grade vs price:
From the above graph we can infer that houses with grade 13 have the
maximum sale price compared to all other grades, and houses with
grade 3 have the lowest sale price.
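A comparison like this can be reproduced with a group-by. The sketch below assumes pandas and uses the mean price per grade; the report does not say whether its plot shows the mean, the median or the maximum, so the aggregation is an assumption, as are the toy values.

```python
import pandas as pd

# Toy sales (made-up numbers); column names follow the report
sales = pd.DataFrame({
    "grade": [7, 7, 8, 13, 3],
    "price": [400000, 450000, 600000, 3000000, 150000],
})

# Average sale price for each grade
mean_by_grade = sales.groupby("grade")["price"].mean()
top_grade = mean_by_grade.idxmax()   # grade with the highest mean price
low_grade = mean_by_grade.idxmin()   # grade with the lowest mean price
print(mean_by_grade)
print("highest:", top_grade, "lowest:", low_grade)
```

The same pattern applies to bedrooms, bathrooms, floors or zipcode by changing the group-by column.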
Bedrooms vs price:
From the above graph we can infer that houses with six bedrooms
have the maximum sale price and houses with 11 bedrooms have the
minimum. We can also see that there is a house with 33 bedrooms,
which is an anomaly.
Bathrooms vs price:
From the above graph we can infer that houses with 8 bathrooms
have the maximum sale price and houses with 0.5 bathrooms have the
minimum price.
Floors vs price:
From the above graph we can infer that houses with 2.5 floors have
the maximum sale price and houses with 1.5 floors have the minimum
sale price.
Anomaly detection:
From the above graph we can say that the house with 33 bedrooms
is an anomaly.
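A minimal sketch of flagging such a record, assuming pandas. The cutoff MAX_PLAUSIBLE_BEDROOMS is a hypothetical threshold chosen for illustration, and the report does not state whether the row was dropped, corrected or kept; dropping it here is only one option.

```python
import pandas as pd

# Toy records (made-up values) including a 33-bedroom entry
homes = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "bedrooms": [3, 4, 33, 2],
    "sqft_living": [1500, 2200, 1620, 900],
})

# Hypothetical cutoff: anything above this is treated as suspicious
MAX_PLAUSIBLE_BEDROOMS = 12

anomalies = homes[homes["bedrooms"] > MAX_PLAUSIBLE_BEDROOMS]
cleaned = homes.drop(index=anomalies.index)
print(anomalies)
print(len(cleaned), "rows kept")
```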
Chapter 3:
Modelling:
Since our target variable price is a continuous variable, we apply
different regression techniques on train and validation data and
apply the best one on test data.
Model 1:
From the correlation matrix we selected the columns that are highly
correlated with the target column and used them to predict the price.
Model-Multiple Linear Regression
Predictors:
bedrooms, bathrooms, sqft_living, view, grade, sqft_above,
sqft_basement, sqft_living15
We use metrics such as root mean squared error (RMSE) and r2_score
to evaluate our model.
We achieved an RMSE of 230245 and an r2 score of 0.58.
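The Model 1 workflow (correlation-based selection, multiple linear regression, RMSE and r2 score) can be sketched as follows, assuming scikit-learn, pandas and NumPy. The synthetic data, the 0.3 correlation threshold and the 150/50 split are assumptions made for illustration; the numbers it prints are unrelated to the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the housing data (not the real dataset)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "grade": rng.integers(3, 14, n),
    "condition": rng.integers(1, 6, n),  # unrelated to price by construction
})
data["price"] = (200 * data["sqft_living"]
                 + 40000 * data["grade"]
                 + rng.normal(0, 50000, n))

# Keep predictors whose absolute correlation with price exceeds a threshold
corr = data.corr()["price"].drop("price").abs()
selected = corr[corr > 0.3].index.tolist()  # 0.3 is an assumed cutoff

# Fit on the first 150 rows, evaluate on the remaining 50
train, val = data.iloc[:150], data.iloc[150:]
model = LinearRegression().fit(train[selected], train["price"])
pred = model.predict(val[selected])

rmse = float(np.sqrt(mean_squared_error(val["price"], pred)))
r2 = r2_score(val["price"], pred)
print("selected:", selected)
print("RMSE:", rmse, "r2:", r2)
```

RMSE is in the same units as price, so it reads as a typical error in currency, while the r2 score is the share of price variance the model explains.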
Model 2:
Feature Engineering:
We created a new feature age from the yr_built feature by
calculating the age of the house
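A minimal sketch of deriving such a feature, assuming pandas. The report does not say which reference year it used; REFERENCE_YEAR = 2015 (the last sale year in the dataset) is an assumption, and the sale year of each home would be another reasonable choice.

```python
import pandas as pd

# Assumed reference year (not stated in the report)
REFERENCE_YEAR = 2015

# Toy construction years (made-up rows)
homes = pd.DataFrame({"yr_built": [1910, 1987, 2014]})

# Age of the house in years at the reference year
homes["age"] = REFERENCE_YEAR - homes["yr_built"]
print(homes)
```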
Model 3:
From EDA we learned that the location of the house plays an
important role, so we converted the zipcode variable into dummy
variables using the get_dummies method, as it is a categorical
variable.
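The encoding step can be sketched with pandas as below. The toy rows and the "zip" prefix are illustrative assumptions.

```python
import pandas as pd

# Toy rows (made-up prices); zipcode is a label, not a quantity
homes = pd.DataFrame({
    "zipcode": [98102, 98103, 98102],
    "price": [900000, 550000, 870000],
})

# One indicator column per distinct zipcode; the original column is removed
encoded = pd.get_dummies(homes, columns=["zipcode"], prefix="zip")
print(encoded)
```

Treating zipcode as a number would imply that 98103 is "larger" than 98102, which has no meaning for price; indicator columns avoid that.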
Model 4:
In Model 4 we applied a log transformation to the total, total1 and
total2 variables and removed the lat and long features, since
location is already captured by zipcode. We then trained our model.
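A sketch of these two steps, assuming pandas and NumPy. The report does not define what total, total1 and total2 contain, so the values below are made up, and the natural log via np.log assumes all three columns are strictly positive.

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow the report, values are made up
features = pd.DataFrame({
    "total": [1000.0, 2500.0, 40000.0],
    "total1": [800.0, 1900.0, 3000.0],
    "total2": [200.0, 600.0, 37000.0],
    "lat": [47.51, 47.72, 47.61],
    "long": [-122.25, -122.31, -122.04],
})

# Log-transform the skewed columns (requires strictly positive values)
for col in ["total", "total1", "total2"]:
    features[col] = np.log(features[col])

# Drop coordinates, since location is represented by zipcode instead
features = features.drop(columns=["lat", "long"])
print(features)
```

The log compresses the long right tail seen in the EDA, which tends to suit a linear model better than the raw values.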
Chapter 4:
Conclusion
Among all the models, Model 4 gave the lowest RMSE value and the
highest r2 score. Hence, we used Model 4 to predict on the test data
and achieved:
RMSE=126847
Hence Model 4 can be used for deployment.
GitHub link:
https://github.com/phani452/Machine-learning-project
Reference link:
https://www.ijitee.org/wpcontent/uploads/papers/v8i9/I7849078919.pdf
Submitted by:
Phaneendra
E20020