Aastha Mahajan Python File

INTRODUCTION
One of the basic requirements of livelihood in the recent world is to buy a house of your own.
The price of the house may depend on various factors. Real estate agents and many who are involved
in selling the house want a price tag on the house which would be the real worth of buying the house.
The prediction of the price of the house is often very hard for the inexperienced.
House is a fundamental need to someone and them costs range from place to place primarily
based totally at the facilities to be had like parking area, location, etc. The residence price is a factor
that issues a lot of citizens either wealthy or white-collar magnificence as possible by no means decide
or gauge the value of a residence primarily based totally on place or places of work accessible.
Purchasing a residence is the best and unique desire of a own circle of relatives because this expends
the whole thing of their funding budget and every now and then cover them under loan. It is a hard
challenge for expecting the correct value of residence price. This proposed version could make viable
who are expecting the precise costs of house. This model of Linear regression in machine learning takes
the internal factor of house valuation dependencies like area, no of bedrooms, locality etc. and
external factors like air pollution and crime rates. This Linear regression in machine learning gives the
output of price of the house with more accuracy.
Here in this project, we are going to use linear regression algorithm (a supervised learning
approach) in machine learning to build a predictive model for estimation of the house price for real
estate customers. In this project we are going to build the Machine learning model by using python
programming and other python tools like NumPy, pandas, matplotlib etc. We are also using scikit-learn
library in our approach of this project. For going further into this project, we prepare dataset which
consists of features like no of bedrooms, area of house, locality, etc. After preparing the dataset we
will use 80 percentage of data for training the ml model and 20 percentage of data for testing the ml
model.
Buying a house is one of the most important decisions for an individual. The price of a house is
dependent on wide factors which are its features such as no of bedrooms, area of construction, locality
of the house etc. and many external factors such as air pollution and crime rate. These all factors make
the prediction of price of the house more complex. This house price prediction is beneficial for many
real estates. So, an easier and accurate method is needed for the prediction of house price.
This project is proposed to predict the price of a house for real estate customers based on their
preferences like no of rooms, no of bedrooms, area of the house, locality etc. This machine learning
model to predict the house price will be very helpful for the customers, because it is being the major
issue for the people, as they can never estimates the price of the house on the basis of locality or other
features available.
To tackle this problem, we are going to design a ML model with the assistance of a linear regression
algorithm.
Page | 1
OBJECTIVE
Price prediction uses an algorithm to analyze a product or service based on its characteristics,
demand, and current market trends. Then the software sets a price at a level it predicts will both attract
customers and maximize sales. In some circles, the practice is called price forecasting or predictive
pricing.
House price prediction can help the developer determine the selling price of a house and can help
the customer to arrange the right time to purchase a house. There are three factors that influence the
price of a house which include physical conditions, concept and location.
House price prediction using machine learning(Linear regression model) We have used Kaggle
kc_house_data dataset. Import the required libraries and modules, including pandas for data
manipulation, scikit-learn for machine learning algorithms, and LinearRegression for the linear
regression model.
Page | 2
BACKGROUND
House price prediction can help the developer determine the selling price of a house and can help
the customer to arrange the right time to purchase a house. There are three factors that influence the
price of a house which include physical conditions, concept and location.. By harnessing the capabilities
of Python libraries such as NumPy, Pandas, Seaborn, and matplotlib.
The "House Price Prediction" project focuses on predicting housing prices using machine
learning techniques. By leveraging popular Python libraries such as NumPy, Pandas, Scikit-learn
(sklearn), Matplotlib, Seaborn, and XGBoost, this project provides an end-to-end solution for accurate
price estimation.
IDE Use In a Project –
Colaboratory : Google Colaboratory, popularly known as Colab, is a web IDE for python that
was released by Google in 2017. Colab is an excellent tool for data scientists to execute Machine
Learning and Deep Learning projects with cloud storage capabilities.
Colab is basically a cloud-based Jupyter notebook environment that requires no setup. What’s
more, it provides its users free access to high compute resources such as GPUs and TPUs that
are essential to training models quickly and more efficiently.
Google Colab is built around Project Jupyter code and hosts Jupyter notebooks without requiring
any local software installation. But while Jupyter notebooks support multiple languages,
including Python, Julia and R, Colab currently only supports Python.
Page | 3
Data Set Use In Project –
1. House_size.csv
2. House_location.csv
3. House_area_type.csv
4. House_Data.csv
Steps to Develop a Project –
1. Data Collection
• Obtain a dataset containing house information from sources like IMDb, Kaggle.
• Ensure the dataset includes area_type,availability,location,size,society,total_sqft.
2. Data Preprocessing
• Load Dataset - Obtain a dataset containing information about house.
• Clean Data - Handle missing values, remove duplicates, and format data appropriately.
• Transform Data - Convert categorical variables into numerical representations if
needed.
• Normalize Ratings - Normalize rating data to ensure consistency.
3. Feature Engineering
We first look into dimensionality reduction, where we consider the correlation between ‘total_rooms,’
‘total_bedrooms,’ and ‘households.’ By understanding their relationships, we can identify if any of these
features can be combined or reduced without losing valuable information.
4. Building Recommendation Model

• Implement recommendation techniques such as collaborative filtering, content-based
filtering, or hybrid approaches.
• Utilize NumPy for efficient computations and data manipulation required for model
development.
• Train the model using the preprocessed dataset to learn patterns and make accurate
recommendations.
Page | 4
Python Libraries Use In Project -
1. Pandas - Pandas is a powerful Python library for data manipulation and analysis. It provides
data structures like DataFrame and Series, allowing users to handle structured data effectively.
Pandas simplifies tasks such as cleaning, filtering, and transforming data. With its intuitive
syntax and comprehensive functionality, Pandas is widely used in data science, machine
learning, and statistical analysis workflows.
1. pd.read_csv() - This function is used to read data from a CSV file into a Pandas
DataFrame. It takes in the path to the CSV file as input and returns a DataFrame
containing the data.
2. DataFrame.head() - This function is used to view the first n rows of a DataFrame. By
default, it returns the first 5 rows, but you can specify the number of rows to display.
3. DataFrame.tail() - This function is used to view the last n rows of a DataFrame. By
default, it returns the last 5 rows, but you can specify the number of rows to display.
4. DataFrame.info() - This function provides a concise summary of a DataFrame,
including the data types of each column and the number of non-null values. It's useful for
quickly understanding the structure of the data.
5. DataFrame.merge() -This function is used to merge two DataFrames based on a
common column or index. It performs a database-style join operation, allowing you to
combine data from different sources.
6. DataFrame.plot() - This function is used to create various types of plots directly from a
DataFrame. It provides a convenient way to visualize data using matplotlib under the
hood.
7. DataFrame.pivot_table() - This function creates a pivot table from a DataFrame,
allowing you to summarize and aggregate data based on one or more columns.
8. DataFrame.dropna() - This function is used to remove rows or columns with missing
values (NaN). It provides options to drop rows or columns based on different criteria.
2. Numpy - NumPy is a fundamental Python library for numerical computing. It provides support
for multi-dimensional arrays and matrices, along with a variety of mathematical functions to
operate on these arrays efficiently. NumPy's capabilities include array manipulation, linear
algebra operations, Fourier transform, and random number generation. It serves as a cornerstone
for scientific computing and data analysis in Python.
Page | 5
1. numpy.array() - This function creates a NumPy array from a Python list or tuple. It
allows you to create multi-dimensional arrays easily.
2. numpy.arange() - This function returns an array with evenly spaced values within a
specified range. It is similar to Python's range() function but returns an array instead of a
list.
3. numpy.zeros() - This function creates an array filled with zeros of a specified shape.
4. numpy.ones() - This function creates an array filled with ones of a specified shape.
5. numpy.sum() - This function computes the sum of array elements over a specified axis
or the whole array if no axis is specified.
6. numpy.random.rand() - This function generates random numbers from a uniform
distribution over the interval [0, 1].
1. Seaborn - Seaborn is an amazing visualization library for statistical graphics plotting in

Python. It provides beautiful default styles and color palettes to make statistical plots more
attractive. It is built on top matplotlib library and is also closely integrated with the data
structures from pandas.
Seaborn aims to make visualization the central part of exploring and understanding data. It
provides dataset-oriented APIs so that we can switch between different visual representations
for the same variables for a better understanding of the dataset.
Different categories of plot in Seaborn

• Relational plots: This plot is used to understand the relation between two
variables.
• Categorical plots: This plot deals with categorical variables and how they can be
visualized.
• Distribution plots: This plot is used for examining univariate and bivariate
distributions.
• Regression plots: The regression plots in Seaborn are primarily intended to add a
visual guide that helps to emphasize patterns in a dataset during exploratory data
analyses.
• Matrix plots: A matrix plot is an array of scatterplots.
• Multi-plot grids: It is a useful approach to draw multiple instances of the
same plot on different subsets of the dataset.
• Cosine Similarity - The Cosine Similarity library in Python computes the cosine similarity
between pairs of vectors, measuring the cosine of the angle between them in a multi-
dimensional space.
Page | 6
HARDWARE AND SOFTWARE REQUIREMENTS
Hardware Requirements -
HARDWARE TOOLS MIMINUM REQUIREMENTS
PROCESSOR Core i3 or above
HARD DISK 50 GB
RAM 4.00 GB (Recommended 8.00 GB)
MONITOR RESOLUTION 1024x768
Software Requirements –
SOFTWARE TOOLS MINIMUM REQUIREMENTS
OPERATING SYSTEM WINDOWS 7/8/10/11, macOS 10.14 or

later, Any LINUX distribution
BIT 64-bit Operating System
TECHNOLOGY Python 3.12.0
PROGRAMING LANGUAGE Python
IDE Colab Notebook
Page | 7
CODING
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/bengaluru-house-price-data/Bengaluru_House_Data.csv')
df.head(2)
df.info()
df.isnull().sum()
df['location'] = df['location'].fillna('Sarjapur Road')
df['size'] = df['size'].fillna('2 BHK')
df['bath'] = df['bath'].fillna(df['bath'].median())
# extract bhk from size
df['bhk'] = df['size'].str.split().str.get(0).astype(int)
numeric_values = pd.to_numeric(df['total_sqft'], errors='coerce')
non_numeric_values = df['total_sqft'][numeric_values.isna()].tolist()
Page | 8
def convertRange(x):
temp = x.split('-')
if len(temp)==2:
return (float(temp[0])+float(temp[1]))/2
try:
return float(x)
except:
return None
df['total_sqft'] = df['total_sqft'].apply(convertRange)
df.dropna(inplace=True)
df['Price_per_square_feet'] = df['price']*100000 / df['total_sqft']
location_count = df['location'].value_counts()
location_count_less_10 = location_count[location_count<10]
df['location'] = df['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)
# filter >= 300
df = df[((df['total_sqft']/df['bhk'])>=300)]
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(vector)
movies[movies['title_x'] == 'Raja Hindustani'].index[0]
def remove_outliers_sqft(df):
df_output = pd.DataFrame()
for key, subdf in df.groupby('location'):
m = np.mean(subdf.total_sqft / subdf.bhk)
st = np.std(subdf.total_sqft / subdf.bhk)
gen_df = subdf[((subdf.total_sqft / subdf.bhk) > (m - st)) & ((subdf.total_sqft / subdf.bhk) <= (m

+ st))]
df_output = pd.concat([df_output, gen_df], ignore_index=True)
return df_output
Page | 9
df = remove_outliers_sqft(df)
df.drop(columns=['society' , 'availability' ,'area_type' , 'balcony'] , inplace=True)
df.drop(columns='size' , inplace=True)
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr()
# Create a heatmap
plt.figure(figsize=(7, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')
plt.show()
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
X = df.drop(columns=['price'])
y = df['price']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
column_trans=make_column_transformer((OneHotEncoder(sparse=False),['location']),remainder='pas
sthrough')
scaler=StandardScaler()
Page | 10
lr = LinearRegression()
pipe = make_pipeline(column_trans,scaler,lr)
pipe.fit(X_train,y_train)
y_pred = pipe.predict(X_test)
y_pred_ridge = pipe.predict(X_test)
r2_score(y_test,y_pred_ridge)
Page | 11
OUTPUT –
Page | 12
Page | 13
Page | 14
FUTURE SCOPE
Enhanced Accuracy: Continuously improving the accuracy of the prediction models by
incorporating new data sources, refining algorithms, and implementing advanced machine learning
techniques.
Integration of Additional Features: Incorporating more features such as neighborhood demographics,
crime rates, proximity to amenities, and environmental factors to make the predictions more
comprehensive and accurate.
Real-Time Prediction: Developing real-time prediction capabilities that can adjust prices dynamically
based on market trends, economic indicators, and other relevant factors.
Interactive Visualization: Building interactive visualization tools to help users explore and understand
the factors influencing house prices and the predictions generated by the model.
Expansion to Other Regions: Scaling the project to cover a wider geographical area or specific regions
with different market dynamics, allowing users to access accurate predictions for various locations.
Mobile Applications: Developing mobile applications that provide users with on-the-go access to house
price predictions, property listings, and other related information.
Integration with Real Estate Platforms: Partnering with real estate platforms to integrate the prediction
models, providing users with insights into property values and investment opportunities.
Incorporation of Market Sentiment Analysis: Integrating sentiment analysis of real estate market news,
social media discussions, and other sources to gauge market sentiment and its impact on house prices.
Personalized Recommendations: Offering personalized recommendations for property investments
based on individual preferences, risk tolerance, and financial goals.
Regulatory Compliance: Ensuring compliance with regulations related to real estate pricing and data
privacy, especially as the project expands into new markets or involves sensitive user information.
Page | 15
CONCLUSION
In conclusion, this article explored the fascinating realm of house price prediction using
machine learning in Python. By leveraging various algorithms and data preprocessing
techniques, we demonstrated how predictive models can be developed to estimate house
prices with remarkable accuracy.
The significance of feature engineering in enhancing model performance was evident,

as it allowed us to extract meaningful insights from the dataset and capture essential patterns.
The approaches for predicting property prices are constantly changing in line with the area of
machine learning. Such housing market predictions will likely grow more accurate and effective
as new algorithms and data become accessible.
Thus, the machine learning model using linear regression algorithm is very helpful in predicting the
house prices for real estate customers. Here we have used a supervised learning approach in machine
learning field which will yield us a best possible result.
Thus the machine learning model to predict the house price based ongiven dataset is executed
successfully using xg regressor (a upgraded/slighted boosted form of regular linear regression, this
gives lesser error). This model further helps people understand whether this place is more suited for
them based on heatmap correlation. It also helps people looking to sell a house at best time for greater
profit. Any house price in any location can be predicted with minimum errorby giving appropriate
dataset.
Page | 16
REFERENCES
 https://www.python.org/
 https://www.kaggle.com/
 https://pandas.pydata.org/
 https://seaborn.pydata.org/
 https://numpy.org/
 https://www.youtube.com/
Page | 17

Aastha Mahajan Python File

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Aastha Mahajan Python File

Uploaded by

Copyright:

Available Formats

INTRODUCTION

IDE Use In a Project –

Steps to Develop a Project –

4. Building Recommendation Model

1. Seaborn - Seaborn is an amazing visualization library for statistical graphics plotting in

Different categories of plot in Seaborn

HARDWARE TOOLS MIMINUM REQUIREMENTS

PROCESSOR Core i3 or above

RAM 4.00 GB (Recommended 8.00 GB)

MONITOR RESOLUTION 1024x768

SOFTWARE TOOLS MINIMUM REQUIREMENTS

OPERATING SYSTEM WINDOWS 7/8/10/11, macOS 10.14 or

df['location'] = df['location'].fillna('Sarjapur Road')

df['size'] = df['size'].fillna('2 BHK')

# extract bhk from size

numeric_values = pd.to_numeric(df['total_sqft'], errors='coerce')

from sklearn.metrics.pairwise import cosine_similarity

for key, subdf in df.groupby('location'):

gen_df = subdf[((subdf.total_sqft / subdf.bhk) > (m - st)) & ((subdf.total_sqft / subdf.bhk) <= (m

df_output = pd.concat([df_output, gen_df], ignore_index=True)

df.drop(columns=['society' , 'availability' ,'area_type' , 'balcony'] , inplace=True)

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)

from sklearn.model_selection import train_test_split

The significance of feature engineering in enhancing model performance was evident,

You might also like