An Interim Report submitted in partial fulfillment of the requirements for the
award of the degree of
BACHELOR OF TECHNOLOGY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
by
Mrs. Kasthuri
Developer
Techciti Software Consulting Private Limited
2021-22
TECHCITI SOFTWARE CONSULTING PVT LTD.
3rd Floor, BNR Complex, Sri Rama Layout, J.P. Nagar 7th Phase, Bangalore - 560078
(Above State Bank of India; Bus stop: Brigade Millennium)
Bonafide Certificate
This is to certify that this Interim report entitled "HOUSE PRICES PREDICTION
SYSTEM", submitted to the Department of Electronics & Communication Engineering, KL
Deemed to be University, Guntur, in connection with the University internship program, is a
bonafide record of work done by "MADDASANI VENKATA RAMI REDDY" under my
supervision at "TECHCITI SOFTWARE CONSULTING PVT LTD" from "6TH DEC 2021"
to "19TH MAR 2022".
Ms. Kasthuri K
Software Developer

K L Deemed to be UNIVERSITY
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

CERTIFICATE
This is to certify that the Project Report entitled "House Prices Prediction System", being
submitted by "180040161 - MADDASANI VENKATA RAMI REDDY" in partial fulfillment
for the award of B.Tech in Electronics and Communication Engineering to KL University, is a record of
bonafide work carried out under guidance and supervision. The results in this report have not been copied
from any other department/University/Institute.
I have taken efforts in this project. However, it would not have been possible without the
kind support and help of Ms. Kasthuri K, Company Guide.
I am highly indebted to Ms. Kasthuri K, Project Coordinator, for her guidance and constant
supervision, as well as for providing necessary information regarding the project and for
her support in completing the project.
We wish to express our sincere thanks to Dr. M Suman, Head of the Department of ECE,
for providing an opportunity to undertake this project.
We also wish to express our sincere thanks to our University Guide, Mr. Vinay Atgur,
for his valuable guidance and suggestions in completing our project successfully.
It would be unfair if we did not mention the invaluable contribution and timely cooperation
extended by the staff members of our department. We would like to thank our Institution,
without which this project would have been a distant reality.
I would like to express our gratitude towards our parents for their kind cooperation and
encouragement, which helped us in the completion of this project.
ABSTRACT
House price forecasting is an important topic in real estate. The literature attempts to
derive useful knowledge from historical data of property markets. Machine learning
techniques are applied to analyze historical property transactions in the USA to discover
useful models for house buyers and sellers. The analysis reveals a high discrepancy between
house prices in the most expensive and most affordable suburbs. Moreover, experiments
demonstrate that Multiple Linear Regression, based on mean squared error
measurement, is a competitive approach. People are careful when they are trying to buy a
new house within their budgets and market strategies. The objective of this paper is to
forecast coherent house prices for non-house-holders based on their financial
provisions and their aspirations. By analyzing previous sales, price ranges, and
upcoming developments, future prices will be estimated. The paper involves
predictions using different regression techniques: Multiple Linear, Ridge, LASSO,
Elastic Net, Gradient Boosting, and AdaBoost Regression. House price prediction on a
data set has been done using all the above-mentioned techniques to find the best
among them. The motive of this paper is to help the seller estimate the selling price of a
house accurately and to help buyers predict the right time to purchase a house.
Some of the related factors that impact the price were also taken into consideration, such as
physical condition, concept, and location.
INDEX
CONTENTS

CHAPTER-I
1.1 INTRODUCTION
1.2 SOFTWARE DESCRIPTION

CHAPTER-II
2.1 DATA SET
2.2 DATA EXPLORATION
2.3 DATA VISUALIZATION
2.4 DATA SELECTION
2.5 DATA TRANSFORMATION

CHAPTER-III
3.1 PYTHON
3.2 NUMPY
3.3 PANDAS
3.4 MATPLOTLIB
3.5 SCIKIT-LEARN
3.6 SEABORN

CHAPTER-IV
4.1 MACHINE LEARNING
4.2 MODELS USED
4.3 REGRESSION ANALYSIS

CHAPTER-V
5.1 HOUSE SALES PRICES USING LINEAR REGRESSION
5.2 DJANGO WEB FRAMEWORK

CHAPTER-VI
6.1 CODE
6.2 RESULTS AND DISCUSSION
6.3 BEST SUITED MODEL
CONCLUSION
REFERENCES
List of Figures
Fig 1: Visualization Graphs
Fig 2: Correlation Heatmap
CHAPTER-I
1.1 INTRODUCTION
There are some parameters on which we will evaluate ourselves: create an effective
price prediction model, validate the model's prediction accuracy, and identify the important
home price attributes that feed the model's predictive power. In supervised learning, a dataset
is present with inputs and known outputs. In unsupervised learning, the machine learns from a
dataset that comes with input variables only. In a reinforcement learning model, algorithms
are used to select an action. This project is implemented using supervised machine learning
algorithms. The outcome of our project is to make predictions on the sale prices of
houses in California State with the dataset provided. It is hoped this study will inform better
analysis of gathered information (unanalyzed data) and other machine learning techniques.
Linear Regression is a supervised machine learning model for finding the relationship
between independent variables and a dependent variable. Linear regression performs the
task of predicting the response (dependent) variable value (y) based on a given (independent)
explanatory variable (x). So, this regression technique finds a linear relationship
between x (input) and y (output).
CHAPTER- II
2.1 DATASET
Problem Statement - Real estate agents want help to predict house prices for regions in
their area. We were given a dataset to work on, and we decided to use the Linear Regression
model. The goal is to create a model that will help an agent estimate what a house would sell for.
The dataset contains 7 columns and 5000 rows, stored as a CSV file. The data contains the
following columns:
• 'Avg. Area Income' - Average income of residents of the city the house is located in.
• 'Avg. Area House Age' - Average age of houses in the same city.
• 'Avg. Area Number of Rooms' - Average number of rooms for houses in the same city.
• 'Avg. Area Number of Bedrooms' - Average number of bedrooms for houses in the same
city.
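A dataset with this layout can be loaded and inspected with pandas. The sketch below is illustrative: the CSV filename is an assumption, and the two rows of values are made up, not taken from the real dataset.

```python
import pandas as pd

# In practice the data would be read from the CSV file, e.g.:
#     df = pd.read_csv("USA_Housing.csv")   # filename assumed
# For illustration, build a two-row frame with the same column layout
# (the values below are made up, not from the real dataset):
df = pd.DataFrame({
    "Avg. Area Income": [79545.46, 61287.06],
    "Avg. Area House Age": [5.68, 6.00],
    "Avg. Area Number of Rooms": [7.01, 6.73],
    "Avg. Area Number of Bedrooms": [4.09, 3.09],
})

print(df.shape)            # (rows, columns)
print(df.columns.tolist())
```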
2.2 DATA EXPLORATION
Data exploration is the first step in data analysis and typically involves summarizing the
main characteristics of a data set, including its size, accuracy, initial patterns in the data,
and other attributes. It is commonly conducted by data analysts using visual analytics tools,
but it can also be done in more advanced statistical software such as Python. Before it can conduct
analysis on data collected by multiple data sources and stored in data warehouses, an
organization must know how many cases are in a data set, what variables are included,
how many missing values there are, and what general hypotheses the data is likely to
support. An initial exploration of the data set can help answer these questions by
familiarizing analysts with the data with which they are working. This area of data
exploration has become an area of interest in the field of machine learning. This is a
relatively new field and is still evolving. At its most basic level, a machine-learning
algorithm can be fed a data set and can be used to identify whether a hypothesis is true
based on the dataset. Common machine learning algorithms focus on identifying a
specific pattern. Common patterns include regression, classification, and clustering,
but there are many possible patterns and algorithms that can be applied to data via machine
learning.
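The questions above (how many cases, which variables, how many missing values) can be answered with a few pandas calls. A minimal sketch on a made-up stand-in frame:

```python
import pandas as pd
import numpy as np

# Illustrative stand-in for the housing data (values are made up)
df = pd.DataFrame({
    "Avg. Area Income": [79545.0, 61287.0, np.nan],
    "Price": [1059034.0, 1505891.0, 1058988.0],
})

# How many cases, which variables, and how many missing values?
print(df.shape)            # (rows, columns)
print(df.dtypes)           # variable types
print(df.isnull().sum())   # missing values per column
print(df.describe())       # summary statistics
```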
2.3 DATA VISUALIZATION

Fig 1: Visualization Graphs

2.4 DATA SELECTION
Data selection is defined as the process of determining the appropriate data type and
source, as well as suitable instruments to collect data. Data selection precedes the actual
practice of data collection. This definition distinguishes data selection from selective data
reporting (selectively excluding data that is not supportive of a research hypothesis) and
interactive/active data selection (using collected data for monitoring activities/events or
conducting secondary data analyses). The process of selecting suitable data for a research
project can impact data integrity.
The primary objective of data selection is the determination of appropriate data type,
source, and instrument(s) that allow investigators to adequately answer research questions.
This determination is often discipline-specific and is primarily driven by the nature of the
investigation, existing literature, and accessibility to necessary data sources.
Fig 2: Correlation Heatmap
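A correlation heatmap like Fig 2 can be produced with Seaborn from the numeric columns of the dataset. The frame below is an assumed stand-in with made-up values; with the real data, the same two lines at the end apply unchanged.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in numeric data; with the real dataset, pass its numeric columns
df = pd.DataFrame({
    "income":    [79545.0, 61287.0, 63345.0, 59982.0],
    "house_age": [5.68, 6.00, 5.87, 7.19],
    "price":     [1059034.0, 1505891.0, 1058988.0, 1260617.0],
})

# Heatmap of the pairwise correlation matrix, as in Fig 2
ax = sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
```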
2.5 DATA TRANSFORMATION

Data transformation is the process of changing the format, structure, or values of data. For
data analytics projects, data may be transformed at two stages of the data pipeline.
Organizations that use on-premises data warehouses generally use an ETL process, in
which data transformation is the middle step. Today, most organizations use cloud-based data
warehouses, which can scale compute and storage resources with latency measured in
seconds or minutes. The scalability of the cloud platform lets organizations skip preload
transformations and load raw data into the data warehouse, then transform it at query time,
a model called ELT (extract, load, transform).
In computing, data transformation is the process of converting data from one format or
structure into another format or structure. It is a fundamental aspect of most data
integration and data management tasks such as data wrangling, data warehousing, data
integration, and application integration.
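One common transformation step in a prediction pipeline is rescaling the feature values. A minimal sketch with scikit-learn's StandardScaler on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rescale features to zero mean and unit variance (values are illustrative)
X = np.array([[79545.0, 5.68],
              [61287.0, 6.00],
              [63345.0, 5.87]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # ~0 for each column
print(X_scaled.std(axis=0))   # ~1 for each column
```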
CHAPTER-III
3.1 PYTHON
Python is a popular programming language. It was created by Guido van Rossum and
released in 1991. It is used for:
1) Web development (server-side),
2) Software development,
3) Mathematics,
4) System scripting.
• Python can connect to database systems. It can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.
• Python can be used for rapid prototyping, or for production-ready software development.
• Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.).
• Python has a syntax that allows developers to write programs with fewer lines than some
other programming languages.
• Python runs on an interpreter system, meaning that code can be executed as soon as it is
written. This means that prototyping can be very quick.
3.2 NUMPY
The topic is very broad: datasets can come from a wide range of sources and a wide range
of formats, including collections of documents, collections of images, collections of
sound clips, collections of numerical measurements, or nearly anything else. Despite this
apparent heterogeneity, it will help us to think of all data fundamentally as arrays of
numbers. Images, for example, can be thought of as two-dimensional arrays of numbers
representing pixel brightness across the area. Sound clips can be thought of as
one-dimensional arrays of intensity versus time. Text can be converted in various ways
into numerical representations, perhaps binary digits representing the frequency of certain
words or pairs of words. No matter what the data are, the first step in making it analyzable
will be to transform them into arrays of numbers.
• Attributes of arrays: Determining the size, shape, memory consumption, and data types of
arrays
• Indexing of arrays: Getting and setting the value of individual array elements
• Slicing of arrays: Getting and setting smaller subarrays within a larger array
• Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array
into many
NumPy (short for Numerical Python) provides an efficient interface to store and
operate on dense data buffers. In some ways, NumPy arrays are like Python's built-in "list"
type, but NumPy arrays provide much more efficient storage and data operations as the
arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of
data science tools in Python, so time spent learning to use NumPy effectively will be
valuable no matter what aspect of data science interests you.
If you installed the Anaconda stack, you already have NumPy installed and ready to go.
If you're more the do-it-yourself type, you can go to http://www.numpy.org/ and follow
the installation instructions found there.
First, let's discuss some useful array attributes. We'll start by defining three random
arrays: a one-dimensional, two-dimensional, and three-dimensional array. We'll use
NumPy's random number generator, which we will seed with a set value in order to ensure
that the same random arrays are generated each time this code is run:
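A sketch of those three arrays, using NumPy's newer Generator API (the seed value 0 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the same arrays appear every run

x1 = rng.integers(10, size=6)          # one-dimensional array
x2 = rng.integers(10, size=(3, 4))     # two-dimensional array
x3 = rng.integers(10, size=(3, 4, 5))  # three-dimensional array

# Each array carries its dimensions, shape, and total size as attributes
print(x3.ndim, x3.shape, x3.size)
```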
Fixed-Type Arrays in Python
Python offers several different options for storing data in efficient, fixed-type data
buffers. The built-in array module can be used to create dense
arrays of a uniform type:
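For example:

```python
import array

# A dense, fixed-type array of integers built from a Python list
L = list(range(10))
A = array.array('i', L)
print(A)
```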
Here 'i' is a type code indicating the contents are integers. Here I'm importing standard
NumPy under the alias np:
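The conventional import is a single line:

```python
import numpy as np
```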
We can use np.array to create arrays from Python lists. Unlike Python lists, NumPy
is constrained to arrays that all contain the same type. If types do not match, NumPy will
upcast if possible (here, integers are up-cast to floating point):
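For example:

```python
import numpy as np

# Mixing integers and a float in one list: NumPy upcasts everything to float
a = np.array([3.14, 4, 2, 3])
print(a, a.dtype)
```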
3.3 PANDAS
As we saw, NumPy's ndarray data structure provides essential features for the type of
clean, well-organized data typically seen in numerical computing tasks. While it serves this
purpose very well, its limitations become clear when we need more flexibility (e.g.,
attaching labels to data, working with missing data, etc.) and when attempting operations
that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of
which is an important piece of analyzing the less structured data available in many forms
in the world around us. Pandas, with its Series and DataFrame objects, builds on the NumPy
array structure and provides efficient access to these sorts of "data munging" tasks that
occupy much of a data scientist's time.
From what we've seen so far, it may look like the Series object is basically
interchangeable with a one-dimensional NumPy array. The essential difference is the
presence of the index: while the NumPy array has an implicitly defined integer index used
to access the values, the Pandas Series has an explicitly defined index associated with the
values. This explicit index definition gives the Series object additional capabilities. For
example, the index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:
3.4 MATPLOTLIB
We’ll now look at the Matplotlib tool for visualization in Python. Matplotlib is a
multiplatformdata visualization library built on NumPy arrays and designed to work with
the broader SciPy stack. It was conceived by John Hunter in 2002, originally as a patch to
IPython for enabling interactive MATLAB-style plotting via gnuplot from the IPython
command line. IPython’screator, Fernando Perez, was at the time scrambling to finish his
PhD, and let John know he wouldn’t have time to review the patch for several months. John
took this as a cue to set out on hisown, and the Matplotlib package was born, with version
0.1 released in 2003.
One of Matplotlib’s most important features is its ability to play well with many
operating systems and graphics backends. This cross-platform, everything-to-everyone
approach has been one of the great strengths of Matplotlib. For this reason, I believe that
Matplotlib itself will remaina vital piece of the data visualization stack, even if new tools
mean the community gradually movesaway from using the Matplotlib API directly.
Importing Matplotlib

Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we
will use some standard shorthands for Matplotlib imports:
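The standard shorthands are:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
```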
Basic Errorbars
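A basic errorbar plot can be sketched as follows; the sine-plus-noise data and the error estimate dy are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# 50 noisy samples of a sine curve, each with the same error estimate dy
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)

# fmt='.k' draws black points; yerr draws a vertical error bar at each point
plt.errorbar(x, y, yerr=dy, fmt='.k')
plt.show()
```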
3.5 SCIKIT-LEARN
There are several Python libraries that provide solid implementations of a range of
machine learning algorithms. One of the best known is Scikit-Learn, a package that
provides efficient versions of a large number of common algorithms. Scikit-Learn is
characterized by a clean, uniform, and streamlined API, as well as by very useful and
complete online documentation. A benefit of this uniformity is that once you understand
the basic use and syntax of Scikit-Learn for one type of model, switching to a new model
or algorithm is very straightforward. Machine learning is about creating models from data:
for that reason, we'll start by discussing how data can be represented in order to be
understood by the computer. The best way to think about data within Scikit-Learn is in
terms of tables of data. For example, consider the Iris dataset, famously analyzed by Ronald
Fisher in 1936.
We can download this dataset in the form of a Pandas DataFrame using the Seaborn library:
iris = sns.load_dataset('iris')
iris.head()

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3.6 SEABORN
Seaborn helps you explore and understand your data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots. Its
dataset-oriented, declarative API lets you focus on what the different elements of your plots
mean, rather than on the details of how to draw them.
Seaborn is a data visualization library built on top of Matplotlib and closely integrated with
Pandas data structures in Python. Visualization is the central part of Seaborn, which helps in
the exploration and understanding of data.
One has to be familiar with NumPy, Matplotlib, and Pandas to learn about Seaborn.
These are only some of the functionalities offered by Seaborn; there are many more,
and we can explore all of them here.
1. Distribution Plots
2. Pie & Bar Charts
3. Scatter Plots
4. Pair Plots
5. Heat Maps
1. Distribution Plots

We can compare the distribution plot in Seaborn to histograms in Matplotlib. They both
offer pretty similar functionalities. Instead of frequency plots in the histogram, here we'll
plot an approximate probability density across the y-axis.
We will be using sns.distplot() in the code to plot distribution graphs. Before going further,
first, let's access our dataset.
2. Pie & Bar Charts

A Pie Chart is generally used to analyze how a numeric variable changes across
different categories. In the dataset we are using, we'll analyze how the top 4 categories in
the Content Rating column are performing. From the Pie diagram, we cannot correctly
distinguish between "Everyone 10+" and "Mature 17+". It is very difficult to assess the difference
between those two categories when their values are somewhat similar to each other.
We can overcome this situation by plotting the same data in a Bar Chart. Similar to a Pie Chart,
we can customize our Bar Graph too, with different colors of bars, the title of the chart, etc.
3. Scatter Plots
Up until now, we have been dealing with only a single numeric column from the dataset,
like Rating, Reviews, or Size. But what if we have to infer a relationship between two
numeric columns, say "Rating and Size" or "Rating and Reviews"?
A Scatter Plot is used when we want to plot the relationship between any two numeric columns
from a dataset. These plots are among the most powerful visualization tools used in
the field of machine learning.
Let's see what the scatter plot looks like for two numeric columns in the dataset, "Rating" and
"Size". First, we'll plot the graph using Matplotlib; after that, we'll see how it looks in
Seaborn. We will be using sns.jointplot() in the code for a scatter plot along with the
histograms, and sns.scatterplot() for only scatter plots.
The main advantage of using a scatter plot in Seaborn is that we get both the scatter
plot and the histograms in the graph. If we want to see only the scatter plot, just replace
"jointplot" with "scatterplot" in the code.
4. Pair Plots
Pair Plots are used when we want to see the relationship pattern among more than three different
numeric variables. For example, say we want to see how a company's sales are affected
by three different factors; in that case, pair plots will be very helpful.
CHAPTER-IV
4.1 MACHINE LEARNING

Supervised learning starts from the analysis of a known training dataset: the learning algorithm
produces an inferred function to make predictions about the output values. The system is
able to provide targets for any new input after sufficient training. The learning algorithm can
also compare its output with the correct, intended output and find errors in order to modify
the model accordingly.
Unsupervised learning algorithms are used when the information used to train is neither
classified nor labeled. Unsupervised learning studies how systems can infer a function to
describe a hidden structure from unlabeled data. The system explores the data and can draw
inferences from datasets to describe hidden structures from unlabeled data.
Semi-supervised learning falls between the two, using a small amount of labeled data
and a large amount of unlabeled data. Usually, semi-supervised learning is chosen
when the acquired labeled data requires skilled and relevant resources in order to train it /
learn from it.
Consider some of the instances where machine learning is applied: the self-driving
Google car, cyber fraud detection, online recommendation engines like friend suggestions
on Facebook, Netflix showcasing the movies and shows you might like, and "more items
to consider" and "get yourself a little something" on Amazon are all examples of applied
machine learning. All these examples echo the vital role machine learning has begun to
take in today's data-rich world.
Machines can aid in filtering useful pieces of information that help in major
advancements, and we are already seeing how this technology is being implemented in a
wide variety of industries. The process flow depicted here represents how machine learning
works.
4.2 MODELS USED

4.2.1 Linear Regression model
• It is mostly used for finding out the relationship between variables and forecasting.

Fig: Real vs Predicted
4.2.2 Random Forest Regression model
• Bagging, in the Random Forest method, involves training each decision tree on a different
data sample where sampling is done with replacement.
• The basic idea behind this is to combine multiple decision trees in determining the final
output rather than relying on individual decision trees.
Fig: Real vs Predicted
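The bagging idea above can be sketched with scikit-learn's RandomForestRegressor; the two-feature toy data is made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: "price" as a noisy function of two features
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.05, 200)

# Each of the 100 trees is fit on a bootstrap sample (bagging), and the
# forest's prediction averages the outputs of the individual trees
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))
```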
4.3 REGRESSION ANALYSIS
Regression analysis is a predictive modelling technique that analyzes the relation between
the target (dependent) variable and the independent variables in a dataset. The different types
of regression analysis techniques are used when the target and independent variables
show a linear or non-linear relationship with each other, and the target variable contains
continuous values. The regression technique is used mainly to determine predictor
strength, forecast trends, model time series, and establish cause-and-effect relations.
Regression analysis is the primary technique to solve regression problems in machine
learning using data modelling. It involves determining the best fit line, a line that
passes through the data points in such a way that the total distance of the data points
from the line is minimized.
Types of Regression Analysis Techniques
There are many types of regression analysis techniques, and the choice of method
depends upon several factors. These factors include the type of target variable, the shape
of the regression line, and the number of independent variables.
Below are the different regression techniques:
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Bayesian Linear Regression
The different types of regression techniques in machine learning are explained below
in detail:
1. Linear Regression
Linear regression is one of the most basic types of regression in machine learning. The
linear regression model consists of a predictor variable and a dependent variable related
linearly to each other. In case the data involves more than one independent variable, the
model is called a multiple linear regression model.
The below-given equation is used to denote the linear regression model:

y = mx + c + e

where m is the slope of the line, c is the intercept, and e represents the error in the model.
The best fit line is determined by varying the values of m and c. The predictor error is the
difference between the observed values and the predicted values. The values of m and c are
selected in such a way that the predictor error is minimized. It is important to note that
a simple linear regression model is susceptible to outliers, so it should be applied with
care to large datasets containing many outliers.
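Fitting m and c by least squares can be sketched in a few lines of NumPy; the five data points below are made up to lie near the line y = 2x:

```python
import numpy as np

# Fit y = mx + c by least squares on toy data lying near y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 5.9, 8.2, 9.9])

m, c = np.polyfit(x, y, deg=1)  # a degree-1 polynomial is a straight line
print(m, c)                     # m close to 2, c close to 0
```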
2. Logistic Regression
Logistic regression is one of the types of regression analysis techniques, used
when the dependent variable is discrete, for example 0 or 1, or true or false. This means the
target variable can have only two values, and a sigmoid curve denotes the relation between
the target variable and the independent variable.
The logit function is used in logistic regression to measure the relationship between the target
variable and the independent variables. Below is the equation that denotes logistic
regression:

logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

where p is the probability of occurrence of the feature.
When selecting logistic regression as the regression analysis technique, note that the
data should be large, with an almost equal occurrence of the values of the target
variable. Also, there should be no multicollinearity, which means that there should be no
correlation between the independent variables in the dataset.
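A minimal sketch with scikit-learn on a made-up binary target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary target: class 1 when the single feature is 5 or more
X = np.arange(10, dtype=float).reshape(-1, 1)
y = (X.ravel() >= 5).astype(int)

# Fits the sigmoid/logit relationship described above
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[1.0], [8.0]]))
```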
3. Ridge Regression
This is another type of regression in machine learning, usually used when there is a high
correlation between the independent variables. In the case of multicollinear data, the least
squares estimates are unbiased, but their variances are large, so the estimates can lie far
from the true values. Therefore, a bias term is introduced in the equation of Ridge
Regression, trading a small amount of bias for a large reduction in variance. This is a
powerful regression method in which the model is less susceptible to overfitting.
Below is the equation used to denote Ridge Regression, where the introduction of λ
(lambda) solves the problem of multicollinearity:

β = (XᵀX + λI)⁻¹ Xᵀy
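The closed-form solution above can be computed directly with NumPy; the toy data below is generated with assumed true weights [1.5, -2.0]:

```python
import numpy as np

# Ridge solution β = (XᵀX + λI)⁻¹ Xᵀy on toy data with true weights [1.5, -2.0]
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(0, 0.01, 50)

lam = 0.1
beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta)  # near [1.5, -2.0], shrunk slightly toward zero by λ
```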
4. Lasso Regression

Lasso Regression is one of the types of regression in machine learning that performs
regularization along with feature selection. It penalizes the absolute size of the regression
coefficients. As a result, coefficient values get nearer to zero, which does not happen in
the case of Ridge Regression.
Because of this, Lasso Regression effectively performs feature selection, allowing a subset
of features from the dataset to be used to build the model. In the case of Lasso Regression,
only the required features are used, and the coefficients of the others are made zero. This
helps in avoiding overfitting in the model. If the independent variables are highly collinear,
then Lasso regression picks only one variable and shrinks the others to zero.
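The zeroing-out behaviour can be sketched with scikit-learn's Lasso on made-up data where only the first of three features matters:

```python
import numpy as np
from sklearn.linear_model import Lasso

# y depends only on the first of three features; Lasso should drive the
# coefficients of the two irrelevant features to (near) zero
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 4.0 * X[:, 0] + rng.normal(0, 0.05, 100)

model = Lasso(alpha=0.1)  # alpha controls the strength of the L1 penalty
model.fit(X, y)
print(model.coef_)
```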
5. Polynomial Regression

Polynomial regression fits a non-linear dataset by modelling the relationship between the
dependent and independent variables as an nth-degree polynomial. While trying to reduce
the Mean Squared Error to a minimum and to get the best fit line, the model can be prone
to overfitting. It is recommended to analyze the curve towards the ends, as higher-degree
polynomials can give strange results on extrapolation.
The below equation represents Polynomial Regression:

y = β0 + β1x + β2x² + ... + βnxⁿ + ε
Among the multiple types of regression models, it is important to choose the best-suited
technique based on the type of independent and dependent variables, the dimensionality
of the data, and other essential characteristics of the data.
1. Data exploration is an inevitable part of building a predictive model. It should be your first
step before selecting the right model, e.g., identifying the relationships and impact of variables.
2. To compare the goodness of fit of different models, we can analyze different metrics like the
statistical significance of parameters, R-squared, adjusted R-squared, AIC, BIC, and the error
term. Another is Mallow's Cp criterion, which essentially checks for possible bias
in your model by comparing the model with all possible submodels (or a careful selection
of them).
3. Cross-validation is the best way to evaluate models used for prediction. Here you divide
your data set into two groups (train and validate). A simple mean squared difference
between the observed and predicted values gives you a measure of the prediction accuracy.
4. If your data set has multiple confounding variables, you should not choose an automatic
model selection method, because you do not want to put these in a model at the same time.
5. It will also depend on your objective. It can occur that a less powerful model is easier to
implement than a highly statistically significant model.
6. Regression regularization methods (Lasso, Ridge, and ElasticNet) work well in the case of
high dimensionality and multicollinearity among the variables in the data set.
CHAPTER-V
The dataset contains the prices and features of residential houses sold from 2006 to 2010 in Ames, Iowa. Check the references for the dataset link.
To get familiar with this machine learning approach, we'll work with a dataset on sold houses in Ames, Iowa. Each row in the dataset describes the properties of a single house as well as the amount it was sold for. In this course, we'll build models that predict the final sale price from the other attributes. Specifically, we'll explore the following questions:
This dataset was originally compiled by Dean De Cock for the primary purpose of having a high-quality dataset for regression. Here are some of the columns:
We'll start by understanding the univariate case of linear regression, also known as simple linear regression. The following equation is the general form of the simple linear regression model:
ŷ = a1x1 + a0
ŷ represents the target column, while x1 represents the feature column we choose to use in our model. These symbols are independent of any particular dataset. On the other hand, a0 and a1 represent the parameter values that are specific to the dataset. The goal of simple linear regression is to find the optimal parameter values that best describe the relationship between the feature column and the target column. The following diagram shows different simple linear regression models depending on the data:
The first step is to select the feature, x1, that we want to use in our model. Once we select this feature, we can use scikit-learn to determine the optimal parameter values a1 and a0 based on the training data.
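A minimal sketch of that workflow, using synthetic numbers in place of the actual Gr Liv Area and SalePrice columns (so the fitted values here are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins: feature column x1 and an exactly linear target
x1 = np.array([[800], [1200], [1600], [2000]])   # e.g. living area in sq ft
y = 100 * x1.ravel() + 5000                       # e.g. sale price

lr = LinearRegression()
lr.fit(x1, y)

# coef_ holds a1 and intercept_ holds a0
print(lr.coef_[0], lr.intercept_)  # recovers approximately 100.0 and 5000.0
```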
• Create a figure with dimensions 7 x 15 containing three scatter plots in a single column:
• The first plot should plot the Garage Area column on the x-axis against the SalePrice column on the y-axis.
• The second one should plot the Gr Liv Area column on the x-axis against the SalePrice column on the y-axis.
• The third one should plot the Overall Cond column on the x-axis against the SalePrice column on the y-axis.
Mission 4: Least Squares
From the last screen, we can tell that the Gr Liv Area feature correlates the most
with the SalePrice column. We can confirm this by calculating the correlation between
pairs of these columns using the pandas.DataFrame.corr() method:
print(train[['Garage Area', 'Gr Liv Area', 'Overall Cond', 'SalePrice']].corr())
The correlation between Gr Liv Area and SalePrice is around 0.706, which is the highest. Recall that the closer the correlation coefficient is to 1.0, the stronger the correlation. Here's the updated form of our model:
ŷ = a1 * (Gr Liv Area) + a0
Let's now move on to understanding the model fitting criteria.
Let's now use scikit-learn to find the optimal parameter values for our model. The scikit-learn library was designed to make it easy to swap in and try different models. Because we're familiar with the scikit-learn workflow for k-nearest neighbors, switching to linear regression is straightforward. Instead of working with the sklearn.neighbors.KNeighborsRegressor class, we work with the sklearn.linear_model.LinearRegression class. The LinearRegression class also has its own fit() method. Specific to this model, however, are the coef_ and intercept_ attributes, which return a1 (a1 to an if it were a multivariate regression model) and a0 respectively.
In the last step, we fit a univariate linear regression model between the Gr Liv Area and SalePrice columns. We then displayed the single coefficient and the intercept value. If we refer back to the format of our linear regression model, the fitted model can be represented as:
ŷ = 116.86624683x1 + 5366.82171006
One way to interpret this model is: "for every 1 square foot increase in above-ground living area, we can expect the home's value to increase by approximately 116.87 dollars".
We can now use the predict() method to predict the labels for the training data and compare them with the actual labels. To quantify the fit, we can use the mean squared error. Let's also perform simple validation by making predictions on the test set and calculating the MSE value for those predictions as well.
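That simple-validation step can be sketched as follows; the data is synthetic, so the MSE values printed are illustrative, not the report's actual results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for living area vs. sale price, with noise
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 120 * X.ravel() + 5000 + rng.normal(0, 2000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Compare MSE on the training data with MSE on the held-out test data
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_mse, test_mse)
```

A test MSE much larger than the training MSE would suggest the model is overfitting; here the two should be of similar magnitude.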
Django is a Python-based free and open-source web framework that follows the model–template–views (MTV) architectural pattern. It is maintained by the Django Software Foundation (DSF), an independent organization established in the US as a non-profit. Django's primary goal is to ease the creation of complex, database-driven websites. The framework emphasizes reusability and "pluggability" of components, less code, low coupling, rapid development, and the principle of "don't repeat yourself", which is applied throughout, even for settings, files, and data models. Django also provides an optional administrative create, read, update and delete interface that is generated dynamically through introspection and configured via admin models. Some well-known sites that use Django include Instagram, Mozilla, Clubhouse and Bitbucket.
We used the Django web framework to deploy the model, with the front end and back end developed as per industry requirements.
Django is a high-level Python web framework that enables rapid development of secure
and maintainable websites. Built by experienced developers, Django takes care of much of
the hassle of web development, so you can focus on writing your app without needing to
reinvent the wheel. It is free and open source, has a thriving and active community, great
documentation, and many options for free and paid-for support.
Django follows the "Batteries included" philosophy and provides almost everything
developers might want to do "out of the box". Because everything you need is part of the
one "product", it all works seamlessly together, follows consistent design principles, and
has extensive and up-to-date documentation.
Versatile
Django can be (and has been) used to build almost any type of website — from content
management systems and wikis, through to social networks and news sites. It can work
with any client-side framework, and can deliver content in almost any format (including HTML, RSS feeds, JSON, XML, etc.).
Internally, while it provides choices for almost any functionality you might want (e.g.
several popular databases, templating engines, etc.), it can also be extended to use other
components if needed.
Secure
Django helps developers avoid many common security mistakes by providing a framework
that has been engineered to "do the right things" to protect the website automatically. For
example, Django provides a secure way to manage user accounts and passwords, avoiding
common mistakes like putting session information in cookies where it is vulnerable
(instead cookies just contain a key, and the actual data is stored in the database) or directly
storing passwords rather than a password hash.
Scalable
Having a clear separation between the different parts means that it can scale for increased
traffic by adding hardware at any level: caching servers, database servers, or application
servers. Some of the busiest sites have successfully scaled Django to meet their demands
(e.g. Instagram and Disqus, to name just two).
Maintainable
Django code is written using design principles and patterns that encourage the creation of
maintainable and reusable code. In particular, it makes use of the Don't Repeat Yourself
(DRY) principle so there is no unnecessary duplication, reducing the amount of code.
Django also promotes the grouping of related functionality into reusable "applications"
and, at a lower level, groups related code into modules (along the lines of the Model View
Controller (MVC) pattern).
Portable
Django is written in Python, which runs on many platforms. That means that you are not
tied to any particular server platform, and can run your applications on many flavours of
Linux, Windows, and Mac OS X. Furthermore, Django is well-supported by many web
hosting providers, who often provide specific infrastructure and documentation for hosting
Django sites.
Django was initially developed between 2003 and 2005 by a web team who were
responsible for creating and maintaining newspaper websites. After creating a number of
sites, the team began to factor out and reuse lots of common code and design patterns. This
common code evolved into a generic web development framework, which was open-
sourced as the "Django" project in July 2005.
Django has continued to grow and improve, from its first milestone release (1.0) in
September 2008 through to the recently-released version 3.1 (2020). Each release has
added new functionality and bug fixes, ranging from support for new types of databases,
template engines, and caching, through to the addition of "generic" view functions and
classes (which reduce the amount of code that developers have to write for a number of
programming tasks).
Note: Check out the release notes on the Django website to see what has changed in recent
versions, and how much work is going into making Django better.
Django is now a thriving, collaborative open source project, with many thousands of users
and contributors. While it does still have some features that reflect its origin, Django has
evolved into a versatile framework that is capable of developing any type of website.
Based on the number of high profile sites that use Django, the number of people
contributing to the codebase, and the number of people providing both free and paid for
support, then yes, Django is a popular framework!
High-profile sites that use Django include: Disqus, Instagram, Knight Foundation,
MacArthur Foundation, Mozilla, National Geographic, Open Knowledge Foundation,
Pinterest, and Open Stack (source: Django overview page).
Is Django opinionated?
Opinionated frameworks are those with opinions about the "right way" to handle any
particular task. They often support rapid development in a particular domain (solving
problems of a particular type) because the right way to do anything is usually well-
understood and well-documented. However they can be less flexible at solving problems
outside their main domain, and tend to offer fewer choices for what components and
approaches they can use.
Unopinionated frameworks, by contrast, have far fewer restrictions on the best way to glue
components together to achieve a goal, or even what components should be used. They
make it easier for developers to use the most suitable tools to complete a particular task,
albeit at the cost that you need to find those components yourself.
Django is "somewhat opinionated", and hence delivers the "best of both worlds". It
provides a set of components to handle most web development tasks and one (or two)
preferred ways to use them. However, Django's decoupled architecture means that you can
usually pick and choose from a number of different options, or add support for completely
new ones if desired.
In a traditional data-driven website, a web application waits for HTTP requests from the
web browser (or other client). When a request is received the application works out what
is needed based on the URL and possibly information in POST data or GET data.
Depending on what is required it may then read or write information from a database or
perform other tasks required to satisfy the request. The application will then return a
response to the web browser, often dynamically creating an HTML page for the browser
to display by inserting the retrieved data into placeholders in an HTML template.
Django web applications typically group the code that handles each of these steps into
separate files:
Fig6:Djangoworkingblockdiagram
• URLs: While it is possible to process requests from every single URL via a single function,
it is much more maintainable to write a separate view function to handle each resource. A
URL mapper is used to redirect HTTP requests to the appropriate view based on the request
URL. The URL mapper can also match particular patterns of strings or digits that appear
in a URL and pass these to a view function as data.
• View: A view is a request handler function, which receives HTTP requests and returns
HTTP responses. Views access the data needed to satisfy requests via models, and delegate
the formatting of the response to templates.
• Models: Models are Python objects that define the structure of an application's data, and
provide mechanisms to manage (add, modify, delete) and query records in the database.
• Templates: A template is a text file defining the structure or layout of a file (such as an
HTML page), with placeholders used to represent actual content. A view can dynamically
create an HTML page using an HTML template, populating it with data from a model. A
template can be used to define the structure of any type of file; it doesn't have to be HTML!
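A hypothetical sketch of how these four pieces fit together for a houses app; the model fields, view, template path and URL below are illustrative assumptions, not the report's actual code.

```python
# models.py - Python objects defining the structure of the data
from django.db import models

class House(models.Model):
    area = models.FloatField()
    price = models.FloatField()

# views.py - request handler that queries the model and delegates to a template
from django.shortcuts import render

def house_list(request):
    houses = House.objects.all()
    return render(request, "houses/list.html", {"houses": houses})

# urls.py - URL mapper routing a request path to the view
from django.urls import path
from . import views

urlpatterns = [
    path("houses/", views.house_list),
]
```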
Install Django
• Install an official release. This is the best approach for most users.
• Install the latest development version. This option is for enthusiasts who want the latest-
and-greatest features and aren’t afraid of running brand new code. You might encounter
new bugs in the development version, but reporting them helps the development of Django.
Also, releases of third-party packages are less likely to be compatible with the development
version than with the latest stable release.
Always refer to the documentation that corresponds to the version of Django you’re
using!
Throughout this tutorial, we’ll walk you through the creation of a basic poll application.
It’ll consist of two parts:
• A public site that lets people view polls and vote in them.
• An admin site that lets you add, change, and delete polls.
We'll assume you have Django installed already. You can tell whether Django is installed, and which version, by running the following command in a shell prompt:
$ python -m django --version
If Django is installed, you should see the version of your installation. If it isn't, you'll get an error saying "No module named django".
This tutorial is written for Django 2.2, which supports Python 3.5 and later. If the Django
version doesn’t match, you can refer to the tutorial for your version of Django by using the
version switcher at the bottom right corner of this page, or update Django to the newest
version. If you’re using an older version of Python, check What Python version can I use
with Django? to find a compatible version of Django.
See How to install Django for advice on how to remove older versions of Django and install
a newer one.
Creating a project
If this is your first time using Django, you'll have to take care of some initial setup: auto-generating some code that establishes a Django project – a collection of settings for an instance of Django, including database configuration, Django-specific options and application-specific settings.
From the command line, cd into a directory where you'd like to store your code, then run the following command:
$ django-admin startproject mysite
This will create a mysite directory in your current directory containing:
mysite/
    manage.py
    mysite/
        __init__.py
        settings.py
        urls.py
        wsgi.py
These files are:
• manage.py: A command-line utility that lets you interact with this Django project in
various ways. You can read all the details about manage.py in django-admin and
manage.py.
• The inner mysite/ directory is the actual Python package for your project. Its name is the Python package name you'll need to use to import anything inside it (e.g. mysite.urls).
• mysite/__init__.py: An empty file that tells Python that this directory should be considered a Python package. If you're a Python beginner, read more about packages in the official Python docs.
• mysite/urls.py: The URL declarations for this Django project; a “table of contents” of
your Django-powered site. You can read more about URLs in URL dispatcher.
Let's verify your Django project works. Change into the outer mysite directory, if you haven't already, and run:
$ python manage.py runserver
You may see a warning: "You have unapplied migrations; your app may not work properly until they are applied. Run 'python manage.py migrate' to apply them." You can ignore it for now; we'll deal with the database shortly.
You’ve started the Django development server, a lightweight Web server written purely in
Python. We’ve included this with Django so you can develop things rapidly, without having
to deal with configuring a production server – such as Apache – until you’re ready for
production.
Now’s a good time to note: don’t use this server in anything resembling a production
environment. It’s intended only for use while developing. (We’re in the business of making
Web frameworks, not Web servers.)
Now that the server’s running, visit http://127.0.0.1:8000/ with your Web browser. You’ll
see a “Congratulations!” page, with a rocket taking off. It worked!
By default, the runserver command starts the development server on the internal IP at
port 8000.
If you want to change the server's port, pass it as a command-line argument. For instance, this command starts the server on port 8080:
$ python manage.py runserver 8080
If you want to change the server's IP, pass it along with the port. For example, to listen on all available public IPs (which is useful if you are running Vagrant or want to show off your work on other computers on the network), use:
$ python manage.py runserver 0:8000
Now that your environment – a “project” – is set up, you’re set to start doing work.
Each application you write in Django consists of a Python package that follows a certain
convention. Django comes with a utility that automatically generates the basic directory
structure of an app, so you can focus on writing code rather than creating directories.
Projects vs. apps
What’s the difference between a project and an app? An app is a Web application that does
something – e.g., a Weblog system, a database of public records or a simple poll app. A
project is a collection of configuration and apps for a particular website. A project can
contain multiple apps. An app can be in multiple projects.
Your apps can live anywhere on your Python path. In this tutorial, we’ll create our poll app
right next to your manage.py
file so that it can be imported as its own top-level module, rather than a submodule of
mysite.
To create your app, make sure you're in the same directory as manage.py and type this command:
$ python manage.py startapp polls
That will create a directory polls, which is laid out like this:
polls/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py
Let's write the first view. Open the file polls/views.py and put the following Python code in it:
from django.http import HttpResponse

def index(request):
    return HttpResponse("Hello, world. You're at the polls index.")
This is the simplest view possible in Django. To call the view, we need to map it to a URL – and for this we need a URLconf.
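Following the standard Django tutorial layout, that mapping can be sketched with a URLconf; the polls structure here mirrors the tutorial, not the report's own app.

```python
# polls/urls.py - map the app's root path to the index view
from django.urls import path
from . import views

urlpatterns = [
    path('', views.index, name='index'),
]

# mysite/urls.py - point the project's root URLconf at the polls app
from django.contrib import admin
from django.urls import include, path

urlpatterns = [
    path('polls/', include('polls.urls')),
    path('admin/', admin.site.urls),
]
```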
Prediction page
Database setup
Now, open up mysite/settings.py. It’s a normal Python module with module-level variables
representing Django settings.
By default, the configuration uses SQLite. If you’re new to databases, or you’re just
interested in trying Django, this is the easiest choice. SQLite is included in Python, so you
won’t need to install anything else to support your database. When starting your first real
project, however, you may want to use a more scalable database like PostgreSQL, to avoid
database-switching headaches down the road.
If you wish to use another database, install the appropriate database bindings and change
the following keys in the
DATABASES 'default' item to match your database connection settings:
• NAME – The name of your database. If you’re using SQLite, the database will be a file on
your computer; in that case, NAME should be the full absolute path, including filename,
of that file. The default value, os.path.join(BASE_DIR, 'db.sqlite3'), will store the file
in your project directory.
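For example, a hypothetical DATABASES setting for PostgreSQL might look like the following; the database name, user, password and host are placeholders you would replace with your own connection details.

```python
# mysite/settings.py - illustrative PostgreSQL configuration
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydatabase',
        'USER': 'mydatabaseuser',
        'PASSWORD': 'mypassword',
        'HOST': '127.0.0.1',
        'PORT': '5432',
    }
}
```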
If you're using a database besides SQLite, make sure you've created a database by this point. Do that with "CREATE DATABASE database_name;" within your database's interactive prompt.
Also make sure that the database user provided in mysite/settings.py has “create database”
privileges. This allows automatic creation of a test database which will be needed in a
later tutorial.
CHAPTER-VI
6.1 CODE
import pandas as pd
import numpy as np
from django.shortcuts import render
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def home(request):
    return render(request, "home.html")

def predict(request):
    return render(request, "predict.html")

def result(request):
    # Load the dataset and separate the features from the target price
    data = pd.read_csv(r'C:\Users\mvrre\Downloads\USA_Housing.csv')
    data = data.drop(['Address'], axis=1)
    x = data.drop('Price', axis=1)
    y = data['Price']
    # Train a linear regression model on a 70/30 split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
    model = LinearRegression()
    model.fit(x_train, y_train)
    # Read the five feature values submitted from the form
    var1 = float(request.GET['n1'])
    var2 = float(request.GET['n2'])
    var3 = float(request.GET['n3'])
    var4 = float(request.GET['n4'])
    var5 = float(request.GET['n5'])
    # Predict the price for the submitted values
    pred = model.predict(np.array([[var1, var2, var3, var4, var5]]))
    pred = round(pred[0])
    return render(request, "predict.html", {"result": pred})
def index(request):
    if request.method == 'POST':
        # Register a new member from the signup form
        member = Member(username=request.POST['username'],
                        password=request.POST['password'],
                        firstname=request.POST['firstname'],
                        lastname=request.POST['lastname'])
        member.save()
        return redirect('/')
    else:
        return render(request, "index.html")  # template name assumed

def login(request):
    return render(request, "login.html")  # template name assumed

def home(request):
    if request.method == 'POST':
        # Check the submitted credentials against the Member table
        if Member.objects.filter(username=request.POST['username'],
                                 password=request.POST['password']).exists():
            member = Member.objects.get(username=request.POST['username'],
                                        password=request.POST['password'])
            return render(request, "home.html", {"member": member})
        else:
            return render(request, "login.html")  # back to login on failure
    return render(request, "home.html")
6.2 RESULTS
Fig 9: Home page of the house prices prediction system
6.3 BEST SUITED MODEL
Linear regression displayed the best performance on this dataset and can be used for deployment.
The Random Forest and XGBoost regressors were far behind, so they cannot be recommended for deployment.
CONCLUSION
So, our aim is achieved, as we have successfully ticked all the parameters mentioned in our aim column. We found that the circle rate is the most influential attribute in predicting house prices, that linear regression is the most effective model for our dataset, and we deployed the linear regression model using the Django web framework to predict prices for different input values.
REFERENCES
[1] Sebastian Raschka & Vahid Mirjalili, "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow", 2nd edition, Packt Publishing.
[2] Ian H. Witten, Eibe Frank, Mark A. Hall & Christopher J. Pal, "Data Mining: Practical Machine Learning Tools and Techniques", 4th edition, Morgan Kaufmann.
[3] Jake VanderPlas, "Python Data Science Handbook", O'Reilly Media.