Submitted by
Name: Basavaspoorti Vantamuri
Company Name

Guide
Name: Shrikant Athanikar
Designation: Assistant Professor
Affiliation:

Co-Guide
Name: Ramakrishna S
Designation: Technical Director
Affiliation: Director

S.S.E. T’S
CERTIFICATE
This is to certify that the Machine Learning internship project work entitled “House Price Prediction” was carried out at “Digital Geetham”, Bangalore by Ms. Basavaspoorti Vantamuri (2BU18CS400), a bona fide student of S G BALEKUNDRI INSTITUTE OF TECHNOLOGY, Belagavi, in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering of VISVESVARAYA TECHNOLOGICAL UNIVERSITY, Belagavi, during the year 2020-21. It is certified that all corrections and suggestions indicated during internal assessment have been incorporated in the report deposited in the departmental library. The internship report has been approved, as it satisfies the academic requirements in respect of internship work prescribed for the said degree.
I, Basavaspoorti Vantamuri, bearing USN 2BU18CS400, a student of the seventh semester of B.E. in Computer Science and Engineering, S. G. Balekundri Institute of Technology, Belagavi, hereby declare that the internship project work entitled “HOUSE PRICE PREDICTION” has been carried out and duly executed by me at “Digital Geetham”, Bangalore under the guidance of Mr. Ramakrishna S, Technical Director, Digital Geetham, Bangalore and Mr. Manoj M.S, Director, Digital Geetham, Bangalore, and is submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering of Visvesvaraya Technological University, Belagavi, during the academic year 2020-2021.
1. INTRODUCTION
2. COMPANY DESCRIPTION
   Mission
   Vision
3. OBJECTIVE
4. METHODOLOGY
   4.1 Linear Regression
5. EXPECTED OUTCOME
   Outcome of Internship
6. REFERENCES
House Price Prediction
CHAPTER 1
INTRODUCTION
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning
generally is to understand the structure of data and fit that data into models that can be
understood and utilized by people. Although machine learning is a field within computer
science, it differs from traditional computational approaches. In traditional computing,
algorithms are sets of explicitly programmed instructions used by computers to calculate or
solve problems. Machine learning algorithms instead allow computers to train on data inputs
and use statistical analysis in order to output values that fall within a specific range. Because of
this, machine learning facilitates computers in building models from sample data in order to
automate decision making processes based on data inputs.
Machine learning is a form of artificial intelligence (AI) that teaches computers to think in a
way similar to how humans do: learning from and improving upon past experience. It works by
exploring data and identifying patterns, with minimal human intervention. Almost any task
that can be completed with a data-defined pattern or set of rules can be automated with machine
learning. This allows companies to automate processes that were previously only possible for
humans to perform, such as responding to customer service calls, bookkeeping, and reviewing
resumes.
What Is Learning? Rats Learning to Avoid Poisonous Baits: rats normally come upon food
items, judge them by their look and smell, and begin by eating small amounts; further feeding
depends on the food's flavor and its physiological effect. If a food makes the rat ill, the rat
will not touch that food again. The machine learning mechanism mirrors this animal use of
past experience to acquire expertise, here in judging food safety. If, by mistake, the
experience with a food is labeled incorrectly, the animal's future predictions will also be
negatively affected. Inspired by this example of successful learning, we can outline a typical
machine learning task: programming a machine that learns how to filter spam e-mails. A
naive solution would be superficially similar to the way rats learn to keep away
from poisonous baits.
The machine would simply memorize all past e-mails that had been labeled as
spam by the human user. When a new e-mail arrives, the machine searches for it in
the set of previously seen spam e-mails. If it matches one of them, it is trashed; otherwise,
it is moved to the user's inbox folder.
While this “learning by memorization” approach is sometimes useful, it lacks an important
aspect of learning systems: the ability to label unseen e-mail messages. A successful learner
should be able to progress from individual examples to broader generalization. This is also
called inductive reasoning or inductive inference. In the bait-shyness example presented
previously, after the rats encounter an example of a certain type of food, they apply their
attitude toward it to new, unseen examples of food with a similar smell and taste. To
achieve generalization in the spam-filtering task, the learner can scan the previously seen
e-mails and extract a set of words whose appearance in an e-mail message is
characteristic of spam. Then, when a new e-mail arrives, the machine can check whether one
of the suspicious words appears in it, and predict its label accordingly. Such
a system would plausibly be able to correctly predict the labels of unseen e-mails.
Tasks Beyond Human Capabilities: another whole family of tasks that benefit from machine
learning techniques is the analysis of very large and complex data sets: astronomical data,
turning medical archives into medical knowledge, weather prediction, analysis of genomic
data, web search engines, and electronic commerce. With more and more digitally recorded
data available, it becomes evident that there are treasures of meaningful information buried
in data archives that are far too large and too complex for humans to make sense of.
Learning to detect meaningful patterns in large and complex data sets is a promising domain
in which the combination of programs with almost unlimited memory capacity and the
ever-increasing processing speed of computers opens up new horizons.
Supervised versus Unsupervised: since learning involves an interaction between the learner
and the environment, one can divide learning tasks according to the nature of that
interaction. The first distinction to note is the difference between supervised and
unsupervised learning. As an illustrative example, consider the task of learning to detect
spam e-mail versus the task of anomaly detection. For the spam-detection task, we consider a
setting in which the learner receives training e-mails for which the label spam/not-spam is
provided. On the basis of such training, the learner should figure out a rule for labeling a
newly arriving e-mail message. In contrast, for the task of anomaly detection, all the
learner receives as training is a large body of e-mail messages (with no labels), and the
learner's task is to detect “unusual” messages.
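The contrast above can be sketched in a few lines of Python. The data here is synthetic (randomly generated), and scikit-learn's LogisticRegression and IsolationForest merely stand in for a spam classifier and an anomaly detector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Supervised: every training point comes with a label (think spam = 1, not spam = 0)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y)          # the learner is shown the labels
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: same points, no labels; the learner flags "unusual" points itself
detector = IsolationForest(random_state=0).fit(X)
flags = detector.predict(X)                   # +1 = normal, -1 = unusual
print("points flagged unusual:", int((flags == -1).sum()))
```

The key difference is visible in the `fit` calls: the supervised learner receives `y`, the unsupervised one only receives `X`.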
CHAPTER 2
COMPANY DESCRIPTION
Company Profile:
Digital Geetham was originally incepted in 2017 to focus only on EPC assignments for infrastructure
and real-estate projects. Later the company shifted its focus toward software development,
training, BPO, sourcing, the food business, health care, and strategic advisory services. Northern
Software was born in 2013 with the objective of creating a landmark initiative by a group
of highly qualified, technology-oriented professionals in the software domain: a software
development firm headquartered in Vijaypur and operating for 5 years with 3 offices across
Karnataka. Since its inception in 2013, the Tech Fortune group has grown rapidly with the help of
its valued customers, professionals, and business associates, who have continuously
contributed to and monitored the company's business activities in the areas of project
management, education consultancy, QMS and Six Sigma implementation, and many other
domains of expertise.
Digital Geetham is an emerging technology organization in the fields of business process
outsourcing, software development, end-to-end ERP solutions, artificial intelligence, and
blockchain technology, with a focus on providing customized solutions to the various business
needs of a diverse global clientele.
Mission:
Being slow and steady, our mission is to gain the confidence of our clients and by dint of our
integrity, innovation and dynamism, deliver their requirements on time with full quality thus
bridging the gap between demand and delivery.
Vision:
With an unyielding focus on integrity and backed by strong founders and management team,
Sourcing wants to make a mark in the field of IT services by applying innovation to simplify
complex business processes and add value to clients’ business.
CHAPTER 3
Objective:
Machine learning has for years played a major role in image detection, spam recognition,
natural speech commands, product recommendation, and medical diagnosis. Present machine
learning algorithms help us enhance security alerts, ensure public safety, and improve
medical care. Machine learning systems also provide better customer service and
safer automobile systems. In this report we discuss the prediction of future
housing prices generated by a machine learning algorithm. To select a prediction
method, we compare and explore various prediction methods. We use lasso regression as
our model because of its adaptable and probabilistic approach to model selection. Our
results show that this approach to the problem is successful, and that it can produce
predictions comparable to those of other house price prediction models. Moreover, beyond
housing value indices, the development of a housing price prediction model can aid
the development of real-estate policies. This study uses machine learning
algorithms as a research method to develop housing price prediction models. We build
housing price prediction models based on machine learning algorithms such as
XGBoost, lasso regression, and neural networks, and compare their prediction accuracy. We
then recommend a housing price prediction model to support a house seller or a real-estate
agent in making a better-informed valuation of a house. The experiments
show that the lasso regression algorithm, in terms of accuracy, consistently outperforms the
other models for housing price prediction.
Related Work:
We assume familiarity with the basic Python libraries (if not, go through the tutorial
referenced above). We are going to use the same libraries we used last time, with the
addition of seaborn, another Python library used for data visualization.
Last time, we worked on a dataset about Titanic passengers; we knew what
happened to the Titanic, and we didn't need to skim through the dataset. But most of the time,
data scientists are given data that is unknown to them. It is very important to know the data
in depth.
Today we are going to work on a dataset that contains information about the
location of a house, its price, and other aspects such as square footage. When we work on this
sort of data, we need to see which columns matter for us and which do not. Our main aim
is to build a model that gives a good prediction of the price of a house based
on the other variables. We are going to use linear regression on this dataset and see whether
it gives us good accuracy.
What is good accuracy? It depends on the sort of data we are working with: for
credit-risk data an accuracy of 80% may not be good enough, while for an NLP task it might
be. So we can't define “good accuracy” universally, but anything above 85% is generally
considered good. Our aim on this dataset is to achieve an accuracy score of 85%+.
First things first: we import our libraries and the dataset, then look at the head of the data
to see what it looks like, and use the describe function to see the percentiles and other key
statistics.
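The report's dataset is not bundled here, so the sketch below builds a small stand-in DataFrame. The column names (`price`, `bedrooms`, `sqft_living`) are assumptions based on the description in the text, just to show what `head` and `describe` report:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100
# Stand-in for the house dataset; columns and value ranges are assumed for illustration
df = pd.DataFrame({
    "price": rng.integers(75_000, 800_000, n),
    "bedrooms": rng.integers(1, 6, n),
    "sqft_living": rng.integers(290, 13_450, n),
})
print(df.head())        # the first five rows: a quick feel for the data
print(df.describe())    # count, mean, std, min, 25%/50%/75% percentiles, max
```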
Look at the bedrooms column: the dataset has a house with 33 bedrooms, which seems
to be a massive house; it would be interesting to learn more about it as we progress.
The maximum living area is 13,450 square feet, whereas the minimum is 290, so the data is widely spread.
Similarly, we can infer many things just by looking at the output of describe.
Now we are going to look at some visualizations and see what we can infer from them.
As we can see from the visualization, 3-bedroom houses are the most commonly sold, followed by
4-bedroom houses. How is this useful? A builder with this data can plan a new building with
more 3- and 4-bedroom units to attract more buyers.
So now we know that 3- and 4-bedroom houses sell best. But in which locality?
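Without the original data, the bedroom-count inference can be illustrated with an assumed handful of sales; with the real dataset, seaborn's `countplot` draws the same information as a bar chart:

```python
import pandas as pd

# Bedroom counts for a handful of sales (values assumed for illustration);
# with real data: sns.countplot(x="bedrooms", data=df) gives the bar chart
bedrooms = pd.Series([3, 3, 4, 3, 2, 4, 3, 5, 4, 3, 2, 3], name="bedrooms")
counts = bedrooms.value_counts()
print(counts)
print("most common bedroom count:", counts.idxmax())   # 3-bedroom houses dominate
```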
Let us start by asking whether the price is affected by the living area of the house: price
vs. square feet, and price vs. longitude.
The plots used above are scatter plots; a scatter plot shows how the data
points are distributed and is usually used for two variables. From the first figure we can see
that the larger the living area, the higher the price, though the data is concentrated toward
a particular price zone; still, the points appear to lie in a roughly linear direction. Thanks
to the scatter plot, we can also spot irregularities: the house with the largest square footage
sold for very little, so either another factor is at play or the data point is simply wrong. The
second figure shows the location of the houses in terms of longitude, and it yields quite
an interesting observation: houses between longitudes -122.2 and -122.4 sell at much higher prices.
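A scatter plot like the one described can be reproduced with matplotlib on synthetic data; the roughly linear price-area relation below is an assumption for illustration:

```python
import matplotlib
matplotlib.use("Agg")               # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
sqft = rng.uniform(290, 13_450, 200)
# Assumed: price grows roughly linearly with living area, plus noise
price = 250 * sqft + rng.normal(0, 150_000, 200)

fig, ax = plt.subplots()
ax.scatter(sqft, price, s=8)
ax.set_xlabel("Living area (sq ft)")
ax.set_ylabel("Price")
fig.savefig("price_vs_sqft.png")
print("correlation:", np.corrcoef(sqft, price)[0, 1])
```

The strong positive correlation is exactly what the "more living area, more price" observation in the text describes.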
Similarly, we can compare other factors.
As we can see from all the representations above, many factors affect the price of a
house: square footage increases the price, and even location influences it.
Now that we are familiar with these representations and can tell our own story, let us move on
and create a model that predicts the price of a house based on the other factors, such
as square footage and waterfront. So what is linear regression, and how do we do it?
Linear Regression:
In simple terms, linear regression is a statistical model that helps us predict the future based
on the past relationship between variables. So when the data points on your scatter plot are
placed roughly linearly, you know regression can help!
Regression works on the line equation y = mx + c: a trend line is fitted through the data points
to predict the outcome.
The variable we are predicting is called the criterion variable and is referred to as Y. The
variable we are basing our predictions on is called the predictor variable and is referred to
as X. When there is only one predictor variable, the method is called simple regression; if
multiple predictor variables are present, it is multiple regression. Let's look at the code.
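The report's code listing is not reproduced here; a minimal scikit-learn sketch on synthetic single-predictor data (the slope, intercept, and noise level below are chosen for illustration) might look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
sqft = rng.uniform(290, 13_450, (500, 1))           # one predictor: living area
price = 250 * sqft[:, 0] + 40_000 + rng.normal(0, 20_000, 500)

# Hold out a test set so the score reflects generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(
    sqft, price, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)    # fits y = m*x + c
print("slope m:", model.coef_[0])
print("intercept c:", model.intercept_)
print("R^2 on held-out data:", model.score(X_test, y_test))
```

With the real dataset, `sqft` would be a column of the house DataFrame and `price` the target column.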
CHAPTER 4
METHODOLOGY
Linear regression:
Simple linear regression is a statistical method that allows us to summarize and study the
relationship between two continuous (quantitative) variables. One variable, denoted x, is
regarded as the predictor, explanatory, or independent variable; the other variable, denoted
y, is regarded as the response, outcome, or dependent variable.
Multiple regression analysis is very nearly the same as simple linear regression.
The only distinction between simple linear regression and multiple regression is
the number of predictors (“x” variables) used in the regression.
Multiple regression uses several “x” variables for each observation of the dependent
variable: ((x1)1, (x2)1, (x3)1, Y1).
In one-variable linear regression, you might fit one dependent variable
(e.g. “sales”) against one independent variable (e.g. “profit”). But you might be
interested in how different kinds of sales influence the regression. You could set your
X1 as one kind of sales, your X2 as another kind of sales, and so on.
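A multiple regression with two such "x" variables can be sketched on synthetic data (the coefficient values 3.0 and 1.5 below are chosen for illustration, not taken from the report):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)            # one kind of sales
x2 = rng.normal(size=n)            # another kind of sales
y = 3.0 * x1 + 1.5 * x2 + 2.0 + rng.normal(0, 0.1, n)

X = np.column_stack([x1, x2])      # several "x" variables per observation
model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)    # recovers roughly [3.0, 1.5]
print("intercept:", model.intercept_)  # recovers roughly 2.0
```

The fitted coefficients recover the true weights, showing how each predictor's separate influence on the response is estimated.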
Lasso Regression
Lasso regression is one of the regression models available for analyzing data. Below, the
regression model is demonstrated on a sample, and the formula is recorded for reference.
Lasso regression is one of the regularization methods that produces parsimonious models
in the presence of a large number of features, where “large” means either of the
following two things:
Large enough to increase the model's tendency to over-fit (as few as ten variables can
cause overfitting).
Large enough to cause computational challenges. This situation can arise in the case
of millions or billions of features.
Lasso regression performs L1 regularization: it adds a penalty equal to the
absolute value of the magnitude of the coefficients. The minimization objective is
as follows:
Minimization objective = LS Obj + λ × (sum of absolute values of coefficients),
where LS Obj stands for the least-squares objective, which is nothing but the linear regression
objective without regularization, and λ is the tuning factor that controls the
amount of regularization. The bias increases with increasing values of λ, and
the variance decreases as the amount of shrinkage (λ) increases.
The tuning parameter λ thus controls the strength of the penalty: when λ = 0, we
get the same coefficients as simple linear regression; as λ → ∞, all coefficients
shrink to zero; and when 0 < λ < ∞, we get coefficients between zero and those of simple
linear regression. Thus, when λ is between the two extremes, we are balancing these
two goals.
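The effect of λ can be demonstrated with scikit-learn's Lasso (which names the tuning factor `alpha`), on synthetic data where only two of ten features truly matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))               # 10 features, only 2 truly matter
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.5, 200)

results = {}
for lam in [0.01, 1.0, 100.0]:               # sklearn calls lambda "alpha"
    coef = Lasso(alpha=lam).fit(X, y).coef_
    results[lam] = int((coef != 0).sum())
    print(f"lambda={lam}: non-zero coefficients = {results[lam]}")
```

Small λ keeps most coefficients alive (near-plain least squares); very large λ shrinks every coefficient to exactly zero, matching the two extremes described above.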
Gradient boosting is a machine learning technique for regression and classification problems
that produces a prediction model in the form of an ensemble of weak prediction
models.
The accuracy of a predictive model can be improved in two ways: either by embracing
feature engineering, or by applying boosting algorithms straight away. There
are quite a few boosting algorithms, including:
Gradient Boosting
XGBoost
AdaBoost
GentleBoost, etc.
Each boosting algorithm has its own underlying math, and slight variations are observed
when applying them.
Boosting is one of the most powerful learning ideas introduced in the last twenty
years. It was designed for classification problems, but it can be extended to regression
as well. The inspiration for gradient boosting was a technique that combines the outputs
of many “weak” classifiers to produce a powerful “committee”. A weak classifier (e.g. a
decision tree) is one whose error rate is only slightly better than random guessing.
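As a sketch of the committee idea, scikit-learn's GradientBoostingRegressor (used here in place of the report's XGBoost, which is a separate library) stacks shallow trees, each fit to the residuals of the committee built so far; the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, (400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 400)   # a nonlinear target plus noise

# Each of the 200 stages fits a depth-2 ("weak") tree to the current residuals;
# learning_rate scales each tree's contribution to the committee
model = GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                  learning_rate=0.1, random_state=0)
model.fit(X, y)
print("training R^2:", model.score(X, y))
```

No single depth-2 tree could fit the sine curve well, but the committee of 200 of them does, which is exactly the "weak learners into a powerful committee" idea.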
CHAPTER 5
EXPECTED OUTCOME
Outcome of Internship
Internships offer students a hands-on opportunity to work in their desired field, to learn how
their course of study applies to the real world, and to build valuable experience that makes
them stronger candidates for jobs after graduation. An internship can be an excellent way to
“try out” a certain career.
An internship provides an opportunity to gain valuable experience in a career field. It's a great
way to gain specific skills and knowledge, as well as to make contacts and build confidence. More
and more, employers are using work experience as a screening device to assess the skills and
abilities of prospective employees. Employers like to see that a candidate has had some type of
related experience before they consider them for a position. Having additional work experience
before applying for a job gives an edge over other candidates in a competitive job market.
Students can also use an internship to gain experience in a particular career and to create a
network of contacts. Some interns find permanent, paid employment with the organizations for
which they worked upon completion of the internship. Unlike a trainee program, however,
employment at the completion of an internship is not guaranteed.
CHAPTER 6
REFERENCES
1. https://towardsdatascience.com/create-a-model-to-predict-house-prices-using-python-d34fe8fad88f
2. https://github.com/Shreyas3108/house-price-prediction
3. https://www.ijitee.org/wp-content/uploads/papers/v8i9/I7849078919.pdf
4. Gupta and Das (2010). Forecasting the US Real House Price Index: Structural and Non-Structural Models with and without Fundamentals.
5. Rangan Gupta. “Forecasting US real house price returns over 1831–2013: evidence from copula models.”