
Business Analytics Project Report

On
“Developing a Prediction Model Using Random Forest Regressor to find the House Prices”

Submitted in partial fulfillment for award of


Master of Business Administration
Degree
In
General Management

By
Team 17
Aditya Prasad Satapathy 22BM63011
Jyotirmaya Mahapatra 22BM63057
Shashwato Das Sharma 22BM63115

Under the guidance of


Professor Sujoy Bhattacharya

Vinod Gupta School of Management


Indian Institute of Technology, Kharagpur
Kharagpur, India
November 2022
Table of Contents
Introduction
    Random Forest Regressor
Problem Statement
    Data
    File descriptions
    Data fields
    Code
        Load Dataset
        Data Cleaning
        Feature Encoding
        Preparing the test data
        Predicting the Sales Price
Uploading to Kaggle
Conclusion
Bibliography
Introduction

In today’s world, over 2.5 quintillion bytes of data are created every single day. Whether you call yourself a Data Science Engineer, Data Scientist, or ML Engineer, the art of data analysis is the first thing you should master.
Data Analysis is a process of inspecting, cleansing, transforming, and modeling data so that we can derive useful information from it and use it for future predictions. Data analysis tools make it easier for users to process and manipulate data, analyze the relationships and correlations between data sets, and identify patterns and trends for interpretation.
Machine learning is a tool for turning information into knowledge. In the past 50 years, there has
been an explosion of data. This mass of data is useless unless we analyse it and find the patterns
hidden within. Machine learning techniques are used to automatically find the valuable
underlying patterns within complex data that we would otherwise struggle to discover. The
hidden patterns and knowledge about a problem can be used to predict future events and perform
all kinds of complex decision making.
Most of us are unaware that we already interact with Machine Learning every single day. Every time we Google something, listen to a song, or even take a photo, Machine Learning is part of the engine behind it, constantly learning and improving from every interaction. It is also behind world-changing advances like detecting cancer, creating new drugs, and developing self-driving cars.
Terminology involved:
 Dataset: A set of data examples that contain features important to solving the problem.
 Features: Important pieces of data that help us understand a problem. These are fed into a Machine Learning algorithm to help it learn.
 Model: The representation (internal model) of a phenomenon that a Machine Learning
algorithm has learnt. It learns this from the data it is shown during training. The model is
the output you get after training an algorithm. For example, a decision tree algorithm
would be trained and produce a decision tree model.
Process:
 Data Collection: Collect the data that the algorithm will learn from.
 Data Preparation: Format and engineer the data into the optimal format, extracting
important features and performing dimensionality reduction.
 Training: Also known as the fitting stage, this is where the Machine Learning algorithm learns from the data that has been collected and prepared.
 Evaluation: Test the model to see how well it performs.
 Tuning: Fine tune the model to maximize its performance.
Random Forest Regressor:

A single Decision Tree is an easily understood and interpreted algorithm, but one tree alone may not be enough for the model to learn all the patterns in the data. Random Forest is also a tree-based algorithm, but it combines the outputs of multiple Decision Trees for making decisions.
It can therefore be thought of as a ‘forest’ of trees, hence the name “Random Forest”. The term ‘random’ reflects the fact that the algorithm is a forest of randomly created Decision Trees.
The Decision Tree algorithm has a major disadvantage in that it is prone to over-fitting. This problem can be limited by using Random Forest Regression in place of Decision Tree Regression. The Random Forest algorithm is also fast and more robust than many other regression models.

To summarize, the Random Forest algorithm merges the outputs of multiple Decision Trees to generate the final prediction.
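This merging can be checked directly in scikit-learn, which the report's RandomForestRegressor comes from: the forest's prediction is simply the mean of its individual trees' predictions. A minimal sketch on synthetic data (the data and parameters here are illustrative, not the project's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic regression problem: y = 2x on the integers 0..19.
X = np.arange(20).reshape(-1, 1).astype(float)
y = X.ravel() * 2.0

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Each fitted tree is available in forest.estimators_; the forest's
# prediction equals the average of the individual tree predictions.
x_new = np.array([[7.5]])
tree_preds = [tree.predict(x_new)[0] for tree in forest.estimators_]
assert np.isclose(forest.predict(x_new)[0], np.mean(tree_preds))
```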
Problem Statement
Ask a home buyer to describe their dream house, and they probably won't begin with the height
of the basement ceiling or the proximity to an east-west railroad. But this playground
competition's dataset proves that much more influences price negotiations than the number of
bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames,
Iowa, this competition challenges you to predict the final price of each home.
Data
File descriptions
train.csv - the training set
test.csv - the test set
data_description.txt - full description of each column, originally prepared by Dean De Cock but
lightly edited to match the column names used here
Data fields
Here's a brief version of what you'll find in the data description file-
 SalePrice - the property's sale price in dollars. This is the target variable that you're trying
to predict.
 MSSubClass: The building class
 MSZoning: The general zoning classification
 LotFrontage: Linear feet of street connected to property
 LotArea: Lot size in square feet
 Street: Type of road access
 Alley: Type of alley access
 LotShape: General shape of property
 LandContour: Flatness of the property
 Utilities: Type of utilities available
 LotConfig: Lot configuration
 LandSlope: Slope of property
 Neighborhood: Physical locations within Ames city limits
 Condition1: Proximity to main road or railroad
 Condition2: Proximity to main road or railroad (if a second is present)
 BldgType: Type of dwelling
 HouseStyle: Style of dwelling
And many more.
Code
Steps followed –
1. Load dataset
2. Data Cleaning
3. Feature Encoding
4. Prepare the test data
5. Predict sales price of test data
Load Dataset

The code reads the CSV file containing the training data, downloaded from Kaggle.

We use the head() function to check that the CSV file has been loaded; it outputs the first five rows of data.
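The loading code itself is not shown in this extract, so here is a minimal sketch of the step. In the project the call is simply pd.read_csv("train.csv") on the Kaggle file; the inline CSV below is a hypothetical stand-in (the real file has 81 columns and 1460 rows) so the snippet is self-contained:

```python
import io
import pandas as pd

# Hypothetical stand-in for the Kaggle train.csv; in the project the
# call is pd.read_csv("train.csv").
csv_text = """Id,LotArea,MSZoning,SalePrice
1,8450,RL,208500
2,9600,RL,181500
3,11250,RL,223500
"""
df_train = pd.read_csv(io.StringIO(csv_text))
print(df_train.head())  # shows the first five rows, confirming the load
```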

Data Cleaning

We use the isna() function to find the null values present in the dataset.
This tells us which columns contain NULL values, and how many.
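A small sketch of this NULL check (the DataFrame here is an illustrative stand-in for df_train, using column names from the Kaggle data dictionary):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df_train.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250],
})

null_counts = df.isna().sum()                 # NULLs per column
cols_with_nulls = null_counts[null_counts > 0]  # only the problem columns
print(cols_with_nulls)
```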

There are two types of data that need to be filled: INT/FLOAT data or categorical data.
For INT/FLOAT data, we can use either the Mean or the Median of the column as the standard to fill its NULL values. In our project, we have used the Mean.
For categorical (object type) data, we cannot use the Mean/Median, so we use the Mode as the standard: the value which appears most often in a column is used to fill that column's NULL values.
First, we define lists storing the names of the columns which contain NULL values. We do this in 2 steps: list_obj_col contains the object type (categorical) columns and list_num_col contains the INT/FLOAT columns.

Once the segregation is done, we define a function using fillna() to fill the NULL values in categorical data with the Mode and in INT/FLOAT data with the Mean.

Once we run the function on our training data (df_train) and check for NULL columns again, the count is 0. Thus, we have filled all the NULL values.
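The project's fillna_all() code is not shown in this extract; a possible sketch of the function described above (the demo frame's column names come from the Kaggle data dictionary, and the values are illustrative):

```python
import numpy as np
import pandas as pd

def fillna_all(df):
    """Fill NULLs: mode for object (categorical) columns, mean for numeric ones."""
    for col in df.columns[df.isna().any()]:
        if df[col].dtype == object:
            df[col] = df[col].fillna(df[col].mode()[0])  # most frequent value
        else:
            df[col] = df[col].fillna(df[col].mean())     # column mean
    return df

df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],  # numeric -> filled with mean = 70.0
    "MSZoning": ["RL", "RL", np.nan],     # categorical -> filled with mode = "RL"
})
df = fillna_all(df)
assert df.isna().sum().sum() == 0  # no NULLs remain
```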
Feature Encoding

Machine learning models can only work with numerical values, so the categorical values of the relevant attributes must be converted into numerical ones. This process is called feature encoding.
Any structured dataset typically consists of numerous columns combining numerical and categorical variables. A machine can only understand numbers; it cannot comprehend text, and machine learning algorithms work the same way. For this reason, we convert categorical columns to numerical columns so that a machine learning algorithm can understand them. This process is also called categorical encoding.
Before diving into methods of encoding, we first understand the different types of categorical
data.
Categorical data can be classified into two types:
a) Nominal Data: A set of variables comprising a finite set of discrete values with no
order or relationship between them.
b) Ordinal Data: A variable made up of a limited number of discrete values that are sorted
in order of importance.
There are various ways of handling categorical variables; two methods are most widely used in
machine learning. These two methods are:

a) Label Encoding: This method is very straightforward and involves turning each value in a
column into a number. This type of encoding is used in case we have ordinal data.
b) One-Hot Encoding: One-hot encoding divides the column into many columns to
transform the categorical data into numerical data. Depending on which column carries
the value, the numbers are changed to 1s or 0s. This type of encoding is used in case we
have nominal data.
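Both encodings can be sketched in pandas (the values are illustrative; note that pd.get_dummies keeps all k categories by default, while drop_first=True yields the k-1 columns counted in the example below):

```python
import pandas as pd

df = pd.DataFrame({"PavedDrive": ["Y", "N", "P", "Y"]})

# Label encoding: each category becomes an integer code (suits ordinal data).
df["PavedDrive_label"] = df["PavedDrive"].astype("category").cat.codes

# One-hot encoding: one indicator column per category (suits nominal data).
one_hot = pd.get_dummies(df["PavedDrive"], prefix="PavedDrive")
print(one_hot.columns.tolist())
# ['PavedDrive_N', 'PavedDrive_P', 'PavedDrive_Y']
```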
One problem we face during feature encoding is the addition of extra categories in the test data.
For example, our training dataset has a column called ‘PavedDrive’ with 3 unique values: ‘Y’, ‘N’ and ‘P’. After one-hot encoding, we get 2 new columns in the training dataset. But if the test dataset contains any other value, say ‘K’, in the ‘PavedDrive’ column, one-hot encoding it will create 3 new columns. This extra column will not be recognized by our model, since the model was trained with only 2 columns.

Figure: fillna_all() and encode_all() applied separately to the train and test datasets.

If we fill the NULL values and encode the two datasets separately, the above-mentioned problem exists. To deal with it, we concatenate both datasets into one and apply the fill and encode functions to the combined dataset.

Figure: fillna_all() and encode_all() applied to the concatenated train and test datasets.

Since all 4 values ‘Y’, ‘N’, ‘P’ and ‘K’ are present in the concatenated dataset, the inconsistency described above is resolved.

As discussed above, we read the test CSV file into df_test and concatenate the test and train data
into df_train_test.
We fill the NULL values using our predefined function fillna_all().

We then encode the concatenated dataset to get the final dataset df_train_test_final.
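A self-contained sketch of this concatenate, fill, and encode pipeline (toy frames stand in for the real df_train and df_test, and the project's fillna_all() and encode_all() are inlined here):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real Kaggle frames.
df_train = pd.DataFrame({"PavedDrive": ["Y", "N", "P"],
                         "LotArea": [8450, 9600, np.nan]})
df_test = pd.DataFrame({"PavedDrive": ["Y", "K", np.nan],
                        "LotArea": [10500, 7400, 11200]})

# Concatenate so every category seen in EITHER file gets an encoded column.
df_train_test = pd.concat([df_train, df_test], ignore_index=True)

# Fill NULLs: mode for categorical columns, mean for numeric (fillna_all()).
for col in df_train_test.columns:
    if df_train_test[col].dtype == object:
        df_train_test[col] = df_train_test[col].fillna(df_train_test[col].mode()[0])
    else:
        df_train_test[col] = df_train_test[col].fillna(df_train_test[col].mean())

# One-hot encode the combined frame (encode_all()).
df_train_test_final = pd.get_dummies(df_train_test)
print(df_train_test_final.columns.tolist())
```

Because the rare category 'K' appears in the combined frame, the encoded output contains a PavedDrive_K column for every row, which is exactly what resolves the train/test mismatch.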
This is roughly what our final dataset looks like: 252 categorical (one-hot) columns + 37 numerical columns = 289 columns in total, with the first 1460 rows forming X_train and the remaining 1459 rows forming X_test.

Preparing the test data

After filling the NULL values and encoding, we separate the train and test data again into
X_train and X_test.
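The split can be sketched as simple positional slicing (a toy frame stands in for df_train_test_final; in the project n_train is 1460):

```python
import pandas as pd

# Toy combined frame; the real df_train_test_final has 2919 rows.
df_train_test_final = pd.DataFrame({"feat": range(5)})
n_train = 3  # in the project: 1460 training rows, 1459 test rows

# The training rows come first in the concatenated frame, so slicing
# by position recovers the original split.
X_train = df_train_test_final.iloc[:n_train].reset_index(drop=True)
X_test = df_train_test_final.iloc[n_train:].reset_index(drop=True)
assert len(X_train) + len(X_test) == len(df_train_test_final)
```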

Predicting the Sales Price


We use the RandomForestRegressor to predict the Sales Price.

We fit the model on X_train (the training data) and predict the Sale Price for X_test (the test dataset). y_predict is an array holding the predicted Sale Prices. We then create the submission file by combining the ‘Id’ column and the ‘SalePrice’ column into a CSV file named ‘submission_1.csv’.
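A sketch of this final step (synthetic features stand in for the encoded Ames data; the Id range starting at 1461 matches the Kaggle test file, but everything else here is illustrative, not the project's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for the encoded X_train / X_test frames.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.random((50, 4)), columns=list("abcd"))
y_train = X_train["a"] * 100000 + 50000  # synthetic SalePrice target
X_test = pd.DataFrame(rng.random((10, 4)), columns=list("abcd"))

# Fit the Random Forest and predict the Sale Price for the test rows.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Build the submission as the report describes: Id + SalePrice columns.
submission = pd.DataFrame({"Id": range(1461, 1461 + len(X_test)),
                           "SalePrice": y_predict})
submission.to_csv("submission_1.csv", index=False)
```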

Uploading to Kaggle
Figure: Kaggle score and leaderboard position (screenshots).

Conclusion
This project helped us build a fundamental grasp of Business Analytics as a decision-making tool. We chose the Random Forest Regressor for our model because of certain advantages over other models: it reduces the overfitting seen in single decision trees and thereby improves accuracy, and it is flexible enough to handle both classification and regression problems. It also does not require data normalization, as it uses a rule-based approach.

The project also highlights the Random Forest Regressor's limitations: it demands substantial computational power and resources because it builds numerous trees and combines their outputs, and it requires considerable training time for the same reason. The approach may be applied to various decision-making scenarios as long as enough data is available for the model to run properly without over-fitting or under-fitting. Due to the ensemble of decision trees, the model also suffers in interpretability, making it harder to determine the significance of each variable.

Bibliography
 House Prices - Advanced Regression Techniques
 Coding and model development: Google Colab link
