Name: Sakhil Pant

Roll no: 21319 (MBA Spring 2021 - KUSOM)

Data Pre-Processing and Regression Analysis


Details of Data
The MagicBricks.csv file has 11 columns and 1,259 rows. The columns Furnishing,
Locality, Status, Transaction, and Type contain character (string) data, while the columns Area, BHK,
Bathroom, Parking, Price, and Per_Sqft contain numerical data. On closer examination, however,
BHK, Parking, and Bathroom, although numeric, take only a small set of repeated values, so they
are treated as categorical data alongside Furnishing, Locality, Status, Transaction, and Type.
Area, Price, and Per_Sqft, on the other hand, are continuous variables.
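This distinction can be checked quickly in R. The snippet below is a sketch using a small made-up sample (the values are hypothetical, not from the dataset): str() shows each column's storage type, and table() reveals that a numeric column such as BHK takes only a few repeated values, which is why it is treated as categorical.

```r
# Hypothetical mini-sample mimicking three MagicBricks columns
sample_df <- data.frame(
  Area = c(800, 1200, 950, 1100),                # continuous numeric
  BHK = c(2, 3, 2, 3),                            # numeric but repetitive
  Furnishing = c("Semi-Furnished", "Furnished",
                 "Semi-Furnished", "Furnished")   # character
)
str(sample_df)        # storage type of each column
table(sample_df$BHK)  # few distinct values => treat as categorical
```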

Step 1: Loading Packages and .CSV Files:


Since the dataset contains numerous missing values, it needs to be preprocessed.
This requires functions obtained by installing and loading the necessary libraries. Here I
installed and loaded dplyr and stringr for deleting, editing, or adding values in the dataset,
and caTools for splitting the data into training and testing sets. Then MagicBricks.csv was
loaded using the read.csv() function.
#Loading Packages
install.packages("stringr")
library(dplyr)
library(stringr)
library(caTools)
#Reading MagicBricks.csv File
csvdata <- read.csv("D:/Studies/KUSOM/Sem III/Big Data Analytics/assignment/2/MagicBricks.csv")
View(csvdata)


Step 2: Data Pre-Processing – Cleaning the data


Searching for every missing or NA value in each column manually is not feasible, so
sapply() was used to count the number of missing values in each column. A Mode function
was also created to calculate the mode of the categorical variables. The Locality column was
then removed because it has 365 distinct values, which won't help the model gain any insight.
Next, using the Mode function, every missing or NA value was replaced with the mode of its
column, except in the Per_Sqft column, where missing or NA values were replaced by the
column's mean since its data is continuous. Finally, sapply() was used again to check for
any remaining missing values.
#Function for Mode Calculation
Mode <- function(z) {
  val <- unique(z)
  val[which.max(tabulate(match(z, val)))]
}
#Finding empty cells and cells with NA value
sapply(csvdata, function(val) {
  if (typeof(val) != "character") {
    sum(is.na(val))
  } else {
    sum(val == "")
  }
})
str(csvdata)
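To illustrate how the Mode() function works, it can be checked on a small made-up vector (the values here are hypothetical, not taken from the dataset): unique() collects the distinct values, match() maps each element to its position in that list, and tabulate() counts how often each position occurs, so which.max() picks the most frequent value.

```r
Mode <- function(z) {
  val <- unique(z)
  val[which.max(tabulate(match(z, val)))]
}
# "Semi-Furnished" appears twice, so it is the mode
Mode(c("Furnished", "Semi-Furnished", "Semi-Furnished"))
# [1] "Semi-Furnished"
```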


#Removing Locality Column as it has many unique values.


length(unique(csvdata$Locality))
#output 365
csvdata <- subset(csvdata,select = -(Locality))

#Calculation of Mode values for categorical data


modvalFurnishing <- Mode(csvdata$Furnishing)
modvalType <- Mode(csvdata$Type)
modvalBathroom <- Mode(csvdata$Bathroom)
modValParking <- Mode(csvdata$Parking)

#Replacing Missing Values with mode values


csvdata$Furnishing <- ifelse(csvdata$Furnishing =="",modvalFurnishing,csvdata$Furnishing)
csvdata$Type <- ifelse(csvdata$Type=="",modvalType,csvdata$Type)
csvdata$Bathroom <- ifelse(is.na(csvdata$Bathroom),modvalBathroom,csvdata$Bathroom)
csvdata$Parking <- ifelse(is.na(csvdata$Parking),modValParking,csvdata$Parking)


#Replacing missing values in Per_Sqft with its mean value


csvdata$Per_Sqft[is.na(csvdata$Per_Sqft)] <- mean(csvdata$Per_Sqft, na.rm = TRUE)
#Rechecking for any empty cells or cells with NA value
sapply(csvdata, function(val) {
  if (typeof(val) != "character") {
    sum(is.na(val))
  } else {
    sum(val == "")
  }
})
str(csvdata)


Step 3: Linear Regression


Linear regression in R works with numeric and factor (categorical) values, so first
I converted the columns with character (string) values into factors; only the Type,
Transaction, Status, and Furnishing columns needed conversion. While the numeric categorical
variables BHK, Bathroom, and Parking could also be converted into factors, this is not strictly
necessary since R can process them as numbers. It is nonetheless important for each variable to
be in the right data type.
The data was then divided into a training set and a test set, split 80% and 20%
respectively using the sample.split() function. By calling set.seed(), the model produces the
same training and test split every time it runs, on any system. I used the training set to fit
the linear regression model, predicted prices with it, and compared the predicted prices with
the actual ones. Two graphs were then plotted for a clearer view: the first is a scatter plot
of actual price against predicted price, and the second is a line graph where the predicted
price (in brown) is overlaid on the actual price (in blue) to show how accurate the predictions
are. Finally, the average distance between the actual and predicted values (RMSE) was
calculated, which indicates the model's accuracy.

#Converting string values to factor values


csvdata$Type <- as.factor(csvdata$Type)
csvdata$Transaction <- as.factor(csvdata$Transaction)
csvdata$Status <- as.factor(csvdata$Status)
csvdata$Furnishing <- as.factor(csvdata$Furnishing)

#Property data
str(csvdata)

#Every time the program runs, the same split is generated


set.seed(10)


#Splitting into Training and Testing Data


#Training Data = 80% of data & Testing Data = remaining 20% of data
#sample.split() expects the outcome vector, not the whole data frame
splittedData <- sample.split(csvdata$Price, SplitRatio = 0.8)
trainData <- subset(csvdata, splittedData == TRUE)
testData <- subset(csvdata, splittedData == FALSE)
#Creating Multiple Regression Model
regModel <- lm(Price ~ ., data = trainData)
summary(regModel)


#Prediction of Test Result


pricePredicted<- predict(regModel, newdata = testData)

#Scatter Chart showing Actual & Predicted Price starting from origin
plot(testData$Price,pricePredicted, xlab = "Actual Price",ylab = "Predicted price",
xlim=c(0,max(testData$Price)),ylim=c(0,max(pricePredicted)))

#y=mx+c with slope=1, intercept=0


#points that lie on the line indicate actual and predicted prices are equal
abline(a=0,b=1)

#Line graph to compare actual and predicted values of test data


plot(testData$Price,ylab = "Price " , type = 'l', lty = 1, col = "blue")

#Adding predicted value to existing plot of actual values


lines(pricePredicted, type = "l",lty = 1, col = "brown")
legend("top", legend=c("Actual", "Predicted"),
col=c("blue", "brown"), lty = 1,
bty='n',horiz = TRUE, cex=0.4)


#Calculating accuracy (RMSE) - the average distance between the actual and predicted values
rmse <- sqrt(mean((testData$Price - pricePredicted)^2))
rmse
#output = 1691254
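The RMSE formula can be verified on a small made-up vector of actual and predicted prices (toy numbers, not from the test set): the differences are squared, averaged, and the square root taken, so larger errors are penalized more heavily.

```r
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
# errors: 10, -10, 30; squared: 100, 100, 900; mean = 366.67; sqrt ~ 19.15
rmse_toy <- sqrt(mean((actual - predicted)^2))
rmse_toy
```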

Conclusion:
From the first plot, we can see that the majority of the scatter points lie close to the
line y = a + bx with b = 1 and a = 0, which means the actual and predicted prices are very close
to each other. However, there are also many outliers far from the line, so more data is needed
to train the model for better predictions. The second graph shows a similar result, with the
bulk of the data predicted fairly accurately. Lastly, the RMSE value, which uses the root mean
square error to measure the model's accuracy, was 1691254; compared with the prices in the test
data, this appears low, meaning the model predicts most of the test data reasonably well. To
further increase the model's prediction accuracy, we should train it on additional, larger
datasets, which would help it learn more about how prices change across the different variables.

