
PRICE ANALYSIS OF BMW CARS IN DEALERSHIPS

PROJECT REPORT
SUBMITTED
BY
SHARAN CHAKRAVARTHY
19PGM45

BUSINESS ANALYTICS
PGDM-I YEAR

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Importance of Business Analytics
   1.2 Scope of Business Analytics
2. ABSTRACT
   2.1 Objective of the Project
   2.2 Source of Dataset
   2.3 Dataset Introduction
3. LOOKING INTO THE DATA
   Histograms
   Box Plots
   Treating Outliers
   Bar Plots
   Pie Chart
   GG Plots
4. MODELLING
   4.1 Linear Regression
   4.2 KNN
   4.3 Decision Tree
5. ACTIONABLE INSIGHTS AND RECOMMENDATIONS

1. INTRODUCTION:
The word "analytics" has come to the foreground in the last decade or so. The
proliferation of the internet and information technology has made analytics highly
relevant in the current age. Analytics is a field that combines data, information
technology, statistical analysis, quantitative methods and computer-based models.
Together, these give decision makers all the possible scenarios needed to make a
well-thought-out, researched decision. Computer-based models also let decision
makers see how a decision would perform under various scenarios.

APPLICATIONS:
Business analytics has a wide range of applications, including:
• Customer relationship management
• Financial management
• Marketing
• Supply chain management

1.1 IMPORTANCE OF BUSINESS ANALYTICS:

• Business analytics is a methodology and toolset for making sound commercial
decisions, so it affects the functioning of the whole organization. It can
therefore help improve the profitability of the business, increase market share
and revenue, and provide better returns to shareholders.
• It facilitates a better understanding of available primary and secondary data,
which in turn improves the operational efficiency of several departments.
• It provides a competitive advantage to companies. In this digital age the flow of
information is almost equal for all players; it is how this information is
utilized that makes a company competitive. Business analytics combines
available data with well-thought-out models to improve business decisions.
• It converts available data into valuable information. This information can
be presented in any required format, comfortable to the decision maker.

3
1.2 SCOPE OF BUSINESS ANALYTICS:

• Business analytics has a wide range of applications and usages. It can
be used for descriptive analysis, in which data is utilized to understand the
past and present situation. This kind of descriptive analysis is used to
assess the current market position of the company and the effectiveness of
previous business decisions.
• It is used for predictive analysis, which is typically used to forecast future
business performance.
• Business analytics is also used for prescriptive analysis, which is utilized
to formulate optimization techniques for stronger business performance.
• For example, business analytics can be used to determine the pricing of various
products in a departmental store based on past and present information.

2. ABSTRACT
This research work is based on the "BMW pricing challenge" data set. The purpose
of this research is to find out how the price of the cars sold in the market depends
on the other features of the car that customers consider when buying a car.
2.1 OBJECTIVE OF THE PROJECT:
The main objective of this research is to find out customers' preferences when
buying a car, based on the features provided, mileage driven, fuel type, colour
and car type.
2.2 SOURCE OF DATASET:

https://www.kaggle.com/danielkyrka/bmw-pricing-challenge/data

2.3 DATA SET INTRODUCTION

This data set consists of over 75 models of car, varying in 10 different colours and
8 car types. Each model differs in engine power and mileage driven, and the company
also provides 8 different features as add-ons to the car.

CONTINUOUS VARIABLES IN THE DATASET
1. Price of the car
2. Mileage of the car

CATEGORICAL VARIABLES IN THE DATASET
1. Car model
2. Car type
3. Features offered
4. Paint colour

DEPENDENT VARIABLE IN THE DATASET
Price of the car is considered the dependent variable.

INDEPENDENT VARIABLES IN THE DATASET
1. Car type
2. Fuel type
3. Model
4. Mileage driven
5. Colour options

3. LOOKING INTO THE DATA

Data visualization is the art of turning numbers into useful knowledge. R lets
you practise this art by offering a set of built-in functions and libraries to
build visualizations and present data. With the ever-increasing volume of data,
it is impossible to tell stories without visualizations. Before the technical
implementation of the visualizations, let's first see how to select the right
chart type.

Selecting the Right Chart Type

There are four basic presentation types:

1. Comparison
2. Composition
3. Distribution
4. Relationship

To determine which amongst these is best suited for your data, answer a few
questions such as:

• How many variables do you want to show in a single chart?
• How many data points will you display for each variable?
• Will you display values over a period of time, or among items or groups?

LOADING THE DATA


# Reading the data set
bmw_pricing_challenge <- read.csv(file.choose(), header = TRUE)
#To view the class of the data set
class(bmw_pricing_challenge)
[1] "data.frame"

> dim(bmw_pricing_challenge)
[1] 4843 18

> nrow(bmw_pricing_challenge)
[1] 4843

> ncol(bmw_pricing_challenge)
[1] 18

> names(bmw_pricing_challenge)
[1] "maker_key" "model_key" "mileage" "engine_power"
[5] "registration_date" "fuel" "paint_color" "car_type"
[9] "feature_1" "feature_2" "feature_3" "feature_4"
[13] "feature_5" "feature_6" "feature_7" "feature_8"
[17] "price" "sold_at"

> str(bmw_pricing_challenge)
'data.frame': 4843 obs. of 18 variables:
$ maker_key : Factor w/ 1 level "BMW": 1 1 1 1 1 1 1 1 1 1 ...
$ model_key : Factor w/ 75 levels "114","116","118",..: 3 64 22 32 34 29 24 3 75 22 ...
$ mileage : int 140411 13929 183297 128035 97097 152352 205219 115560 123886 139541 ...

$ engine_power : num 100 317 120 135 160 225 145 105 125 135 ...
$ registration_date: Factor w/ 199 levels "01 April 2001",..: 56 15 11 94 42 143 141 24 84 111 ...
$ fuel : Factor w/ 4 levels "diesel","electro",..: 1 4 1 1 1 4 1 4 4 1 ...
$ paint_color : Factor w/ 10 levels "beige","black",..: 2 6 10 8 9 2 6 10 2 10 ...
$ car_type : Factor w/ 8 levels "convertible",..: 1 1 1 1 1 1 1 1 1 1 ...
$ feature_1 : logi TRUE TRUE FALSE TRUE TRUE TRUE ...
$ feature_2 : logi TRUE TRUE FALSE TRUE TRUE TRUE ...
$ feature_3 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ feature_4 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ feature_5 : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
$ feature_6 : logi TRUE TRUE FALSE TRUE TRUE TRUE ...
$ feature_7 : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ feature_8 : logi FALSE TRUE FALSE TRUE TRUE TRUE ...
$ price : int 11300 69700 10200 25100 33400 17100 12400 6100 6200 17300 ...
$ sold_at : Factor w/ 9 levels "01 April 2018",..: 4 3 3 3 1 3 3 3 7 7 ...

> summary(bmw_pricing_challenge)
maker_key model_key mileage engine_power registration_date
BMW:4843 320 : 752 Min. : -64 Min. : 0 01 July 2013 : 173
520 : 633 1st Qu.: 102914 1st Qu.:100 01 March 2014 : 162
318 : 569 Median : 141080 Median :120 01 May 2014 : 153
X3 : 438 Mean : 140963 Mean :129 01 January 2013 : 148
116 : 358 3rd Qu.: 175196 3rd Qu.:135 01 September 2013: 148
X1 : 275 Max. :1000376 Max. :423 01 October 2013 : 146
(Other):1818 (Other) :3913
fuel paint_color car_type feature_1 feature_2
diesel :4641 black :1633 estate :1606 Mode :logical Mode :logical
electro : 3 grey :1175 sedan :1168 FALSE:2181 FALSE:1004
hybrid_petrol: 8 blue : 710 suv :1058 TRUE :2662 TRUE :3839
petrol : 191 white : 538 hatchback : 699
brown : 341 subcompact: 117
silver : 329 coupe : 104
(Other): 117 (Other) : 91
feature_3 feature_4 feature_5 feature_6 feature_7
Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:3865 FALSE:3881 FALSE:2613 FALSE:3674 FALSE:329
TRUE :978 TRUE :962 TRUE :2230 TRUE :1169 TRUE :4514

feature_8 price sold_at


Mode :logical Min. : 100 01 May 2018 :809
FALSE:2223 1st Qu.: 10800 01 March 2018 :739
TRUE :2620 Median : 14200 01 April 2018 :693
Mean : 15828 01 June 2018 :604
3rd Qu.: 18600 01 July 2018 :537
Max. :178500 01 August 2018:528
(Other) :933

CONVERSION OF VARIABLES:

bmw_pricing_challenge$maker_key <- as.numeric(bmw_pricing_challenge$maker_key)


bmw_pricing_challenge$engine_power <- as.numeric(bmw_pricing_challenge$engine_power)
bmw_pricing_challenge$model_key <- as.numeric(bmw_pricing_challenge$model_key)
bmw_pricing_challenge$feature_1 <- as.factor(bmw_pricing_challenge$feature_1)
bmw_pricing_challenge$feature_2 <- as.factor(bmw_pricing_challenge$feature_2)
bmw_pricing_challenge$feature_3 <- as.factor(bmw_pricing_challenge$feature_3)
bmw_pricing_challenge$feature_4 <- as.factor(bmw_pricing_challenge$feature_4)
bmw_pricing_challenge$feature_5 <- as.factor(bmw_pricing_challenge$feature_5)
bmw_pricing_challenge$feature_6 <- as.factor(bmw_pricing_challenge$feature_6)
bmw_pricing_challenge$feature_7 <- as.factor(bmw_pricing_challenge$feature_7)
bmw_pricing_challenge$feature_8 <- as.factor(bmw_pricing_challenge$feature_8)
bmw_pricing_challenge$sold_at <- as.numeric(bmw_pricing_challenge$sold_at)
bmw_pricing_challenge$mileage <-as.numeric(bmw_pricing_challenge$mileage)
bmw_pricing_challenge$price <-as.numeric(bmw_pricing_challenge$price)
bmw_pricing_challenge$car_type<-as.factor(bmw_pricing_challenge$car_type)
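One caveat worth noting about conversions like the ones above (a generic R illustration, not the project data): as.numeric() applied to a factor returns the internal level codes, not the printed values, so numeric-looking factors need an extra as.character() step:

```r
f <- factor(c("10", "20", "20", "30"))

as.numeric(f)                  # level codes: 1 2 2 3 -- usually not what you want
as.numeric(as.character(f))    # the actual values: 10 20 20 30

g <- factor(c("diesel", "petrol", "diesel"))
as.numeric(g)                  # 1 2 1 -- fine when only an arbitrary encoding is needed
```

For purely categorical columns such as fuel or car type, the level codes are an acceptable encoding; for columns that carry numeric meaning, the as.character() route is the safe one.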

VISUALIZING THE DATA

Histograms and Boxplots

A histogram represents the frequencies of values of a variable bucketed into
ranges. A histogram is similar to a bar chart, but it groups the values into
continuous ranges. Each bar in a histogram represents the count of values falling
within that range. R creates a histogram using the hist() function.
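Before applying it to the project data, hist() can be sketched on simulated data (the values below are simulated prices, not the BMW data):

```r
set.seed(1)
prices <- rlnorm(1000, meanlog = 9.5, sdlog = 0.5)  # simulated right-skewed prices

# breaks controls the number of buckets; the return value holds the bin counts
h <- hist(prices, breaks = 20, main = "Simulated price distribution",
          xlab = "price", col = "lightblue")
sum(h$counts)   # every observation falls in exactly one bucket, so this is 1000
```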
HISTOGRAMS:
1. hist(bmw_pricing_challenge$price)

INTERPRETATION:

Here price is taken as the histogram variable, since histograms are plotted only
for continuous variables.
1. We can infer from the above histogram that the prices of the cars vary from 0
to 2,800; considering this, more than 80% of the cars have a price value of 2,800.
2. 30% of the cars have a price of 900.
3. The lowest price is taken as 0, as such cars can be disposed of as scrap.

2. hist(bmw_pricing_challenge$mileage)

INTERPRETATION:
• Since these cars are second-hand sales, we need to consider the mileage
driven by the cars.
• The highest driven mileage is 3,000 miles and the lowest is 200 miles.
• Cars with higher mileage driven are considered to have a lower value than
cars with low mileage.

BOXPLOTS:
1. boxplot(bmw_pricing_challenge$price)

INTERPRETATION
• The prices of the cars vary within a small margin, i.e. about 8,500 to 12,000.
• There are many outliers on the upper side; these can be taken to be cars with
less mileage and in the SUV segment.
• There is no outlier with a lower value, which shows that cars have a minimum
base value even as scrap.

2. boxplot(bmw_pricing_challenge$mileage)

INTERPRETATION
• The mileage of the cars ranges from 400 to 10 lakh miles.
• The majority of the cars being sold have mileage between 5 and 6 lakh miles.
• There are even cars with mileage above 8 lakh miles.
• The outliers are cars with high mileage; they appear as outliers in both
price and mileage, because cars with high mileage tend to have a low price at
the dealership.
3. boxplot(bmw_pricing_challenge$sold_at)
TREATING OUTLIERS
1. TREATING OUTLIER IN PRICE VARIATION

boxplot(bmw_pricing_challenge$price, horizontal = TRUE)
x <- bmw_pricing_challenge$price
qnt <- quantile(x, probs = c(.25, .75), na.rm = T)
caps <- quantile(x, probs = c(.05, .96), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
bmw_pricing_challenge$price <- x
boxplot(bmw_pricing_challenge$price, main = "price variation", horizontal = TRUE,
        col = "green")

2. TREATING OUTLIERS IN MILEAGE

boxplot(bmw_pricing_challenge$mileage, horizontal = TRUE)
x <- bmw_pricing_challenge$mileage
qnt <- quantile(x, probs = c(.25, .75), na.rm = T)
caps <- quantile(x, probs = c(.05, .96), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
bmw_pricing_challenge$mileage <- x
boxplot(bmw_pricing_challenge$mileage, main = "mileage", horizontal = TRUE,
        col = "pink")
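The same capping recipe is applied twice above, so it can be wrapped in a small reusable helper. A minimal sketch on a synthetic vector (the cap_outliers name is illustrative; the 5th/96th capping percentiles mirror the code above):

```r
# Cap values lying outside the 1.5*IQR whiskers at chosen percentiles
cap_outliers <- function(x, probs = c(0.05, 0.96)) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  caps <- quantile(x, probs = probs, na.rm = TRUE)
  h    <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < qnt[1] - h] <- caps[1]   # floor low outliers at the lower percentile
  x[x > qnt[2] + h] <- caps[2]   # cap high outliers at the upper percentile
  x
}

x <- c(10, 12, 11, 13, 9, 11, 500)   # 500 is an obvious outlier
capped <- cap_outliers(x)
max(capped) < 500                     # the outlier has been pulled in: TRUE
```

With such a helper, the price and mileage treatments each become a one-liner, e.g. bmw_pricing_challenge$price <- cap_outliers(bmw_pricing_challenge$price).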

BARPLOTS
1. Bar plot for fuel preference by customers

fuel <- table(bmw_pricing_challenge$fuel)

barplot(fuel)
INTERPRETATION
• More diesel cars are brought to dealerships.
• Petrol cars lie in second position.
• Electric and hybrid petrol cars are the least preferred.

2. Plot for colour preference by customers

paint <- table(bmw_pricing_challenge$paint_color)
barplot(paint, xlab = "colours available", ylab = "opted",
        main = "car colours offered", legend = rownames(paint),
        col = c("beige", "black", "blue", "brown", "green", "grey",
                "orange", "red", "yellow", "violet"))

INTERPRETATION
• Black cars are the most common in dealerships, and customer preference
leans towards them.
• Orange, green and beige are the least likely colours.
• Grey is the most liked colour after black.
• Blue, white, silver and brown are moderately preferred colours.

3. Plot for car types most preferred by customers

cartype <- table(bmw_pricing_challenge$car_type)
barplot(cartype, xlab = "types of car", ylab = "opted",
        main = "types of cars in interest", legend = rownames(cartype),
        col = c("green", "red", "yellow", "blue", "pink", "magenta",
                "beige", "orange"))

INTERPRETATION
• Estate cars are the most preferred by customers.
• Sedan and SUV class cars lie next to estate cars.
• Convertibles and vans are the least preferred.
• Sub-compact and coupe cars are moderately preferred.
3D PIE CHART

LIBRARY INITIALISATION
install.packages("plotrix")
library(plotrix)

slices <- table(bmw_pricing_challenge$car_type)
pct <- round(slices / sum(slices) * 100)
pct
# Build one label per car type instead of hard-coding five names for eight slices
lbls <- paste0(names(slices), " ", pct, "%")
pie3D(slices, labels = lbls, col = rainbow(length(slices)), explode = 0.0,
      main = "3D Pie chart")

INTERPRETATION
• Sedans are the most likely to be brought to the dealership.
• Vans take the second-largest share of cars coming to dealerships.
• Estate cars are in a middle position, but have a high count as seen in the
previous plot.
• Coupes and convertibles have the least inbound to the dealership.

GGPLOTS

LIBRARY INITIALISATION
install.packages("tidyverse")
library(tidyverse)

Plot comparing model, mileage and engine power

ggplot(data = bmw_pricing_challenge) +
  geom_point(mapping = aes(x = model_key, y = mileage, colour = engine_power))

INTERPRETATION:
• Cars in the dealership show no balanced pattern between engine power and
mileage.
• Cars with high engine power are more likely to be the least driven.
• Cars with low engine power are scattered all along the mileage range.

Plot comparing price, mileage and engine power

ggplot(data = bmw_pricing_challenge) +
  geom_point(mapping = aes(x = price, y = mileage, colour = engine_power))

INTERPRETATION
• Cars with low engine power have high mileage.
• There are a few cars with low power and low mileage, which can be
considered outliers.
• There are more cars with mid and high power, mostly with medium mileage
driven.
• As the plot shows, low-power cars are widely used and then returned to the
dealer.

Plot comparing fuel and engine power
ggplot(data = bmw_pricing_challenge) +
  geom_point(mapping = aes(x = fuel, y = engine_power, colour = mileage))

INTERPRETATION
• Mid-range power applies to all cars irrespective of the fuel used.
• Petrol cars have high power.
• Diesel cars tend towards low and mid-range power.
• Hybrid cars span low to high engine power, as the plot clearly shows.
• Certain diesel cars show high mileage irrespective of their high engine
power.

4. MODELLING
4.1 LINEAR REGRESSION
Creating a Linear Regression in R
Linear regression assumes that there is a linear relationship between the
response variable and the explanatory variables; this means that you can fit a
line between the two (or more) variables. A linear regression can be calculated
in R with the command lm().
Step 1: Load the data into R.
Step 2: Make sure your data meet the assumptions.
Step 3: Perform the linear regression analysis.
Step 4: Check for homoscedasticity.
Step 5: Visualize the results with a graph.
Step 6: Report your results.
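Before running lm() on the BMW data, the workflow above can be sketched on R's built-in mtcars data set (mpg and wt are mtcars columns, not project variables):

```r
# Fit fuel efficiency against weight: mpg = b0 + b1 * wt
fit <- lm(mpg ~ wt, data = mtcars)

summary(fit)          # coefficients, R-squared, p-values
coef(fit)             # intercept and slope

plot(mtcars$wt, mtcars$mpg)
abline(fit, col = "red")                    # fitted line over the scatter plot

predict(fit, newdata = data.frame(wt = 3))  # predicted mpg for a 3,000 lb car
```

The same pattern carries over to the project data, with price as the response and mileage (or engine power) as the explanatory variable.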

bmw_pricing_challenge <- read.csv(file.choose(),header = TRUE)


attach(bmw_pricing_challenge)
names(bmw_pricing_challenge)
class(price)
bmw_pricing_challenge$price <-as.numeric(bmw_pricing_challenge$price)
class(car_type)
class(mileage)
plot(bmw_pricing_challenge$engine_power,bmw_pricing_challenge$price)
cor(bmw_pricing_challenge$engine_power,bmw_pricing_challenge$price)
help(lm)
?lm
mod <- lm(bmw_pricing_challenge$engine_power ~ bmw_pricing_challenge$price)
mod <- lm(bmw_pricing_challenge$price ~ bmw_pricing_challenge$mileage)
summary(mod)
attributes(mod)

mod$coefficients
mod$coef
coef(mod)

4.2 KNN:
The KNN or k-nearest neighbors algorithm is one of the simplest machine
learning algorithms and is an example of instance-based learning, where new
data are classified based on stored, labeled instances.
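As a minimal, self-contained sketch of the algorithm on R's built-in iris data (iris and its columns are not part of this project's data set):

```r
library(class)

set.seed(42)
idx   <- sample(nrow(iris), 100)                 # 100 training rows, 50 test rows
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]                       # stored labels for training rows

pred <- knn(train, test, cl, k = 3)              # each test row takes the majority
                                                 # label of its 3 nearest neighbours
mean(pred == iris$Species[-idx])                 # proportion classified correctly
```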

Using R For k-Nearest Neighbors (KNN)


library(caret)
library(class)
data <- read.csv(file.choose(), header = T)
str(data)
filter_games <- na.omit(data)
# Binary target: is the car priced above 8,600 or not?
filter_games$price <- as.factor(ifelse(filter_games$price <= 8600, "No", "Yes"))
# knn() needs an all-numeric predictor matrix, so every factor/logical
# column is recoded as a number
filter_games$maker_key <- as.numeric(as.factor(filter_games$maker_key))
filter_games$model_key <- as.numeric(as.factor(filter_games$model_key))
filter_games$engine_power <- as.numeric(filter_games$engine_power)
filter_games$registration_date <- as.numeric(as.factor(filter_games$registration_date))
filter_games$fuel <- as.numeric(as.factor(filter_games$fuel))
filter_games$paint_color <- as.numeric(as.factor(filter_games$paint_color))
filter_games$car_type <- as.numeric(as.factor(filter_games$car_type))
for (f in paste0("feature_", 1:8)) {
  filter_games[[f]] <- as.numeric(filter_games[[f]])
}
filter_games$sold_at <- as.numeric(as.factor(filter_games$sold_at))
filter_games$mileage <- as.numeric(filter_games$mileage)
trainIndex <- createDataPartition(filter_games$price, p = 0.80, list = F)
train_data <- filter_games[trainIndex, ]
test_df <- filter_games[-trainIndex, ]
summary(train_data)
summary(test_df)
predictors <- names(filter_games) != "price"   # exclude the target itself
ml <- knn(train = train_data[, predictors], test = test_df[, predictors],
          cl = train_data$price, k = 15)
test_df <- data.frame(test_df, ml)
AppData_ConfusionMatrix <- confusionMatrix(test_df$ml, test_df$price)
AppData_ConfusionMatrix

INTERPRETATION
• The accuracy of the model gives its overall predicted accuracy; the accuracy of
the current model is 99.95%.
• Sensitivity represents the proportion of positives correctly returned by the
model; the sensitivity of the current model is 100%.
• Specificity represents the proportion of negatives correctly returned by the
model; the specificity of the current model is 100%.
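These three metrics all come from the 2x2 confusion matrix; here is a hand-computed sketch with made-up counts (not this model's actual output):

```r
# Rows = predicted, columns = actual; the counts are illustrative
cm <- matrix(c(50,  2,   # predicted No : 50 true negatives, 2 false negatives
                3, 45),  # predicted Yes: 3 false positives, 45 true positives
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # (50 + 45) / 100 = 0.95
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # TP / (TP + FN) = 45 / 47
specificity <- cm["No",  "No"]  / sum(cm[, "No"])   # TN / (TN + FP) = 50 / 53
c(accuracy, sensitivity, specificity)
```

Note that caret's confusionMatrix() also reports which factor level it treats as the "positive" class; the definitions above take "Yes" as positive.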

4.3 DECISION TREE

library(tree)
data1 <- read.csv(file.choose(), header = TRUE)
summary(data1)
names(data1)
data1$engine_power <- as.numeric(data1$engine_power)
data1$sold_at <- as.numeric(as.factor(data1$sold_at))
data1$mileage <- as.numeric(data1$mileage)
data1$price <- as.numeric(data1$price)
data1$fuel <- as.factor(data1$fuel)
data1$paint_color <- as.factor(data1$paint_color)
data1$car_type <- as.factor(data1$car_type)
for (f in paste0("feature_", 1:8)) data1[[f]] <- as.factor(data1[[f]])
# Binary target, using the same 8,600 price cut-off as the KNN section
buy <- as.factor(ifelse(data1$price <= 8600, "No", "Yes"))
data <- data.frame(data1, buy)
names(data)
# Drop columns the tree cannot or should not use: the constant maker_key,
# the high-cardinality model_key and registration_date factors (tree()
# allows at most 32 levels per factor), and price, which buy is derived from
data <- data[, !(names(data) %in%
                 c("maker_key", "model_key", "registration_date", "price"))]
tree.data <- tree(buy ~ ., data = data)
plot(tree.data)
text(tree.data, pretty = 0)
set.seed(2)
train <- sample(1:nrow(data), nrow(data) / 2)
test <- -train
training_data <- data[train, ]
testing_data <- data[test, ]
testing_buy <- data$buy[test]
tree_model <- tree(buy ~ ., training_data)
plot(tree_model)
text(tree_model, pretty = 0)
tree_pred <- predict(tree_model, testing_data, type = "class")
mean(tree_pred != testing_buy)   # test-set misclassification rate
set.seed(3)
cv_tree <- cv.tree(tree_model, FUN = prune.misclass)
cv_tree
plot(cv_tree$size, cv_tree$dev, type = "b")
pruned_model <- prune.misclass(tree_model, best = 4)
plot(pruned_model)
text(pruned_model, pretty = 0)
tree_pred <- predict(pruned_model, testing_data, type = "class")
mean(tree_pred != testing_buy)

OUTPUT

Which is the better predictor model?

• According to the dataset and its analysis, there is an error in the regression
model. So, to choose between the two remaining models:
• Decision tree accuracy: after pruning, the error rate is higher.
• KNN accuracy: 99.5%.
• From this it can be concluded that, since KNN has the higher accuracy, it is
the best predictor among the remaining models.

5. ACTIONABLE INSIGHTS AND RECOMMENDATIONS
• This dataset is about BMW's price fixation on second-hand cars.

• Only a minimal number of the cars purchased had mileage below 500,000.

• Cars purchased through second-hand dealers were mostly diesel, with low and
medium power.

• Hybrid petrol and electric cars were sold, but even at low prices their sales
were also low.
