PROJECT REPORT
SUBMITTED
BY
SHARAN CHAKRAVARTHY
19PGM45
BUSINESS ANALYTICS
PGDM-I YEAR
TABLE OF CONTENTS
S. NO PARTICULARS
1 INTRODUCTION
1.1 Importance of Data Analytics
1.2 Scope of Business Analytics
2 ABSTRACT
2.1 Objective of the Project
2.2 Source of Dataset
2.3 Dataset Introduction
1. INTRODUCTION:
The word analytics has come to the foreground in the last decade or so. The
proliferation of the internet and information technology has made analytics
highly relevant in the current age. Analytics is a field that combines data,
information technology, statistical analysis, quantitative methods and
computer-based models. Together, these give decision makers a view of the
possible scenarios so that they can make a well-thought-out and researched
decision. The computer-based models let decision makers see how a decision
would perform under various scenarios.
APPLICATION:
Business analytics has a wide range of applications, including:
Customer relationship management
Financial management
Marketing
Supply chain management
1.2 SCOPE OF BUSINESS ANALYTICS:
Business analytics has a wide range of applications and uses. It can
be used for descriptive analysis, in which data is used to understand the
past and present situation. This kind of analysis is used to assess the
current market position of the company and the effectiveness of
previous business decisions.
It is used for predictive analysis, which typically uses previous
business performance to forecast future outcomes.
Business analytics is also used for prescriptive analysis, which is used
to formulate optimization techniques for stronger business performance.
2. ABSTRACT
This research work is based on the BMW pricing challenge data set. Its purpose
is to estimate the price of the used cars sold in the market from the other
features of each car that customers consider when buying.
2.1 OBJECTIVE OF THE PROJECT:
The main objective of this research is to identify customers' preferences when
buying a car, based on the features provided, mileage driven, fuel type,
colour and car type.
2.2 SOURCE OF DATASET:
https://www.kaggle.com/danielkyrka/bmw-pricing-challenge/data
2.3 DATASET INTRODUCTION:
This data set covers 75 car models in 10 different colours and 8 car types. Engine power and
mileage vary from car to car, and the company also provides 8 different features as add-ons
to the car.
R offers a set of built-in functions and libraries for building visualizations
and presenting data. Before the technical implementation of the
visualizations, let us first see how to select the right chart type.
There are four basic presentation types:
1. Comparison
2. Composition
3. Distribution
4. Relationship
To determine which of these is best suited to your data, it helps to first
answer a few questions about the data set, starting with its size and structure:
> dim(bmw_pricing_challenge)
[1] 4843 18
> nrow(bmw_pricing_challenge)
[1] 4843
> ncol(bmw_pricing_challenge)
[1] 18
> names(bmw_pricing_challenge)
[1] "maker_key" "model_key" "mileage" "engine_power"
[5] "registration_date" "fuel" "paint_color" "car_type"
[9] "feature_1" "feature_2" "feature_3" "feature_4"
[13] "feature_5" "feature_6" "feature_7" "feature_8"
[17] "price" "sold_at"
> str(bmw_pricing_challenge)
'data.frame': 4843 obs. of 18 variables:
$ maker_key : Factor w/ 1 level "BMW": 1 1 1 1 1 1 1 1 1 1 ...
$ model_key : Factor w/ 75 levels "114","116","118",..: 3 64 22 32 34 29 24 3 75 22 ...
$ mileage : int 140411 13929 183297 128035 97097 152352 205219 115560 123886 139541 ...
$ engine_power : num 100 317 120 135 160 225 145 105 125 135 ...
$ registration_date: Factor w/ 199 levels "01 April 2001",..: 56 15 11 94 42 143 141 24 84 111 ...
$ fuel : Factor w/ 4 levels "diesel","electro",..: 1 4 1 1 1 4 1 4 4 1 ...
$ paint_color : Factor w/ 10 levels "beige","black",..: 2 6 10 8 9 2 6 10 2 10 ...
$ car_type : Factor w/ 8 levels "convertible",..: 1 1 1 1 1 1 1 1 1 1 ...
$ feature_1 : logi TRUE TRUE FALSE TRUE TRUE TRUE ...
$ feature_2 : logi TRUE TRUE FALSE TRUE TRUE TRUE ...
$ feature_3 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ feature_4 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ feature_5 : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
$ feature_6 : logi TRUE TRUE FALSE TRUE TRUE TRUE ...
$ feature_7 : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ feature_8 : logi FALSE TRUE FALSE TRUE TRUE TRUE ...
$ price : int 11300 69700 10200 25100 33400 17100 12400 6100 6200 17300 ...
$ sold_at : Factor w/ 9 levels "01 April 2018",..: 4 3 3 3 1 3 3 3 7 7 ...
> summary(bmw_pricing_challenge)
maker_key model_key mileage engine_power registration_date
BMW:4843 320 : 752 Min. : -64 Min. : 0 01 July 2013 : 173
520 : 633 1st Qu.: 102914 1st Qu.:100 01 March 2014 : 162
318 : 569 Median : 141080 Median :120 01 May 2014 : 153
X3 : 438 Mean : 140963 Mean :129 01 January 2013 : 148
116 : 358 3rd Qu.: 175196 3rd Qu.:135 01 September 2013: 148
X1 : 275 Max. :1000376 Max. :423 01 October 2013 : 146
(Other):1818 (Other) :3913
fuel paint_color car_type feature_1 feature_2
diesel :4641 black :1633 estate :1606 Mode :logical Mode :logical
electro : 3 grey :1175 sedan :1168 FALSE:2181 FALSE:1004
hybrid_petrol: 8 blue : 710 suv :1058 TRUE :2662 TRUE :3839
petrol : 191 white : 538 hatchback : 699
brown : 341 subcompact: 117
silver : 329 coupe : 104
(Other): 117 (Other) : 91
feature_3 feature_4 feature_5 feature_6 feature_7
Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:3865 FALSE:3881 FALSE:2613 FALSE:3674 FALSE:329
TRUE :978 TRUE :962 TRUE :2230 TRUE :1169 TRUE :4514
CONVERSION OF VARIABLES:
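As an illustration of the kind of conversion this step involves, the sketch below turns a date stored as a factor into a proper Date and derives numeric columns from an integer and a factor. The three-row data frame cars_demo is hypothetical and only mimics the column types shown in the str() output above; it is not the author's actual conversion code.

```r
# Hypothetical mini data frame mimicking the column types in the str() output
cars_demo <- data.frame(
  registration_date = factor(c("01 April 2001", "01 July 2013", "01 March 2014")),
  fuel = factor(c("diesel", "petrol", "diesel")),
  mileage = c(140411L, 13929L, 183297L)
)

# Month names are parsed according to the locale; force English names (assumption)
invisible(Sys.setlocale("LC_TIME", "C"))

# Factor -> character -> Date ("%d %B %Y" matches e.g. "01 April 2001")
cars_demo$registration_date <- as.Date(as.character(cars_demo$registration_date),
                                       format = "%d %B %Y")

# Integer -> numeric, and factor -> integer level codes
cars_demo$mileage <- as.numeric(cars_demo$mileage)
cars_demo$fuel_code <- as.integer(cars_demo$fuel)

str(cars_demo)
```

Converting through as.character() first matters: calling as.numeric() directly on a factor returns the level codes, not the values printed on screen.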
A histogram represents the frequencies of the values of a variable bucketed into
ranges. A histogram is similar to a bar chart, but the difference is that it
groups the values into continuous ranges. The height of each bar represents
the number of values present in that range. R creates histograms using the
hist() function.
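A minimal sketch of how hist() buckets values, using synthetic data rather than the BMW data set (price_demo and the chosen distribution are assumptions made only for illustration):

```r
set.seed(1)
# Synthetic right-skewed "prices" (illustrative only, not the BMW data)
price_demo <- round(rlnorm(500, meanlog = 9.5, sdlog = 0.5))

# breaks = 20 is only a suggestion; R picks "pretty" bin edges near that number
h <- hist(price_demo, breaks = 20, main = "Distribution of price",
          xlab = "Price", col = "grey")

h$breaks  # bin edges
h$counts  # number of observations that fall in each bin
```

The object returned by hist() holds the bin edges and counts, so the figures read off the plot can be checked against the numbers directly.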
HISTOGRAMS :
1. hist(bmw_pricing_challenge$price)
INTERPRETATION:
2. hist(bmw_pricing_challenge$mileage)
INTERPRETATION:
Since the cars are sold second-hand, the mileage driven must be taken
into account.
According to the summary output, recorded mileage ranges from -64
(evidently a data-entry error) to 1,000,376, with a median of 141,080.
Cars with higher mileage are considered to have a lower value than
cars with low mileage.
BOXPLOTS:
1. boxplot(bmw_pricing_challenge$price)
INTERPRETATION
Car prices vary within a narrow margin, i.e. roughly 1,200 to 8,500.
There are many high-price outliers; these are likely to be cars with
low mileage or cars in the SUV segment.
There are no outliers on the low side, which shows that cars retain a
minimum base value even as scrap.
2. boxplot(bmw_pricing_challenge$mileage)
INTERPRETATION
Recorded mileage runs from close to zero up to about 10 lakh
(1,000,376 in the summary output).
Most of the cars sold have a mileage between about 1 and 1.75 lakh
(the first and third quartiles are 102,914 and 175,196).
There are even cars with a mileage above 8 lakh.
The outliers are cars with very high mileage. Because cars with high
mileage tend to fetch a low price at dealerships, such cars can stand
out in both price and mileage.
3. boxplot(bmw_pricing_challenge$sold_at)
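The points that a boxplot draws individually can also be listed numerically. The sketch below uses synthetic mileage values (mileage_demo is an assumption, not the report's data) and shows both boxplot.stats() and the 1.5 * IQR rule written out by hand.

```r
set.seed(42)
# Synthetic mileage values plus two extreme readings (illustrative only)
mileage_demo <- c(rnorm(200, mean = 140000, sd = 30000), 800000, 1000000)

stats <- boxplot.stats(mileage_demo)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points drawn individually as outliers

# The same idea by hand: points beyond 1.5 * IQR from the quartiles.
# boxplot.stats() uses hinges rather than quantile(), so borderline
# points can occasionally differ between the two approaches.
q <- quantile(mileage_demo, c(0.25, 0.75))
fence <- 1.5 * IQR(mileage_demo)
manual_out <- mileage_demo[mileage_demo < q[1] - fence |
                           mileage_demo > q[2] + fence]
manual_out
```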
TREATING OUTLIERS
1. TREATING OUTLIER IN PRICE VARIATION
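# x, qnt and caps were not defined in the original listing; the definitions
# below are assumptions (quartiles for the fences, 5th/95th percentiles as caps)
x <- bmw_pricing_challenge$price
qnt <- quantile(x, probs = c(.25, .75), na.rm = T)
caps <- quantile(x, probs = c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
# Replace values beyond the 1.5 * IQR fences with the cap values
x[x<(qnt[1]- H)] <- caps[1]
x[x>(qnt[2]+ H)] <- caps[2]
bmw_pricing_challenge$price <- x
boxplot(bmw_pricing_challenge$price, main="price variation", horizontal = TRUE, col="green")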
BARPLOTS
1. Bar plot of fuel type preference among customers
fuel <- table(bmw_pricing_challenge$fuel)
barplot(fuel)
INTERPRETATION
Diesel cars are by far the most common at dealerships (4,641 of 4,843).
Petrol cars are in second position (191).
Electric and hybrid petrol cars are the least preferred.
INTERPRETATION
Black cars are the most common at dealerships, which suggests that
customers prefer this colour.
Orange, green and beige are the least common colours.
Grey is the most popular colour after black.
Blue, white, silver and brown are moderately preferred.
cartype <- table(bmw_pricing_challenge$car_type)
barplot(cartype, xlab = "types of car", ylab = "opted", main = "types of cars in interest",
        legend = rownames(cartype),
        col = c("green","red","yellow","blue","pink","magenta","beige","orange"))
INTERPRETATION
Estate cars are the most preferred by customers.
Sedan and SUV class cars come next after estate cars.
Convertible cars and vans are the least preferred.
Sub-compact and coupe cars are moderately preferred.
3D PIE CHART
LIBRARY INITIALISATION
install.packages("plotrix")
library(plotrix)
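slices <- table(bmw_pricing_challenge$car_type)
pct <- round(slices / sum(slices) * 100)
pct
# Take the labels from the table itself so that each of the 8 car types
# gets its own name (a fixed vector of 5 names would be recycled and mislabel slices)
lbls <- paste(names(slices), " ", pct, "%", sep = "")
pie3D(slices, labels = lbls, col = rainbow(length(slices)), explode = 0.0, main = "3D Pie chart")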
INTERPRETATION
Estate cars make up the largest share of cars at the dealership
(1,606 of 4,843, about 33%).
Sedans come second (about 24%), followed by SUVs (about 22%) and
hatchbacks (about 14%).
Subcompact, coupe, convertible and van body types have the smallest
shares.
These shares agree with the bar plot of car types above.
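The share of each car type can be checked with a few lines of arithmetic. The counts below are copied from the summary() output earlier in this report, where "(Other)" groups the remaining car types; the variable names are illustrative.

```r
# Counts per car type as printed by summary(); "other" groups the remaining levels
car_type_counts <- c(estate = 1606, sedan = 1168, suv = 1058, hatchback = 699,
                     subcompact = 117, coupe = 104, other = 91)

sum(car_type_counts)  # should equal the number of rows in the data set

# Percentage share of each car type, rounded to one decimal place
share_pct <- round(100 * prop.table(car_type_counts), 1)
share_pct
```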
GGPLOTS
LIBRARY INITIALISATION
install.packages("tidyverse")
library(tidyverse)
Plot comparing model, mileage and engine power
INTERPRETATION:
Cars at the dealership show no clear pattern between engine power
and mileage.
Cars with high engine power tend to be the least driven.
Cars with low engine power are scattered across the whole mileage
range.
ggplot(data = bmw_pricing_challenge) +
  geom_point(mapping = aes(x = price, y = mileage, colour = engine_power))
INTERPRETATION
Cars with low engine power tend to have high mileage.
A few cars have both low power and low mileage; these can be
considered outliers.
Many cars have medium or high power together with medium
mileage.
Overall, the plot suggests that low-power cars are the most heavily
used before being returned to the dealer.
Plot comparing fuel and engine power
ggplot(data = bmw_pricing_challenge) +
  geom_point(mapping = aes(x = fuel, y = engine_power, colour = mileage))
INTERPRETATION
Mid-range engine power is found across all cars irrespective of the fuel used.
Petrol cars have high power.
Diesel cars mostly have low to mid-range power.
Hybrid cars range from low to high engine power, as the plot clearly
shows.
Certain diesel cars show high mileage despite their high
engine power.
4. MODELLING
4.1 LINEAR REGRESSION
Creating a linear regression in R.
Linear regression assumes that there exists a linear relationship
between the response variable and the explanatory variables. This means that
you can fit a line between the two (or more) variables. ... A linear
regression can be calculated in R with the command lm().
Step 1: Load the data into R.
Step 2: Make sure your data meet the assumptions.
Step 3: Perform the linear regression analysis.
Step 4: Check for homoscedasticity.
Step 5: Visualize the results with a graph.
Step 6: Report your results.
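# Three equivalent ways to extract the coefficients of a fitted model object mod
mod$coefficients
mod$coef          # works through partial matching of the element name
coef(mod)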
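The steps listed above can be sketched as follows. The data here are synthetic and the coefficients used to generate them are assumptions chosen for illustration; this is not the model fitted in this report, only a minimal example of lm() with price as the response.

```r
set.seed(123)
n <- 200
# Synthetic data: price falls with mileage and rises with engine power (assumed)
demo <- data.frame(
  mileage = runif(n, 20000, 250000),
  engine_power = runif(n, 80, 300)
)
demo$price <- 20000 - 0.05 * demo$mileage + 60 * demo$engine_power +
  rnorm(n, sd = 2000)

# Step 3: fit the linear model
fit <- lm(price ~ mileage + engine_power, data = demo)

coef(fit)      # intercept and slopes
summary(fit)   # standard errors, R-squared, p-values

# Step 4: residuals vs fitted values; a funnel shape suggests heteroscedasticity
plot(fitted(fit), resid(fit), xlab = "Fitted price", ylab = "Residual")
abline(h = 0, lty = 2)

# Predict the price of a hypothetical car
predict(fit, newdata = data.frame(mileage = 100000, engine_power = 150))
```

Each slope is read as the expected change in price for a one-unit change in that variable while the other is held fixed.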
4.2 KNN:
The KNN or k-nearest neighbors algorithm is one of the simplest machine
learning algorithms and is an example of instance-based learning, where new
data are classified based on stored, labeled instances.
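A minimal sketch of KNN classification, assuming the class package (which provides knn()) is available. It uses R's built-in iris data rather than the data of this report, because KNN classification needs a categorical response; the split size and k = 5 are arbitrary choices.

```r
library(class)  # provides knn()

set.seed(1)
idx <- sample(nrow(iris), 100)          # 100 rows for training, 50 for testing
train_x <- scale(iris[idx, 1:4])
# Scale the test set with the training means and standard deviations
test_x <- scale(iris[-idx, 1:4],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

# Each test row gets the majority class of its 5 nearest training rows
pred <- knn(train = train_x, test = test_x, cl = iris$Species[idx], k = 5)

table(predicted = pred, actual = iris$Species[-idx])
accuracy <- mean(pred == iris$Species[-idx])
accuracy
```

Scaling matters because KNN relies on distances: a variable measured in large units would otherwise dominate the choice of neighbours.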
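library(caret)  # provides confusionMatrix()
# confusionMatrix(data, reference): predicted classes first, true classes second
AppData_ConfusionMatrix <- confusionMatrix(test_df$ml, test_df$suicides_no)
AppData_ConfusionMatrix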
INTERPRETATION
Accuracy is the proportion of all cases that the model classifies correctly; the accuracy of
the current model is 99.95%.
Sensitivity is the proportion of actual positives that the model identifies correctly; the
sensitivity of the current model is 100%.
Specificity is the proportion of actual negatives that the model identifies correctly; the
specificity of the current model is 100%.
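How these three measures are computed can be shown with a small worked example in base R. The ten labels below are hypothetical and unrelated to the model above; "yes" is treated as the positive class.

```r
# Hypothetical true labels and predictions (not the report's data)
actual    <- factor(c("yes","yes","yes","yes","no","no","no","no","no","no"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes","yes","yes","no","no","no","no","no","yes","no"),
                    levels = c("yes", "no"))

cm <- table(predicted, actual)
cm

tp <- cm["yes", "yes"]; fn <- cm["no", "yes"]   # actual positives
tn <- cm["no", "no"];   fp <- cm["yes", "no"]   # actual negatives

accuracy    <- (tp + tn) / sum(cm)   # share of all cases classified correctly
sensitivity <- tp / (tp + fn)        # share of actual positives detected
specificity <- tn / (tn + fp)        # share of actual negatives detected
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```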
4.3 DECISION TREE
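library(tree)  # tree(), cv.tree(), prune.misclass()
# Load the data set; file.choose() opens a file picker
bmw_pricing_challenge <- read.csv(file.choose(), header = TRUE, stringsAsFactors = TRUE)
summary(bmw_pricing_challenge)
names(bmw_pricing_challenge)
# Convert the columns used by the model to numeric
bmw_pricing_challenge$engine_power <- as.numeric(bmw_pricing_challenge$engine_power)
bmw_pricing_challenge$sold_at <- as.numeric(bmw_pricing_challenge$sold_at)
bmw_pricing_challenge$mileage <- as.numeric(bmw_pricing_challenge$mileage)
bmw_pricing_challenge$price <- as.numeric(bmw_pricing_challenge$price)
# Binary response derived from price (condition kept as in the original listing)
buy <- as.factor(ifelse(bmw_pricing_challenge$price == 1000, "Yes", "No"))
car_data <- data.frame(bmw_pricing_challenge, buy)
names(car_data)
# Drop maker_key (first column, a single level) and price (buy is derived from it).
# model_key and registration_date are dropped as well, on the assumption that
# tree() rejects factor predictors with more than 32 levels.
car_data <- car_data[, -1]
car_data <- car_data[, !(names(car_data) %in% c("price", "model_key", "registration_date"))]
names(car_data)
# The original listing used the response name "loan"; it is "buy" here throughout
tree.data <- tree(buy ~ ., data = car_data)
plot(tree.data)
text(tree.data, pretty = 0)
# Split the rows in half for training and testing
set.seed(2)
train <- sample(1:nrow(car_data), nrow(car_data) / 2)
test <- -train
training_data <- car_data[train, ]
testing_data <- car_data[test, ]
testing_buy <- car_data$buy[test]
tree_model <- tree(buy ~ ., training_data)
plot(tree_model)
text(tree_model, pretty = 0)
tree_pred <- predict(tree_model, testing_data, type = "class")
mean(tree_pred != testing_buy)  # misclassification rate on the test set
# Cross-validate to choose the tree size, then prune
set.seed(3)
cv_tree <- cv.tree(tree_model, FUN = prune.misclass)
cv_tree
plot(cv_tree$size, cv_tree$dev, type = "b")
pruned_model <- prune.misclass(tree_model, best = 4)
plot(pruned_model)
text(pruned_model, pretty = 0)
tree_pred <- predict(pruned_model, testing_data, type = "class")
mean(tree_pred != testing_buy)  # misclassification rate after pruning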
OUTPUT
5. ACTIONABLE INSIGHTS AND RECOMMENDATIONS
• This dataset concerns the pricing of second-hand BMW cars.
• Most cars bought from second-hand dealers were diesel cars with
low or medium engine power.
• Hybrid petrol and electric cars sold in low numbers even though their prices
were low.