
Data Profiling Report (EDA)

Basic Statistics
Raw Counts
Name                  Value
Rows                  418
Columns               9
Discrete columns      2
Continuous columns    7
All missing columns   0
Missing observations  0
Complete rows         418
Total observations    3,762
Memory allocation     23.5 Kb

Data Structure
Root (Classes 'data.table' and 'data.frame': 418 obs. of 9 variables:)
Age (int)
Gender (Factor w/ 2 levels "Female","Male")
Engineer (int)
MBA (num)
Work.Exp (int)
Salary (num)
Distance (num)
license (int)
Transport (Factor w/ 3 levels "2Wheeler","Car",..)
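The layout of this report (Basic Statistics, Data Structure, Missing Data Profile, Histogram, QQ Plot) matches the output of an automated profiling package. A minimal sketch of how it could be generated, assuming the DataExplorer package and a source file named Cars.csv (both are assumptions, not shown in the report):

# Sketch only: generate the profiling sections shown in this report.
library(data.table)
library(DataExplorer)
Data <- fread("Cars.csv", stringsAsFactors = TRUE)   # assumed input file
introduce(Data)       # raw counts: rows, columns, missing observations, memory
plot_str(Data)        # data structure
plot_missing(Data)    # missing data profile
create_report(Data)   # full EDA report: histograms, bar charts, QQ plots, correlations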
Missing Data Profile

Imputing missing data: Data$MBA[is.na(Data$MBA)] <- 0

Inference:
>Column “MBA” has missing data (blanks); these have been filled with “0”.
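A short sketch of how the blank MBA entries can be located and verified after the zero-fill (assuming the table is named Data as above):

# Sketch: check missing counts before and after imputation.
colSums(is.na(Data))              # only MBA is expected to show missing values
Data$MBA[is.na(Data$MBA)] <- 0    # fill blanks with 0
sum(is.na(Data$MBA))              # should now be 0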
Histogram

Bar Chart (by frequency)

Inference:
Histogram Age – More number of samples are aged between 25-30. Distance –
Average distance is from 7- 17 kilometeres.Salary is averaged between 0-17, &
Experience of sample’s are averaged between 3 – 10.
QQ Plot

Box plot:

Inference:
Only “Salary” shows outliers with respect to the dependent variable.
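A possible sketch of the box plot behind this inference (column names are taken from the data structure above; grouping by Transport is an assumption):

# Sketch: box plots of the continuous variables against the transport mode.
boxplot(Salary ~ Transport, data = Data, main = "Salary by Transport")       # shows the outliers
boxplot(Distance ~ Transport, data = Data, main = "Distance by Transport")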
Multicollinearity
Correlation Analysis
model1 <- glm(Car ~ ., data=Newdata1, family=binomial,control = list(maxit = 250))
car::vif(model1)
      Age  Engineer       MBA   Work.Exp    Salary  Distance   license      Male
 1.260880  1.144516  2.458355  28.208192  9.719509  3.090905  3.671684  2.345866

Inference:
Gender and Work.Exp show high correlation and cause multicollinearity; hence these
columns have been removed.
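A sketch of how the flagged columns could be dropped and the VIFs re-checked (the dummy column name Male and the object name Finaldata are taken from the output above and the next section; model2 is a hypothetical name):

# Sketch: drop the high-VIF columns and confirm the remaining VIFs are acceptable.
Finaldata <- subset(Newdata1, select = -c(Work.Exp, Male))
model2 <- glm(Car ~ ., data = Finaldata, family = binomial,
              control = list(maxit = 250))
car::vif(model2)   # remaining VIFs should now be low (commonly < 5)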

Modelling Data:
> print(table(Finaldata$Car))

0 1
383 35
> print(prop.table(table(Newdata$Car)))

0 1
0.91626794 0.08373206

The positive class (Car = 1) makes up only about 8% of the data, so the data set is highly imbalanced.
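The derivation of the binary target is not shown in the report; a minimal sketch of how Car might have been created from the Transport factor (the report uses the object names Newdata and Finaldata interchangeably; this sketch uses Finaldata):

# Sketch: Car = 1 if the commuter uses a car, 0 otherwise, then the class proportions.
Finaldata$Car <- ifelse(Data$Transport == "Car", 1, 0)
prop.table(table(Finaldata$Car))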


Logistic Regression (Unbalanced Data):
> unbalanced.logit.model <- glm(Car~., data=train, family=binomial(link="logit"), control=list(maxit = 100))
> summary(unbalanced.logit.model)

Call:
glm(formula = Car ~ ., family = binomial(link = "logit"), data = train,
control = list(maxit = 100))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.84938 -0.01215 -0.00232 -0.00036 2.34618

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -41.2313 14.7168 -2.802 0.00508 **
Age 0.6195 0.3678 1.684 0.09211 .
Engineer -1.5730 2.1099 -0.746 0.45594
MBA -0.4184 1.7206 -0.243 0.80785
Salary 0.1962 0.1123 1.747 0.08061 .
Distance 1.0686 0.4720 2.264 0.02357 *
license -0.1616 2.2949 -0.070 0.94386
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 179.159 on 312 degrees of freedom


Residual deviance: 13.597 on 306 degrees of freedom
AIC: 27.597

Number of Fisher Scoring iterations: 11


Inference:
Only “Distance” is significant (p < 0.05) with respect to the dependent variable “Car”.

Predicting with Test Data:


> test$log.pred<-predict(unbalanced.logit.model, newdata = test, type="response")
> table(test$Car,test$log.pred>0.5)

FALSE TRUE
0 96 0
1 2 7
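For reference, a short sketch of the metrics implied by this confusion matrix, using caret's confusionMatrix with the same 0.5 cut-off (the caret package is an assumption; it is not loaded anywhere in the report):

# Sketch: accuracy, sensitivity and specificity for the unbalanced logistic model,
# with car users (Car = 1) as the positive class.
library(caret)
pred.class <- factor(as.integer(test$log.pred > 0.5), levels = c(0, 1))
confusionMatrix(data = pred.class,
                reference = factor(test$Car, levels = c(0, 1)),
                positive = "1")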

KNN:
> knn_fit<- knn(train, test, cl= train$Car,k = 3,prob=TRUE)
> table(knn_fit,test$Car)

knn_fit 0 1
0 96 1
1 0 8

Naïve Bayes:
> nb_gd<-naiveBayes(x=train[,1:6], y=as.factor(train[,7]))
> pred_nb<-predict(nb_gd,newdata = test[,1:6])
> table(test[,7],pred_nb)
pred_nb
0 1
0 92 4
1 1 8

SMOTE (Balanced Data)


balanced.data <- SMOTE(Car ~., smote.train, perc.over = 100, k = 5, perc.under = 800)
> print(table(balanced.data$Car))

0 1
160 40
> print(prop.table(table(balanced.data$Car)))

0 1
0.8 0.2
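The SMOTE call above assumes that Car is a factor and that a smote.train split already exists; a minimal sketch of the assumed setup (the DMwR package and the 70/30 split proportion are assumptions):

# Sketch: DMwR::SMOTE needs a factor target; smote.train / smote.test come from a random split.
library(DMwR)
set.seed(123)                                    # assumed seed
Finaldata$Car <- as.factor(Finaldata$Car)
idx         <- sample(seq_len(nrow(Finaldata)), size = 0.7 * nrow(Finaldata))
smote.train <- Finaldata[idx, ]
smote.test  <- Finaldata[-idx, ]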

XGBoost:
> smote_features_test<-as.matrix(smote.test[,1:7])
> smote.test$smote.pred.class <- predict(smote.xgb.fit, smote_features_test)
> table(smote.test$Car,smote.test$smote.pred.class>=0.5)

FALSE TRUE
0 115 0
1 0 11
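The smote.xgb.fit object used above is never defined in the report; a hedged sketch of how it might have been trained with the xgboost package (the column indices mirror the smote_features_test construction above, and the number of rounds, learning rate and depth are assumptions):

# Sketch only: binary-logistic XGBoost model on the SMOTE-balanced data.
library(xgboost)
smote_features_train <- as.matrix(balanced.data[, 1:7])
smote_label_train    <- as.numeric(as.character(balanced.data$Car))
smote.xgb.fit <- xgboost(data = smote_features_train, label = smote_label_train,
                         objective = "binary:logistic",
                         nrounds = 50, eta = 0.3, max_depth = 3, verbose = 0)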
Logistic Regression (With Balanced Data)
> logit.model3 <- bayesglm(Car~., data=balanced.train, family=binomial(link="logit"))
> summary(logit.model3)

Call:
bayesglm(formula = Car ~ ., family = binomial(link = "logit"),
data = balanced.train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.27681 -0.02164 -0.00330 -0.00020 1.36543

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -52.90466 14.66170 -3.608 0.000308 ***
Age 1.22884 0.39027 3.149 0.001640 **
Engineer -0.42372 1.19192 -0.355 0.722219
MBA -1.43326 1.23631 -1.159 0.246334
Salary 0.03051 0.05858 0.521 0.602447
Distance 0.89766 0.24948 3.598 0.000320 ***
license 1.41895 1.26419 1.122 0.261686
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 250.346 on 263 degrees of freedom


Residual deviance: 12.425 on 257 degrees of freedom
AIC: 26.425

Number of Fisher Scoring iterations: 35

Age and Distance are the significant variables.

Re-running the logistic regression model with only Age and Distance.
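The refit command itself is not printed; it was presumably of the form below (the object name logit.model4 is inferred from the prediction step later in the report, and the arm package provides bayesglm):

# Sketch: reduced model with only the significant predictors, on the balanced training data.
library(arm)
logit.model4 <- bayesglm(Car ~ Age + Distance, data = balanced.train,
                         family = binomial(link = "logit"))
summary(logit.model4)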


Call:
bayesglm(formula = Car ~ Age + Distance, family = binomial(link = "logit"),
    data = balanced.train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.39500 -0.03094 -0.00538 -0.00049 1.78435

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -50.3089 12.4868 -4.029 0.000056 ***
Age 1.1571 0.3112 3.718 0.000201 ***
Distance 0.9080 0.2398 3.786 0.000153 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 250.346 on 263 degrees of freedom


Residual deviance: 16.576 on 261 degrees of freedom
AIC: 22.576

Number of Fisher Scoring iterations: 27


KNN with Balanced Data
library(class)
> knn_fit<- knn(train = logit.train[,1:6], test = logit.test[,1:6], cl= logit.train[,7], k = 3, prob=TRUE)
> table(logit.test[,7],knn_fit)
knn_fit
0 1
0 65 0
1 1 13

Naïve Bayes with Balanced Data.


> nb_gd<-naiveBayes(x=logit.train[,1:6], y=as.factor(logit.train[,7]))
> pred_nb<-predict(nb_gd,newdata = logit.test[,1:6])
> table(logit.test[,7],pred_nb)
pred_nb
0 1
0 64 1
1 2 12

Confusion Matrix:

For logistic Regression:


> logit.test$log.pred<-predict(logit.model4, newdata = logit.test, type="response")
> table(logit.test$Car,logit.test$log.pred>0.5)

FALSE TRUE
0 65 0
1 1 13
For KNN
> table(logit.test[,7],knn_fit)
knn_fit
0 1
0 65 0
1 1 13
For Naïve Bayes
> table(logit.test[,7],pred_nb)
pred_nb
0 1
0 64 1
1 2 12

Inference:

The logistic regression and KNN models produce similar results in the confusion matrix.
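A small sketch that puts the three balanced-data confusion matrices on a common footing (the counts are copied from the tables above; the helper function is purely illustrative):

# Sketch: accuracy and sensitivity from the confusion matrices, positive class = 1 (car users).
metrics <- function(tn, fp, fn, tp) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn))
}
rbind(logistic    = metrics(tn = 65, fp = 0, fn = 1, tp = 13),
      knn         = metrics(tn = 65, fp = 0, fn = 1, tp = 13),
      naive_bayes = metrics(tn = 64, fp = 1, fn = 2, tp = 12))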
Bagging and Boosting:
Bagging
Car.bagging <- bagging(Car ~ ., data = train,
                       control = rpart.control(maxdepth = 5, minsplit = 4))

test$pred.class <- predict(Car.bagging, test)
test$pred.class <- ifelse(test$pred.class < 0.5, 0, 1)
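The call that produced the statistics block below is not shown; presumably it was caret's confusionMatrix on the thresholded bagging predictions, along the lines of (caret assumed):

# Sketch: confusion matrix and statistics for the bagging model, Car = 1 as positive class.
library(caret)
confusionMatrix(data = factor(test$pred.class, levels = c(0, 1)),
                reference = factor(test$Car, levels = c(0, 1)),
                positive = "1")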
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 96 2
1 0 7

Accuracy : 0.981
95% CI : (0.9329, 0.9977)
No Information Rate : 0.9143
P-Value [Acc > NIR] : 0.004822

Kappa : 0.8649

Mcnemar's Test P-Value : 0.479500

Sensitivity : 0.77778
Specificity : 1.00000
Pos Pred Value : 1.00000
Neg Pred Value : 0.97959
Prevalence : 0.08571
Detection Rate : 0.06667
Detection Prevalence : 0.06667
Balanced Accuracy : 0.88889

'Positive' Class : 1

> table(test$Car,test$pred.class>0.5)

FALSE TRUE
0 96 0
1 2 7

Boosting:
gbm.fit <- gbm(formula = Car ~ ., distribution = "bernoulli", data = train,
               n.trees = 10000, interaction.depth = 1, shrinkage = 0.001,
               cv.folds = 5, n.cores = NULL, verbose = FALSE)

test$pred.car <- predict(gbm.fit, test, type = "response")

table(test$Car,test$pred.car>0.5)

confusionMatrix(data=factor(as.integer(test$pred.car>0.5)), reference=factor(test$Car), positive='1')

> table(test$Car,test$pred.car>0.5)

FALSE TRUE
0 96 0
1 1 8

Insights & Recommendations

>After balancing the data (through SMOTE), logistic regression and KNN have similar output,
and both models give similar true positives and true negatives in the confusion matrix.
>Both Bagging and Boosting provided lower accuracy than the logistic regression and KNN models.
