
Data Profiling Report (EDA)

Basic Statistics
Raw Counts
Name                  Value
Rows                  418
Columns               9
Discrete columns      2
Continuous columns    7
All missing columns   0
Missing observations  0
Complete rows         418
Total observations    3,762
Memory allocation     23.5 Kb

Data Structure
Root (Classes 'data.table' and 'data.frame': 418 obs. of 9 variables:)
Age (int)
Gender (Factor w/ 2 levels "Female","Male")
Engineer (int)
MBA (num)
Work.Exp (int)
Salary (num)
Distance (num)
license (int)
Transport (Factor w/ 3 levels "2Wheeler","Car",..)
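The layout of this report (Basic Statistics, Data Structure, Missing Data Profile, Histogram, QQ Plot) matches the output of an automated profiling package. A minimal sketch of how it could be generated, assuming the DataExplorer package and a source file named Cars.csv (both are assumptions, not shown in the report):

# Sketch only: generate the profiling sections shown in this report.
library(data.table)
library(DataExplorer)
Data <- fread("Cars.csv", stringsAsFactors = TRUE)   # assumed input file
introduce(Data)       # raw counts: rows, columns, missing observations, memory
plot_str(Data)        # data structure
plot_missing(Data)    # missing data profile
create_report(Data)   # full EDA report: histograms, bar charts, QQ plots, correlations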
Missing Data Profile

Imputing missing data: Data$MBA[is.na(Data$MBA)] <- 0

Inference:
>Column “MBA” has missing data (blanks); these have been filled with “0”.
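A short sketch of how the blank MBA entries can be located and verified after the zero-fill (assuming the table is named Data as above):

# Sketch: check missing counts before and after imputation.
colSums(is.na(Data))              # only MBA is expected to show missing values
Data$MBA[is.na(Data$MBA)] <- 0    # fill blanks with 0
sum(is.na(Data$MBA))              # should now be 0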
Histogram

Bar Chart (by frequency)

Inference:
Histogram Age – More number of samples are aged between 25-30. Distance –
Average distance is from 7- 17 kilometeres.Salary is averaged between 0-17, &
Experience of sample’s are averaged between 3 – 10.
QQ Plot

Box plot:

Inference:
Only “Salary” shows outliers with respect to the dependent variable.
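A possible sketch of the box plot behind this inference (column names are taken from the data structure above; grouping by Transport is an assumption):

# Sketch: box plots of the continuous variables against the transport mode.
boxplot(Salary ~ Transport, data = Data, main = "Salary by Transport")       # shows the outliers
boxplot(Distance ~ Transport, data = Data, main = "Distance by Transport")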
Multicollinearity
Correlation Analysis
model1 <- glm(Car ~ ., data=Newdata1, family=binomial,control = list(maxit = 250))
car::vif(model1)
      Age  Engineer       MBA   Work.Exp    Salary  Distance   license      Male
 1.260880  1.144516  2.458355  28.208192  9.719509  3.090905  3.671684  2.345866

Inference:
Gender and Work.Exp show high correlation and cause multicollinearity; hence these
columns have been removed.
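A sketch of how the flagged columns could be dropped and the VIFs re-checked (the dummy column name Male and the object name Finaldata are taken from the output above and the next section; model2 is a hypothetical name):

# Sketch: drop the high-VIF columns and confirm the remaining VIFs are acceptable.
Finaldata <- subset(Newdata1, select = -c(Work.Exp, Male))
model2 <- glm(Car ~ ., data = Finaldata, family = binomial,
              control = list(maxit = 250))
car::vif(model2)   # remaining VIFs should now be low (commonly < 5)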

Modelling Data:
> print(table(Finaldata$Car))

0 1
383 35
> print(prop.table(table(Newdata$Car)))

0 1
0.91626794 0.08373206

The positive class (Car = 1) makes up only about 8% of the data, so the data set is highly imbalanced.
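The derivation of the binary target is not shown in the report; a minimal sketch of how Car might have been created from the Transport factor (the report uses the object names Newdata and Finaldata interchangeably; this sketch uses Finaldata):

# Sketch: Car = 1 if the commuter uses a car, 0 otherwise, then the class proportions.
Finaldata$Car <- ifelse(Data$Transport == "Car", 1, 0)
prop.table(table(Finaldata$Car))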


Logistic Regression (Unbalanced Data):
> unbalanced.logit.model <- glm(Car~., data=train, family=binomial(link="logit"), control=list(maxit = 100))
> summary(unbalanced.logit.model)

Call:
glm(formula = Car ~ ., family = binomial(link = "logit"), data = train,
control = list(maxit = 100))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.84938 -0.01215 -0.00232 -0.00036 2.34618

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -41.2313 14.7168 -2.802 0.00508 **
Age 0.6195 0.3678 1.684 0.09211 .
Engineer -1.5730 2.1099 -0.746 0.45594
MBA -0.4184 1.7206 -0.243 0.80785
Salary 0.1962 0.1123 1.747 0.08061 .
Distance 1.0686 0.4720 2.264 0.02357 *
license -0.1616 2.2949 -0.070 0.94386
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 179.159 on 312 degrees of freedom


Residual deviance: 13.597 on 306 degrees of freedom
AIC: 27.597

Number of Fisher Scoring iterations: 11


Inference:
Only “Distance” is significant (p < 0.05) with respect to the dependent variable “Car”.

Predicting with Test Data:


> test$log.pred<-predict(unbalanced.logit.model, newdata = test, type="response")
> table(test$Car,test$log.pred>0.5)

FALSE TRUE
0 96 0
1 2 7
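For reference, a short sketch of the metrics implied by this confusion matrix, using caret's confusionMatrix with the same 0.5 cut-off (the caret package is an assumption; it is not loaded anywhere in the report):

# Sketch: accuracy, sensitivity and specificity for the unbalanced logistic model,
# with car users (Car = 1) as the positive class.
library(caret)
pred.class <- factor(as.integer(test$log.pred > 0.5), levels = c(0, 1))
confusionMatrix(data = pred.class,
                reference = factor(test$Car, levels = c(0, 1)),
                positive = "1")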

KNN:
> knn_fit<- knn(train, test, cl= train$Car,k = 3,prob=TRUE)
> table(knn_fit,test$Car)

knn_fit 0 1
0 96 1
1 0 8

Naïve Bayes:
> nb_gd<-naiveBayes(x=train[,1:6], y=as.factor(train[,7]))
> pred_nb<-predict(nb_gd,newdata = test[,1:6])
> table(test[,7],pred_nb)
pred_nb
0 1
0 92 4
1 1 8

SMOTE (Balanced Data)


balanced.data <- SMOTE(Car ~., smote.train, perc.over = 100, k = 5, perc.under = 800)
> print(table(balanced.data$Car))

0 1
160 40
> print(prop.table(table(balanced.data$Car)))

0 1
0.8 0.2
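The SMOTE call above assumes that Car is a factor and that a smote.train split already exists; a minimal sketch of the assumed setup (the DMwR package and the 70/30 split proportion are assumptions):

# Sketch: DMwR::SMOTE needs a factor target; smote.train / smote.test come from a random split.
library(DMwR)
set.seed(123)                                    # assumed seed
Finaldata$Car <- as.factor(Finaldata$Car)
idx         <- sample(seq_len(nrow(Finaldata)), size = 0.7 * nrow(Finaldata))
smote.train <- Finaldata[idx, ]
smote.test  <- Finaldata[-idx, ]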

XGBoost:
> smote_features_test<-as.matrix(smote.test[,1:7])
> smote.test$smote.pred.class <- predict(smote.xgb.fit, smote_features_test)
> table(smote.test$Car,smote.test$smote.pred.class>=0.5)

FALSE TRUE
0 115 0
1 0 11
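The smote.xgb.fit object used above is never defined in the report; a hedged sketch of how it might have been trained with the xgboost package (the column indices mirror the smote_features_test construction above, and the number of rounds, learning rate and depth are assumptions):

# Sketch only: binary-logistic XGBoost model on the SMOTE-balanced data.
library(xgboost)
smote_features_train <- as.matrix(balanced.data[, 1:7])
smote_label_train    <- as.numeric(as.character(balanced.data$Car))
smote.xgb.fit <- xgboost(data = smote_features_train, label = smote_label_train,
                         objective = "binary:logistic",
                         nrounds = 50, eta = 0.3, max_depth = 3, verbose = 0)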
Logistic Regression (With Balanced Data)
> logit.model3 <- bayesglm(Car~., data=balanced.train, family=binomial(link="logit"))
> summary(logit.model3)

Call:
bayesglm(formula = Car ~ ., family = binomial(link = "logit"),
data = balanced.train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.27681 -0.02164 -0.00330 -0.00020 1.36543

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -52.90466 14.66170 -3.608 0.000308 ***
Age 1.22884 0.39027 3.149 0.001640 **
Engineer -0.42372 1.19192 -0.355 0.722219
MBA -1.43326 1.23631 -1.159 0.246334
Salary 0.03051 0.05858 0.521 0.602447
Distance 0.89766 0.24948 3.598 0.000320 ***
license 1.41895 1.26419 1.122 0.261686
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 250.346 on 263 degrees of freedom


Residual deviance: 12.425 on 257 degrees of freedom
AIC: 26.425

Number of Fisher Scoring iterations: 35

Age and Distance are the significant variables.

Re-running the logistic regression model with only Age and Distance.
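The refit command itself is not printed; it was presumably of the form below (the object name logit.model4 is inferred from the prediction step later in the report, and the arm package provides bayesglm):

# Sketch: reduced model with only the significant predictors, on the balanced training data.
library(arm)
logit.model4 <- bayesglm(Car ~ Age + Distance, data = balanced.train,
                         family = binomial(link = "logit"))
summary(logit.model4)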


Call:
bayesglm(formula = Car ~ Age + Distance, family = binomial(link = "logit"),
    data = balanced.train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.39500 -0.03094 -0.00538 -0.00049 1.78435

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -50.3089 12.4868 -4.029 0.000056 ***
Age 1.1571 0.3112 3.718 0.000201 ***
Distance 0.9080 0.2398 3.786 0.000153 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 250.346 on 263 degrees of freedom


Residual deviance: 16.576 on 261 degrees of freedom
AIC: 22.576

Number of Fisher Scoring iterations: 27


KNN with Balanced Data
library(class)
> knn_fit<- knn(train = logit.train[,1:6], test = logit.test[,1:6], cl= logit.train[,7], k = 3, prob=TRUE)
> table(logit.test[,7],knn_fit)
knn_fit
0 1
0 65 0
1 1 13

Naïve Bayes with Balanced Data.


> nb_gd<-naiveBayes(x=logit.train[,1:6], y=as.factor(logit.train[,7]))
> pred_nb<-predict(nb_gd,newdata = logit.test[,1:6])
> table(logit.test[,7],pred_nb)
pred_nb
0 1
0 64 1
1 2 12

Confusion Matrix:

For logistic Regression:


> logit.test$log.pred<-predict(logit.model4, newdata = logit.test, type="response")
> table(logit.test$Car,logit.test$log.pred>0.5)

FALSE TRUE
0 65 0
1 1 13
For KNN
> table(logit.test[,7],knn_fit)
knn_fit
0 1
0 65 0
1 1 13
For Naïve Bayes
> table(logit.test[,7],pred_nb)
pred_nb
0 1
0 64 1
1 2 12

Inference:

The logistic regression and KNN models produce similar results in the confusion matrix.
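A small sketch that puts the three balanced-data confusion matrices on a common footing (the counts are copied from the tables above; the helper function is purely illustrative):

# Sketch: accuracy and sensitivity from the confusion matrices, positive class = 1 (car users).
metrics <- function(tn, fp, fn, tp) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn))
}
rbind(logistic    = metrics(tn = 65, fp = 0, fn = 1, tp = 13),
      knn         = metrics(tn = 65, fp = 0, fn = 1, tp = 13),
      naive_bayes = metrics(tn = 64, fp = 1, fn = 2, tp = 12))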
Bagging and Boosting:
Bagging
Car.bagging <- bagging(Car ~ ., data = train,
                       control = rpart.control(maxdepth = 5, minsplit = 4))

test$pred.class <- predict(Car.bagging, test)
test$pred.class <- ifelse(test$pred.class < 0.5, 0, 1)
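The call that produced the statistics block below is not shown; presumably it was caret's confusionMatrix on the thresholded bagging predictions, along the lines of (caret assumed):

# Sketch: confusion matrix and statistics for the bagging model, Car = 1 as positive class.
library(caret)
confusionMatrix(data = factor(test$pred.class, levels = c(0, 1)),
                reference = factor(test$Car, levels = c(0, 1)),
                positive = "1")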
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 96 2
1 0 7

Accuracy : 0.981
95% CI : (0.9329, 0.9977)
No Information Rate : 0.9143
P-Value [Acc > NIR] : 0.004822

Kappa : 0.8649

Mcnemar's Test P-Value : 0.479500

Sensitivity : 0.77778
Specificity : 1.00000
Pos Pred Value : 1.00000
Neg Pred Value : 0.97959
Prevalence : 0.08571
Detection Rate : 0.06667
Detection Prevalence : 0.06667
Balanced Accuracy : 0.88889

'Positive' Class : 1

> table(test$Car,test$pred.class>0.5)

FALSE TRUE
0 96 0
1 2 7

Boosting:
gbm.fit <- gbm(formula = Car ~ ., distribution = "bernoulli", data = train,
               n.trees = 10000, interaction.depth = 1, shrinkage = 0.001,
               cv.folds = 5, n.cores = NULL, verbose = FALSE)

test$pred.car <- predict(gbm.fit, test, type = "response")

table(test$Car,test$pred.car>0.5)

confusionMatrix(data=factor(as.integer(test$pred.car>0.5)), reference=factor(test$Car), positive='1')

> table(test$Car,test$pred.car>0.5)

FALSE TRUE
0 96 0
1 1 8

Insights & Recommendations

>After balancing the data (through SMOTE), logistic regression and KNN have similar output,
and both models give similar true positives and true negatives in the confusion matrix.
>Both Bagging and Boosting provided lower accuracy than the logistic regression and KNN models.
