Machine Learning - Project
Basic Statistics
Raw Counts
Name Value
Rows 418
Columns 9
Discrete columns 2
Continuous columns 7
Missing observations 0
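This profile resembles the "Basic Statistics" section of a DataExplorer report. A minimal sketch of how it could be produced, assuming that package and a dataset loaded under the name Newdata (the file name Cars.csv is an assumption):

# Hedged sketch: generate the basic-statistics profile with DataExplorer.
library(data.table)
library(DataExplorer)
Newdata <- fread("Cars.csv")   # assumed file name for the commute dataset
introduce(Newdata)             # row/column counts, discrete vs continuous, missing values
str(Newdata)                   # the data-structure listing shown below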
Data Structure
Root (Classes 'data.table' and 'data.frame': 418 obs. of 9 variables:)
Age (int)
Gender (Factor w/ 2 levels "Female","Male")
Engineer (int)
MBA (num)
Work.Exp (int)
Salary (num)
Distance (num)
license (int)
Transport (Factor w/ 3 levels "2Wheeler","Car",..)
Missing Data Profile
Inference:
> Column “MBA” has missing data (blanks); these missing values have been filled with “0”.
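A minimal sketch of that imputation step, assuming the dataset name Newdata used above:

# Fill the blank (NA) MBA values with 0, as stated in the inference above.
Newdata$MBA[is.na(Newdata$MBA)] <- 0
sum(is.na(Newdata$MBA))   # 0 after imputation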
Histogram
Inference:
Age – most samples are aged between 25 and 30. Distance – typical commute distances fall between 7 and 17 kilometres. Salary – values are concentrated between 0 and 17. Work.Exp – most samples have between 3 and 10 years of experience.
QQ Plot
Box plot:
Inference:
Only “Salary” shows outliers with respect to the dependent variable.
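A minimal sketch of the exploratory plots behind these inferences, using base R graphics and the assumed Newdata name:

# Histograms, QQ plot and box plot used in the EDA above.
hist(Newdata$Age, main = "Age", xlab = "Age (years)")
hist(Newdata$Distance, main = "Distance", xlab = "Distance (km)")
qqnorm(Newdata$Salary); qqline(Newdata$Salary)   # QQ plot for Salary
boxplot(Salary ~ Transport, data = Newdata,
        main = "Salary by Transport")            # reveals the Salary outliers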
Multicollinearity
Correlation Analysis
# Fit the full logistic model and check variance inflation factors (VIF).
model1 <- glm(Car ~ ., data = Newdata1, family = binomial, control = list(maxit = 250))
car::vif(model1)
     Age Engineer      MBA  Work.Exp   Salary Distance  license     Male
1.260880 1.144516 2.458355 28.208192 9.719509 3.090905 3.671684 2.345866
Inference:
Gender and Work.Exp show high correlation with the remaining predictors (Work.Exp has a VIF of 28.2, far above the common threshold of 5–10), causing multicollinearity. Hence these columns have been removed.
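A minimal sketch of dropping those columns, assuming Newdata1 is a plain data.frame and the gender column is still named Gender:

# Remove the collinear predictors flagged by the VIF check.
Finaldata <- Newdata1[, !(names(Newdata1) %in% c("Gender", "Work.Exp"))]
names(Finaldata)   # remaining predictors plus the Car target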
Modelling Data:
> print(table(Finaldata$Car))
0 1
383 35
> print(prop.table(table(Newdata$Car)))
0 1
0.91626794 0.08373206
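The target is heavily imbalanced (about 8.4% car users). A minimal sketch of one possible train/test split, assuming the caTools package, an arbitrary seed, and a 75/25 ratio consistent with the 105-row test set below:

# Hedged sketch: stratified 75/25 train-test split on the Car flag.
library(caTools)
set.seed(123)   # assumed seed
split <- sample.split(Finaldata$Car, SplitRatio = 0.75)
train <- subset(Finaldata, split == TRUE)
test  <- subset(Finaldata, split == FALSE)

The summary of the full logistic model fitted on this training set follows.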
Call:
glm(formula = Car ~ ., family = binomial(link = "logit"), data = train,
control = list(maxit = 100))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.84938 -0.01215 -0.00232 -0.00036 2.34618
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -41.2313 14.7168 -2.802 0.00508 **
Age 0.6195 0.3678 1.684 0.09211 .
Engineer -1.5730 2.1099 -0.746 0.45594
MBA -0.4184 1.7206 -0.243 0.80785
Salary 0.1962 0.1123 1.747 0.08061 .
Distance 1.0686 0.4720 2.264 0.02357 *
license -0.1616 2.2949 -0.070 0.94386
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Confusion matrix for the logistic model (0.5 cut-off):
  FALSE TRUE
0    96    0
1     2    7
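A minimal sketch of how this confusion matrix can be produced from the fitted model (logit.model is an assumed name for the glm object summarised above):

# Score the test set and cross-tabulate actuals against a 0.5 cut-off.
test$pred.prob <- predict(logit.model, newdata = test, type = "response")
table(test$Car, test$pred.prob > 0.5)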
KNN:
> library(class)
> knn_fit <- knn(train = train[, 1:6], test = test[, 1:6], cl = train$Car, k = 3, prob = TRUE)
> table(knn_fit,test$Car)
knn_fit 0 1
0 96 1
1 0 8
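Note: KNN is distance-based, so in practice the continuous predictors (e.g. Salary, Distance) are usually standardised with scale() before fitting; otherwise large-range variables dominate the distance calculation.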
Naïve Bayes:
> library(e1071)
> nb_gd <- naiveBayes(x = train[, 1:6], y = as.factor(train[, 7]))
> pred_nb <- predict(nb_gd, newdata = test[, 1:6])
> table(test[,7],pred_nb)
pred_nb
0 1
0 92 4
1 1 8
Balancing Data (SMOTE):
> print(table(balanced.data$Car))
  0   1
160  40
> print(prop.table(table(balanced.data$Car)))
0 1
0.8 0.2
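A minimal sketch of how the balanced sample could have been generated, assuming the DMwR package's SMOTE (the perc.over and perc.under values are assumptions, not taken from the original transcript):

# Hedged sketch: rebalance the training data with SMOTE.
library(DMwR)
train$Car <- as.factor(train$Car)   # SMOTE requires a factor target
balanced.data <- SMOTE(Car ~ ., data = train,
                       perc.over = 100, perc.under = 500)
table(balanced.data$Car)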
XGBoost:
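The transcript below uses a fitted object smote.xgb.fit that is not shown. A minimal sketch of how it might have been trained with the xgboost package (the hyper-parameters and the smote.train name are assumptions):

# Hedged sketch: train a binary-logistic XGBoost model on the SMOTE-balanced data.
library(xgboost)
smote_features_train <- as.matrix(smote.train[, 1:7])
smote_label_train    <- as.numeric(as.character(smote.train$Car))
smote.xgb.fit <- xgboost(data = smote_features_train, label = smote_label_train,
                         objective = "binary:logistic", nrounds = 50,
                         max_depth = 3, eta = 0.1, verbose = 0)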
> smote_features_test <- as.matrix(smote.test[, 1:7])
> smote.test$smote.pred.class <- predict(smote.xgb.fit, smote_features_test)
> table(smote.test$Car, smote.test$smote.pred.class >= 0.5)
FALSE TRUE
0 115 0
1 0 11
Logistic Regression (with balanced data)
> library(arm)
> logit.model3 <- bayesglm(Car ~ ., data = balanced.train, family = binomial(link = "logit"))
> summary(logit.model3)
Call:
bayesglm(formula = Car ~ ., family = binomial(link = "logit"),
data = balanced.train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.27681 -0.02164 -0.00330 -0.00020 1.36543
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -52.90466 14.66170 -3.608 0.000308 ***
Age 1.22884 0.39027 3.149 0.001640 **
Engineer -0.42372 1.19192 -0.355 0.722219
MBA -1.43326 1.23631 -1.159 0.246334
Salary 0.03051 0.05858 0.521 0.602447
Distance 0.89766 0.24948 3.598 0.000320 ***
license 1.41895 1.26419 1.122 0.261686
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Refitting after dropping the insignificant predictors (only Age and Distance are retained):
Deviance Residuals:
Min 1Q Median 3Q Max
-1.39500 -0.03094 -0.00538 -0.00049 1.78435
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -50.3089 12.4868 -4.029 0.000056 ***
Age 1.1571 0.3112 3.718 0.000201 ***
Distance 0.9080 0.2398 3.786 0.000153 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
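A minimal sketch of that refit, reusing the bayesglm call with the reduced formula (logit.model4 is an assumed name):

# Refit with only the significant predictors from logit.model3.
logit.model4 <- bayesglm(Car ~ Age + Distance, data = balanced.train,
                         family = binomial(link = "logit"))
summary(logit.model4)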
Confusion Matrix:
FALSE TRUE
0 65 0
1 1 13
For KNN
> table(logit.test[,7],knn_fit)
knn_fit
0 1
0 65 0
1 1 13
For Naïve Bayes
> table(logit.test[,7],pred_nb)
pred_nb
0 1
0 64 1
1 2 12
Inference:
Logistic regression and KNN produce the same confusion matrix on the balanced test data.
Bagging and Boosting:
Bagging
library(ipred)
library(rpart)
Car.bagging <- bagging(Car ~ .,
                       data = train,
                       control = rpart.control(maxdepth = 5, minsplit = 4))
test$pred.class <- predict(Car.bagging, newdata = test)  # score the test set with the bagged model
test$pred.class <- ifelse(test$pred.class < 0.5, 0, 1)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 96 2
1 0 7
Accuracy : 0.981
95% CI : (0.9329, 0.9977)
No Information Rate : 0.9143
P-Value [Acc > NIR] : 0.004822
Kappa : 0.8649
Sensitivity : 0.77778
Specificity : 1.00000
Pos Pred Value : 1.00000
Neg Pred Value : 0.97959
Prevalence : 0.08571
Detection Rate : 0.06667
Detection Prevalence : 0.06667
Balanced Accuracy : 0.88889
'Positive' Class : 1
> table(test$Car,test$pred.class>0.5)
FALSE TRUE
0 96 0
1 2 7
Boosting:
library(gbm)
library(caret)
gbm.fit <- gbm(formula = Car ~ ., distribution = "bernoulli", data = train,
               n.trees = 10000, interaction.depth = 1, shrinkage = 0.001,
               cv.folds = 5, n.cores = NULL, verbose = FALSE)
test$pred.car <- predict(gbm.fit, newdata = test, n.trees = 10000, type = "response")
table(test$Car, test$pred.car > 0.5)
confusionMatrix(data = factor(as.numeric(test$pred.car > 0.5)),
                reference = factor(test$Car), positive = '1')
> table(test$Car,test$pred.car>0.5)
FALSE TRUE
0 96 0
1 1 8
> After balancing the data (through SMOTE), logistic regression and KNN give similar output: both models produce the same true positives and true negatives in the confusion matrix.
> Both bagging and boosting give lower accuracy than the logistic regression and KNN models.