Machine Learning
1. Data Cleaning: treat the data set for missing values and normalize the data.
2. Data Partition: divide the data set into training and testing subsets (generally 80:20).
3. Training/Model Building: train various supervised and unsupervised models on the training set.
4. Testing/Validation: use the confusion matrix to validate the model on the testing set.
Different Models:
KNN Model:
The k-Nearest Neighbours model classifies an observation by a majority vote among its k closest observations in the training set.
NB Model:
In caret, method = "nb" fits a Naive Bayes classifier (via the klaR package). Naive Bayes applies Bayes' theorem under the assumption that the features are conditionally independent given the class. It produces results that are explainable, comparable, defensible and usable, and it is backed by strong, well-understood statistical theory, which makes it a good starting point for classification tasks such as this one.
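The core computation behind a Naive Bayes classifier is Bayes' rule: the posterior for each class is proportional to the class prior times the likelihood of the observed features. A toy base-R illustration (the prior and likelihood numbers here are invented for the example, not taken from the car data):

```r
# P(class | x) is proportional to P(class) * P(x | class)
prior  <- c(front = 0.44, full = 0.56)   # assumed class priors
lik    <- c(front = 0.30, full = 0.70)   # assumed likelihood of one observed feature value
unnorm <- prior * lik                    # numerator of Bayes' rule per class
posterior <- unnorm / sum(unnorm)        # normalize so the posteriors sum to 1
print(round(posterior, 3))               # the class with the larger posterior wins
```

Here the evidence favours "full", so that class would be predicted; a real Naive Bayes model multiplies one likelihood term per feature.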
GLM Model:
The Generalised Linear Model extends linear regression to non-normal response distributions via a link function. With a two-level factor target, caret's method = "glm" fits a binomial GLM, i.e. logistic regression.
https://www.kaggle.com/antfarol/car-sale-advertisements/version/1
Attribute Information:
A data frame with 299 observations of 9 variables: 2 integer variables, 2 numeric variables, and 5 factor variables, one of which (drive) is the target class.
str(rakesh)
'data.frame': 299 obs. of 9 variables:
 $ car         : Factor w/ 28 levels "Alfa Romeo","Audi",..: 7 15 17 8 21 17 3 15 17 12 ...
 $ price       : num 15500 17800 16600 6500 10500 ...
 $ body        : Factor w/ 6 levels "crossover","hatch",..: 1 6 1 4 5 1 4 1 1 1 ...
 $ mileage     : int 68 162 83 199 185 2 2 0 83 0 ...
 $ engV        : num 2.5 1.8 2 2 1.5 1.2 5 3 2 4.4 ...
 $ engType     : Factor w/ 4 levels "Diesel","Gas",..: 2 1 4 4 1 4 4 4 4 1 ...
 $ registration: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ year        : int 2010 2012 2013 2003 2011 2016 2016 2016 2013 2016 ...
 $ drive       : Factor w/ 2 levels "front","full": 2 1 2 1 1 1 2 2 2 2 ...
On the basis of these features the target column is determined. In this data set, drive is the target column to be predicted. There are 299 instances and 9 attributes.
> tail(rakesh)
           car price      body mileage engV engType registration year drive
294    Renault 12499       van      78  2.0  Diesel          yes 2012 front
295     Nissan 15000 crossover      46  1.6  Petrol          yes 2013 front
296     Toyota 37700 crossover      47  5.7     Gas          yes 2009  full
297     Suzuki  9650 crossover      81  2.0  Petrol          yes 2006  full
298 Volkswagen  7000     sedan     185  2.8     Gas          yes 2004  full
299      Honda 13200     sedan     100  1.8  Petrol          yes 2012 front
SYNTAX/CODE:
DATA:
1. Check the working directory (use setwd() to change it if needed):
> getwd()
2. Load the required packages (klaR supplies the Naive Bayes method used by caret):
library(caret)
library(lattice)
library(ggplot2)
library(klaR)
3. Check for missing values:
> sum(is.na(rakesh))
[1] 0
Since there are no missing values, na.omit() is not needed; the data is ready for the next step.
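Had the count above been nonzero, na.omit() would drop every row containing a missing value. A small self-contained illustration on a toy data frame (not the lab data, which is complete):

```r
# Toy frame with two NAs spread over rows 2 and 3
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
sum(is.na(df))        # counts the missing cells: 2
clean <- na.omit(df)  # drops any row containing an NA
nrow(clean)           # only row 1 survives: 1
```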
DATA PARTITIONING:
In this step the data is divided into two parts, training and testing. The ratio can vary; generally Training (70%-80%), Testing (20%-30%) is followed.
> dim(training)
[1] 240 9
> dim(testing)
[1] 59 9
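A split with dimensions like those above is typically produced with caret's createDataPartition (which stratifies on the target, e.g. createDataPartition(rakesh$drive, p = 0.8, list = FALSE)). Since the lab data set is not reproduced here, this sketch shows the same idea with a plain base-R sample() on the built-in iris data:

```r
set.seed(123)
# 80:20 split of the 150 iris rows; for the lab data this would be rakesh
idx       <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
train_set <- iris[idx, ]   # 120 rows
test_set  <- iris[-idx, ]  # 30 rows
dim(train_set)
dim(test_set)
```

Unlike createDataPartition, sample() does not preserve the class proportions of the target, which is why the stratified caret helper is usually preferred for classification.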
Model Building:
• For KNN model:
modelfit1=train(drive~.,data=training,method="knn")
• For NB model:
modelfit2=train(drive~.,data=training,method="nb")
• For GLM model:
modelfit3=train(drive~.,data=training,method="glm")
MODEL VALIDATION:
For KNN model:
> predictions1=predict(modelfit1,newdata=testing)
> confusionMatrix(predictions1,testing$drive)
Reference
Prediction front full
front 19 5
full 7 28
Accuracy : 0.7966
95% CI : (0.6717, 0.8902)
No Information Rate : 0.5593
P-Value [Acc > NIR] : 0.0001191
Kappa : 0.584
Sensitivity : 0.7308
Specificity : 0.8485
Pos Pred Value : 0.7917
Neg Pred Value : 0.8000
Prevalence : 0.4407
Detection Rate : 0.3220
Detection Prevalence : 0.4068
Balanced Accuracy : 0.7896
'Positive' Class : front
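The headline statistics follow directly from the four counts in the confusion matrix above, with "front" as the positive class. Recomputing them in base R:

```r
# Counts from the KNN confusion matrix (rows = prediction, cols = reference)
TP <- 19  # predicted front, actually front
FP <- 5   # predicted front, actually full
FN <- 7   # predicted full, actually front
TN <- 28  # predicted full, actually full

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 47/59 = 0.7966
sensitivity <- TP / (TP + FN)                   # 19/26 = 0.7308
specificity <- TN / (TN + FP)                   # 28/33 = 0.8485
ppv         <- TP / (TP + FP)                   # 19/24 = 0.7917
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv), 4))
```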
For NB model:
> predictions2=predict(modelfit2,newdata=testing)
> confusionMatrix(predictions2,testing$drive)
Reference
Prediction front full
front 23 5
full 3 28
Accuracy : 0.8644
95% CI : (0.7502, 0.9396)
No Information Rate : 0.5593
P-Value [Acc > NIR] : 5.256e-07
Kappa : 0.7272
Sensitivity : 0.8846
Specificity : 0.8485
Pos Pred Value : 0.8214
Neg Pred Value : 0.9032
Prevalence : 0.4407
Detection Rate : 0.3898
Detection Prevalence : 0.4746
Balanced Accuracy : 0.8666
For GLM model:
> predictions3=predict(modelfit3,newdata=testing)
> confusionMatrix(predictions3,testing$drive)
Accuracy : 0.8814
95% CI : (0.7707, 0.9509)
No Information Rate : 0.5593
P-Value [Acc > NIR] : 9.938e-08
Kappa : 0.7603
Sensitivity : 0.8846
Specificity : 0.8788
Pos Pred Value : 0.8519
Neg Pred Value : 0.9062
Prevalence : 0.4407
Detection Rate : 0.3898
Detection Prevalence : 0.4576
Balanced Accuracy : 0.8817
Hence, comparing the parameters above, the GLM model is the best of the three, having higher accuracy, Kappa and specificity than the other two models.
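The reported figures can be placed side by side to make the comparison explicit (values copied from the confusion-matrix outputs above):

```r
# Test-set accuracy and Kappa for the three models
acc <- c(KNN = 0.7966, NB = 0.8644, GLM = 0.8814)
kap <- c(KNN = 0.5840, NB = 0.7272, GLM = 0.7603)
names(which.max(acc))  # best model by accuracy: "GLM"
names(which.max(kap))  # best model by Kappa:    "GLM"
```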
PRACTICAL UTILITY OF THE MODEL: