
Important variables for classifying a message as spam or non-spam:

To find the features that best determine the predicted outcome, we used the “ReliefF”
weighting technique, which estimates the quality of each feature in a given dataset by
assigning it a weight.
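
As an illustration of how such weights arise, here is a minimal sketch of a ReliefF-style estimate for a two-class dataset, written in plain Python. It is not the Orange implementation, which may differ in its details; the function name, the parameters and the toy data are all hypothetical. The idea is that a feature gains weight when it differs between a sample and its nearest neighbours of the other class (misses) and loses weight when it differs between the sample and its nearest neighbours of the same class (hits).

```python
import random


def relieff_weights(rows, labels, n_neighbors=3, n_samples=None, seed=0):
    """Estimate ReliefF-style feature weights for a two-class dataset.

    rows   : list of equal-length lists of numeric feature values
    labels : list of class labels, one per row
    """
    n_features = len(rows[0])
    # Scale every feature difference by the feature's range so that all
    # features contribute on a comparable 0..1 scale.
    spans = []
    for f in range(n_features):
        column = [row[f] for row in rows]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    # Optionally evaluate only a random subset of the rows.
    indices = list(range(len(rows)))
    if n_samples is not None:
        indices = random.Random(seed).sample(indices, n_samples)

    weights = [0.0] * n_features
    for i in indices:
        # All other rows, nearest first.
        others = sorted((j for j in range(len(rows)) if j != i),
                        key=lambda j: distance(rows[i], rows[j]))
        hits = [j for j in others if labels[j] == labels[i]][:n_neighbors]
        misses = [j for j in others if labels[j] != labels[i]][:n_neighbors]
        if not hits or not misses:
            continue
        for f in range(n_features):
            hit_term = sum(diff(f, rows[i], rows[j]) for j in hits) / len(hits)
            miss_term = sum(diff(f, rows[i], rows[j]) for j in misses) / len(misses)
            weights[f] += (miss_term - hit_term) / len(indices)
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
toy_rows = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
            [0.9, 0.2], [1.0, 0.8], [0.8, 0.5]]
toy_labels = ["no", "no", "no", "spam", "spam", "spam"]
toy_weights = relieff_weights(toy_rows, toy_labels, n_neighbors=2)
# Feature indices ordered from highest to lowest weight.
ranking = sorted(range(len(toy_weights)),
                 key=lambda f: toy_weights[f], reverse=True)
```

Sorting the features by weight, as in the last line, gives the kind of ranking that is then read from the top down.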

The Orange Data Mining tool was used to analyse the data and plot the results.

Figure 1: Ranking of variables through ReliefF weights

The figure above shows the ReliefF weights in descending order. The top 5 features
(independent variables) have the highest weights and are the most important in deciding the
predicted outcome of the model.

The following variables are the most important for predicting the outcome:

• Word19
• Word12
• Word46
• Word18
• Word21

These variables are analysed further with various plots to choose the best of the 5:

1. Box Plots – A box plot is generally drawn between one qualitative variable and one
quantitative variable. A quantitative variable that clearly separates the categories of the
qualitative variable is likely to be more important in predicting the outcome.
Figure 2: Box Plot for Variable – Word19

Figure 3: Box Plot for Variable – Word12

Figure 4: Box Plot for Variable – Word46

Figure 5: Box Plot for Variable – Word18


Figure 6: Box Plot for Variable – Word21

2. Distribution – The value distribution plot shows the distribution of the data for each
independent variable across the categories of the dependent variable. This helps us see which
independent variables concentrate their values differently in each output category and can
therefore distinguish between the categories. The distribution plots for the top 5 identified
variables follow.

Figure 7: Word 19

Figure 8: Word 12

Figure 9: Word 46

Figure 10: Word 18

Figure 11: Word 21

From the two analyses above, it is clear that Word19 and Word21 are the most important
variables for predicting whether a mail is spam or not.
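
The separation that the box plots and distribution plots show visually can also be checked numerically. The sketch below, in plain Python, computes the five-number summary that a box plot draws for each class and tests whether the two boxes (Q1 to Q3) overlap. The function names and the toy frequencies are hypothetical and are not taken from the dataset.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summarise_by_class(values, labels):
    """Group one numeric feature by class label and summarise each group."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: five_number_summary(group)
            for label, group in groups.items()}


def iqr_boxes_overlap(summary_a, summary_b):
    """True if the Q1..Q3 boxes of the two summaries overlap."""
    return summary_a[1] <= summary_b[3] and summary_b[1] <= summary_a[3]


# Hypothetical word frequencies for one feature, with the class of each mail.
toy_values = [0.0, 0.0, 0.1, 0.2, 0.3, 1.0, 1.5, 2.0, 2.5, 3.0]
toy_classes = ["No"] * 5 + ["Yes"] * 5
toy_summary = summarise_by_class(toy_values, toy_classes)
well_separated = not iqr_boxes_overlap(toy_summary["No"], toy_summary["Yes"])
```

A feature whose boxes do not overlap between the two classes is the kind of variable that the plots single out as a good predictor.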

Model creation and evaluation for proposing the best model to predict
whether a mail is spam or not

CRISP-DM framework for data analysis and model development

1. Business understanding – Email plays an important part in our lives, especially for working
professionals. The increasing use of email for communication has also attracted many
fraudsters who try to compromise personal data. This data analysis studies several models
and chooses the best one for detecting spam email, so that we can be sure which mail is
safe and which is not. The hypotheses tested are:
H0 – Mail is not spam
H1 – Mail is spam
2. Data Understanding – The given dataset has 1 dependent variable and 57 independent
variables, each of which gives the average frequency of a word occurring in an email.
Further statistics are given below:
3. Data Preparation –
The given dataset is divided into 2 parts:
a) Training Dataset – 70% of the original dataset, used to train and develop our models
b) Validation Dataset – 30% of the original dataset, disjoint from the training dataset, used
to validate our models and choose the best of them.
4. Model Development –
The supervised learning models that can be tested on this data are as follows:
• C5 classification
• RPART classification
• CTREE classification
• Binary logistic regression
All of the above modelling techniques have been trained on the dataset. The complete
model development code was written in R Studio and published on RPubs. The results can be
seen at this link:
https://rpubs.com/krishece11/716251
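
The 70/30 split described in the data preparation step can be sketched as follows. This is a plain Python illustration, not the R code published at the link above; the function name, the seed and the use of a simple random (non-stratified) shuffle are assumptions.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the row indices and cut them into two disjoint parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    validation = [rows[i] for i in indices[cut:]]
    return train, validation


# Hypothetical stand-in for the dataset: 100 row identifiers.
toy_rows = list(range(100))
toy_train, toy_validation = train_validation_split(toy_rows)
```

Fixing the seed keeps the split reproducible, so every model is trained and validated on the same rows.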

5. Model Evaluation –
To evaluate the models developed on the training dataset, we used the validation
dataset to test the predicted outcomes. Based on the confusion matrix, we used measures
such as accuracy, No Information Rate (NIR), sensitivity and specificity to select the best model.
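
As a reminder of how these measures follow from a confusion matrix, here is a minimal sketch in plain Python. The function names and the toy labels are hypothetical; the positive class is taken to be "No", as in the model evaluation table.

```python
def confusion_counts(actual, predicted, positive="No"):
    """Count true positives, false negatives, false positives, true negatives."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def evaluate(actual, predicted, positive="No"):
    """Accuracy, sensitivity and specificity for a two-class prediction."""
    tp, fn, fp, tn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
    }


# Hypothetical validation labels: 6 mails that are not spam, 4 that are.
toy_actual = ["No"] * 6 + ["Yes"] * 4
toy_mixed = ["No"] * 5 + ["Yes"] + ["Yes"] * 3 + ["No"]
mixed_scores = evaluate(toy_actual, toy_mixed)
# A model that answers "No" for every mail.
always_no_scores = evaluate(toy_actual, ["No"] * 10)
```

A model that answers "No" for every mail reaches a sensitivity of 1 and a specificity of 0, which is why the three measures have to be read together.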

Table 1: Pivot table from the given dataset with the actual number of mails that are spam or not

Therefore, its No Information Rate (NIR) = 13940/23005 ≈ 60.60%


For a model to be valid, its accuracy must be higher than the NIR.
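
The NIR is the accuracy obtained by always predicting the most frequent class. A minimal Python sketch of this check, using the counts quoted above (13940 of 23005 mails in the larger class), might look like this. The function names are hypothetical, and labelling the larger class as "No" is an assumption.

```python
from collections import Counter


def no_information_rate(labels):
    """Share of the most frequent class: the accuracy of always guessing it."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def beats_nir(accuracy, nir):
    """A model is only useful if its accuracy exceeds the NIR."""
    return accuracy > nir


# Counts quoted in the text; the "No"/"Yes" labels are an assumption.
all_labels = ["No"] * 13940 + ["Yes"] * (23005 - 13940)
nir = no_information_rate(all_labels)
c5_is_valid = beats_nir(0.9957, nir)  # 0.9957 is the C5 accuracy reported in Table 2
```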

Table 2: Model evaluation table

                 C5       RPART    CTREE    Logistic Regression
Accuracy         0.9957   0.6091   0.9423   0.9341
NIR              60.60%
Positive Class   No
Sensitivity      0.9976   1        0.9465   0.8977
Specificity      0.9926   0        0.9359   0.9574

From the table above, the C5 classification tree model has the highest accuracy and
specificity when predicting the output on the validation dataset. Its sensitivity is second
only to RPART, whose sensitivity of 1 and specificity of 0 show that it labelled every mail
as not spam. Therefore, C5 is the best model for predicting whether a mail is spam or not.
