Top 5 variables for identifying spam emails

Important variables in classifying a message as spam or non-spam:
To find the features that best explain the predicted outcome, we used the “ReliefF” weighting
technique, which estimates the quality of each feature in a dataset by assigning it a weight.
The Orange Data Mining tool was used to analyse the data and plot the results.
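The weighting idea can be illustrated with a minimal sketch. This is a simplified two-class Relief (one nearest hit and one nearest miss per row), not the full ReliefF that Orange implements, and the function name and toy data are hypothetical. A feature gains weight when it differs between a row and its nearest neighbour of the other class, and loses weight when it differs from its nearest neighbour of the same class.

```python
def relief_weights(rows, labels):
    """Simplified two-class Relief: one nearest hit and one nearest miss per row."""
    n_features = len(rows[0])
    # Scale each feature difference by the feature's range so features are comparable.
    ranges = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        ranges.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / ranges[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(n_features))

    weights = [0.0] * n_features
    for i, row in enumerate(rows):
        hits = [k for k in range(len(rows)) if k != i and labels[k] == labels[i]]
        misses = [k for k in range(len(rows)) if labels[k] != labels[i]]
        nearest_hit = rows[min(hits, key=lambda k: distance(row, rows[k]))]
        nearest_miss = rows[min(misses, key=lambda k: distance(row, rows[k]))]
        for j in range(n_features):
            # Reward separation from the other class, penalise spread within the class.
            weights[j] += (diff(j, row, nearest_miss) - diff(j, row, nearest_hit)) / len(rows)
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
toy_rows = [[0.0, 0.5], [0.1, 0.9], [0.2, 0.1], [0.9, 0.4], [1.0, 0.8], [0.8, 0.2]]
toy_labels = ["No", "No", "No", "Yes", "Yes", "Yes"]
toy_weights = relief_weights(toy_rows, toy_labels)

# Rank feature indices by weight, highest first, as in a ReliefF ranking chart.
ranking = sorted(range(len(toy_weights)), key=lambda j: toy_weights[j], reverse=True)
print(ranking, toy_weights)
```

On this toy data the separating feature is ranked first and the noisy feature receives a negative weight.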
Figure 1: Ranking of variables through ReliefF weights
The figure above shows the ReliefF weights in descending order. The top 5 features
(independent variables) have the highest weights and are the most important in deciding the
predicted outcome of the model.
The important variables for predicting the outcome are:
• Word19
• Word12
• Word46
• Word18
• Word21
The five variables are analysed further with the following plots to choose the best among them:
1. Box plots – A box plot is generally drawn for one quantitative variable against one
qualitative variable. A quantitative variable that clearly separates the categories of the
qualitative variable is more important in predicting the outcome.
Figure 2: Box Plot for Variable – Word19

2. Distribution – The value distribution plot shows the distribution of each independent
variable for each category of the dependent variable. It helps us see where the values of each
independent variable are concentrated, and therefore which variables clearly separate the
output categories. The distribution plots for the top 5 variables follow.
Figure 7: Word 19
Figure 8: Word 12
Figure 9: Word 46
Figure 10: Word 18
Figure 11: Word 21
From the two analyses above, it is clear that Word19 and Word21 are the most important
variables for predicting whether a mail is spam or not.
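The box plot comparison can also be expressed numerically. The sketch below computes, for one variable, the five values a box plot draws for each class, and then checks whether the two boxes overlap. The frequencies and the function names are hypothetical and are not taken from the dataset; the plots in this report were produced in Orange.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summary_by_class(feature_values, labels):
    """Group one feature by class label and summarise each group."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(label, []).append(value)
    return {label: five_number_summary(vals) for label, vals in groups.items()}


# Hypothetical word frequencies for ten mails (not taken from the dataset).
word_freq = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.5, 2.0, 2.5, 3.0]
is_spam = ["No"] * 5 + ["Yes"] * 5
box_stats = summary_by_class(word_freq, is_spam)

# The boxes do not overlap when Q3 of one class lies below Q1 of the other.
boxes_separate = box_stats["No"][3] < box_stats["Yes"][1]
print(box_stats, boxes_separate)
```

A variable whose boxes do not overlap separates the two classes well, which is the property the box plots are used to judge.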
Model creation and evaluation for proposing the best model to predict
whether a mail is spam or not
CRISP-DM framework for data analysis and model development
1. Business understanding – Email plays an important part in our lives, and it matters most
to working professionals. The increasing use of email for communication has also attracted
fraudsters who try to compromise personal data. This analysis studies several models and
chooses the best one for detecting spam, so that we can tell which mail is safe and which is
not. The following hypotheses are tested:
H0 – Mail is not spam
H1 – Mail is spam
2. Data understanding – The given dataset has 1 dependent variable and 57 independent
variables, each of which gives the average frequency of a word occurring in an email.
Further statistics are given below:
3. Data preparation –
The given dataset is divided into 2 parts:
a) Training dataset – 70% of the original dataset, used to train and develop our models
b) Validation dataset – the remaining 30% of the original dataset, disjoint from the training
dataset, used to validate our models and choose the best one.
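A 70/30 split of this kind can be sketched as follows. The report's own split was done in R, so this plain-Python version is only an illustration; the function name, the seed and the stand-in records are assumptions.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into two disjoint parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the dataset: 100 record identifiers.
records = list(range(100))
train, validation = train_validation_split(records)
print(len(train), len(validation))
```

Shuffling before cutting avoids a split that follows the row order of the file, and slicing one shuffled list guarantees that no record lands in both parts.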
4. Model development –
The supervised learning models tested on this data are:
• C5 classification
• RPART classification
• CTREE classification
• Binary logistic regression
All of the above models were trained on the training dataset. The complete model
development code was written in RStudio and published on RPubs. The results can be
seen at this link:
https://rpubs.com/krishece11/716251
5. Model evaluation –
To evaluate the models developed on the training dataset, we used the validation dataset to
test the predicted outcomes. Based on the confusion matrix, we used accuracy, NIR,
sensitivity and specificity to select the best model.
Table 1: pivot table from given dataset with actual number of mails as spam or not
Therefore, the No Information Rate (NIR) = 13940/23005 = 60.596%

For a model to be valid, its accuracy must be higher than the NIR.
Table 2: Model evaluation table
Metric           C5       Logistic Regression   RPART    CTREE
Accuracy         0.9957   0.6091                0.9423   0.9341
NIR              60.60% (same for all models)
Positive class   No (same for all models)
Sensitivity      0.9976   1                     0.9465   0.8977
Specificity      0.9926   0                     0.9359   0.9574
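The validity rule can be checked with a short sketch. It recomputes the NIR from the two counts in the formula and compares each accuracy in Table 2 against it. Treating 13940 as the count of non-spam mails is an assumption, since Table 1 is not reproduced here, and the function name is hypothetical.

```python
def no_information_rate(class_counts):
    """Share of the most frequent class: the accuracy of always guessing it."""
    return max(class_counts.values()) / sum(class_counts.values())


# Counts taken from the NIR formula in the text. Reading 13940 as the
# "not spam" count is an assumption, because Table 1 is not shown.
counts = {"No": 13940, "Yes": 23005 - 13940}
nir = no_information_rate(counts)

# Accuracy values copied from Table 2.
accuracy = {"C5": 0.9957, "Logistic Regression": 0.6091, "RPART": 0.9423, "CTREE": 0.9341}
beats_nir = {model: acc > nir for model, acc in accuracy.items()}

print(f"NIR = {nir:.4f}")
for model, ok in beats_nir.items():
    print(f"{model}: accuracy {accuracy[model]} above NIR? {ok}")
```

A model that only just clears the NIR is doing little better than always guessing the majority class, so the size of the margin matters as well as the pass or fail result.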
From the table above, the C5 classification tree has the highest accuracy and specificity on
the validation dataset. Logistic regression reaches a sensitivity of 1 only because it classifies
every mail as not spam (specificity 0), so its accuracy is barely above the NIR. Therefore, C5
is the best model for predicting whether a mail is spam or not.
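The three confusion matrix measures can be sketched as below, with "No" as the positive class as in Table 2. The labels are hypothetical, the function name is an assumption, and the report's own figures came from the R code linked above. The example assumes both classes occur in the actual labels.

```python
def evaluate(actual, predicted, positive="No"):
    """Accuracy, sensitivity and specificity from paired actual/predicted labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive mail predicted as positive
        elif a == positive:
            fn += 1  # positive mail predicted as negative
        elif p == positive:
            fp += 1  # negative mail predicted as positive
        else:
            tn += 1  # negative mail predicted as negative
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn),  # share of positive-class mails found
        "specificity": tn / (tn + fp),  # share of negative-class mails found
    }


# Hypothetical validation labels: 6 non-spam and 4 spam mails.
actual = ["No"] * 6 + ["Yes"] * 4

# A model that answers "No" for every mail.
always_no = evaluate(actual, ["No"] * 10)

# A model that makes one mistake in each class.
mixed = evaluate(actual, ["No"] * 5 + ["Yes"] * 4 + ["No"])

print(always_no)
print(mixed)
```

The first model shows why sensitivity alone is misleading: answering "No" every time gives a sensitivity of 1 and a specificity of 0, with an accuracy equal to the share of non-spam mails.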

Figure 3: Box Plot for Variable – Word12

Figure 4: Box Plot for Variable – Word46

Figure 5: Box Plot for Variable – Word18
