
Report on ML Email-Spam Classification Model

This report summarizes the evaluation results of a Naive Bayes classifier trained with a supervised
approach to classify emails as spam or non-spam. The dataset contains 4601 instances and 10 attributes,
including various word frequencies, character frequencies, and capital-run-length averages. The model
was evaluated on the training data itself and achieved an overall accuracy of 83.09%.

The detailed accuracy by class shows that the model performed substantially better on non-spam
emails (class 0) than on spam emails (class 1). The true positive rate for class 0 was 97.0%, meaning
97.0% of actual non-spam emails were correctly classified as such, while the true positive rate for
class 1 was only 61.7%, meaning only 61.7% of actual spam emails were correctly identified. The false
positive rate was 38.3% for class 0 (spam emails misclassified as non-spam) and 3.0% for class 1
(non-spam emails misclassified as spam). Precision for class 0 was 79.6%, meaning 79.6% of the emails
classified as non-spam actually were non-spam, while precision for class 1 was much higher at 93.0%,
meaning 93.0% of the emails classified as spam actually were spam.

The confusion matrix shows that out of the 2788 non-spam emails, the model correctly classified 2704
(97.0%) as non-spam and misclassified 84 (3.0%) as spam. Out of the 1813 spam emails, the model
correctly classified 1119 (61.7%) as spam and misclassified 694 (38.3%) as non-spam.
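As a sanity check, the per-class rates quoted above can be recomputed directly from these four confusion-matrix counts (a plain-Python sketch, not Weka code; Weka reports the same figures):

```python
# Recompute the per-class rates from the confusion matrix.
# Rows are actual classes, columns are predicted classes.
cm = [[2704, 84],    # actual class 0 (non-spam): predicted 0, predicted 1
      [694, 1119]]   # actual class 1 (spam):     predicted 0, predicted 1

tp_rate_0 = cm[0][0] / (cm[0][0] + cm[0][1])    # recall for non-spam
tp_rate_1 = cm[1][1] / (cm[1][0] + cm[1][1])    # recall for spam
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])  # of emails called non-spam, fraction that were
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])  # of emails called spam, fraction that were
fp_rate_0 = cm[1][0] / (cm[1][0] + cm[1][1])    # spam let through as non-spam
fp_rate_1 = cm[0][1] / (cm[0][0] + cm[0][1])    # non-spam wrongly flagged as spam

print(round(tp_rate_0, 3), round(tp_rate_1, 3))      # 0.97 0.617
print(round(precision_0, 3), round(precision_1, 3))  # 0.796 0.93
```

This reproduces every figure in the paragraph above, including the 38.3% and 3.0% false positive rates.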

The kappa statistic was 0.6238, indicating substantial agreement between the actual and predicted
classifications beyond what chance alone would produce. The mean absolute error was 0.1696 and the
root mean squared error was 0.3984, suggesting the model's predicted probabilities were reasonably
close to the actual values. However, the relative absolute error was high at 35.5209%, and the root
relative squared error was also high at 81.5284%, meaning the model's errors were still a sizeable
fraction of those of a trivial baseline predictor.
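The kappa statistic can be reproduced from the confusion matrix alone, which clarifies what "agreement beyond chance" means here (a plain-Python sketch, not Weka code):

```python
# Verify the kappa statistic (0.6238) from the confusion matrix.
cm = [[2704, 84], [694, 1119]]
n = sum(sum(row) for row in cm)                 # 4601 instances

p_observed = (cm[0][0] + cm[1][1]) / n          # observed accuracy, 0.8309
row = [sum(r) for r in cm]                      # actual-class totals
col = [cm[0][j] + cm[1][j] for j in range(2)]   # predicted-class totals
# expected agreement if predictions were drawn at random with the same marginals
p_chance = sum(row[i] * col[i] for i in range(2)) / n**2

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 4))  # 0.6238
```

A kappa of 0 would mean the classifier does no better than chance-level agreement; 1 would mean perfect agreement.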

In summary, the Naive Bayes classifier achieved an overall accuracy of 83.09% in classifying spam and
non-spam emails, with better performance in classifying non-spam emails than spam emails. The
model's precision in classifying spam emails was higher than in classifying non-spam emails, but the false
positive rate for non-spam emails was much higher than for spam emails. The kappa statistic indicated
substantial agreement between the actual and predicted classifications, but the relative absolute error
and root relative squared error were high, indicating a significant level of error in the model's
predictions.
=== Summary of model analysis ===

Correctly Classified Instances      3823    83.0906 %
Incorrectly Classified Instances     778    16.9094 %
Kappa statistic                     0.6238
Mean absolute error                 0.1696
Root mean squared error             0.3984
Relative absolute error            35.5209 %
Root relative squared error        81.5284 %
Total Number of Instances           4601

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.970    0.383    0.796      0.970   0.874      0.653  0.942     0.958     0
               0.617    0.030    0.930      0.617   0.742      0.653  0.942     0.907     1
Weighted Avg.  0.831    0.244    0.849      0.831   0.822      0.653  0.942     0.938

=== Confusion Matrix ===

    a    b   <-- classified as
 2704   84 |    a = 0
  694 1119 |    b = 1

Steps in creating the model

Load the dataset in Weka

- Click on the "Preprocess" tab and select the "AttributeSelection" filter
- In the filter options, select a feature selection algorithm, such as CfsSubsetEval, and configure the options as needed
- Apply the filter to your dataset and save the new feature-selected dataset for further processing

Information gain and chi-squared feature selection: We can use the information gain and chi-squared
feature selection filters that WEKA provides to identify the most important features in the dataset.
These filters score each feature's relevance to the classification task so that the most informative
features can be selected.

Go to the "Select Attributes" tab

Choose the "InfoGainAttributeEval" or "ChiSquaredAttributeEval" evaluator from the "Evaluator" dropdown menu

Choose the "Ranker" search method from the "Search" dropdown menu

Click the "Start" button to apply the filter

The ranked list of attributes will be shown in the "Selected attributes" pane

Select at least four of the top-ranked attributes and click "Apply" to select them
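To make concrete what InfoGainAttributeEval scores for each attribute, here is a minimal plain-Python sketch of information gain on a made-up binary sample (the data is illustrative, not the actual spambase attributes):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(attr_values, labels):
    """Entropy of the class minus its expected entropy after splitting on the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [c for a, c in zip(attr_values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: the first attribute perfectly predicts the class, the second is noise.
labels = [0, 0, 1, 1]
print(info_gain([0, 0, 1, 1], labels))  # 1.0 (one full bit of information)
print(info_gain([0, 1, 0, 1], labels))  # 0.0 (no information about the class)
```

Ranker simply sorts the attributes by this score, which is why the top of the list holds the most informative features.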

Algorithm selection:

- Open WEKA Explorer and load your dataset
- Click on the "Classify" tab and use the "Choose" button to select the classifier you want to use
- Select a classification algorithm, in this case Naive Bayes
- Configure the options of the classifier as needed and click "OK" to apply the classifier to your dataset
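The idea behind the classifier selected above can be illustrated with a minimal from-scratch sketch. The code below implements a Bernoulli-style Naive Bayes with Laplace smoothing on made-up binary data; Weka's NaiveBayes additionally handles numeric attributes (e.g. via Gaussian estimators), so this is an illustration of the principle, not Weka's exact implementation:

```python
import math

def train(X, y, alpha=1.0):
    """Estimate class priors and smoothed per-class feature probabilities."""
    priors, cond = {}, {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(y)
        # Laplace-smoothed P(feature_j = 1 | class = c)
        cond[c] = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for j in range(len(X[0]))]
    return priors, cond

def predict(priors, cond, x):
    """Pick the class with the highest log-posterior under independence."""
    def log_post(c):
        s = math.log(priors[c])
        for j, v in enumerate(x):
            p = cond[c][j]
            s += math.log(p if v == 1 else 1 - p)
        return s
    return max(priors, key=log_post)

# Toy data: feature 0 indicates spam, feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
priors, cond = train(X, y)
print(predict(priors, cond, [1, 0]))  # 1 (classified as spam)
```

The "naive" assumption is visible in `log_post`: feature likelihoods are simply multiplied (summed in log space) as if the attributes were independent given the class.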
