EMAIL SPAM DETECTION USING LOGISTIC REGRESSION AND SVM
Submitted by:
AMRITA P : Reg No. 65424825001
2025
MSc COMPUTER SCIENCE
CASE STUDY REPORT
Name: .....………………………………………………………….
Reg. No: ……………………… Course: ………………………..
Semester: ……………………………Year: ……………………...
Subject: ……………………………………………………………
Date: ……………………………………………………………….
Teacher in Charge H.O. D External
CONTENTS
1. INTRODUCTION
1.1. ABOUT THE TOPIC
1.2. PROBLEM STATEMENT
1.3. OBJECTIVES
1.4. DATA DESCRIPTION
2. IMPLEMENTATION
2.1. METHODS
2.2. ALGORITHM
2.3. SOURCE CODE
2.4. OUTPUT
3. CONCLUSION
4. BIBLIOGRAPHY
1. INTRODUCTION
In the digital era, the exponential growth of online communication has brought with it the
challenge of handling unsolicited and potentially harmful messages, commonly known as
spam. Machine learning (ML) has become a powerful tool to address this issue by enabling
systems to automatically learn patterns from historical data and make intelligent decisions
without human intervention. In this case study, we explore how machine learning techniques
can be used to detect and classify spam emails, thereby improving the security and efficiency
of email communication.
The problem of spam detection falls under the category of supervised classification, a type of
machine learning where the model learns from labeled data—in this case, emails labeled as
"spam" or "ham" (not spam). While the term “regression” usually refers to predicting
continuous values, logistic regression is a form of regression used for binary classification
problems. It estimates the probability that a given input belongs to a particular category. In
parallel, we also implement Support Vector Machines (SVM), a robust classification algorithm
that constructs optimal decision boundaries (hyperplanes) to separate spam from ham in
high-dimensional space, especially effective in text data with many features.
This case study demonstrates the application of Logistic Regression and SVM for spam email
detection using a public dataset of SMS/email messages. We begin by preprocessing the text
data—removing noise, converting to lowercase, eliminating stopwords—and then
transforming it into numerical features using TF-IDF vectorization. Both models are trained
and evaluated using performance metrics such as accuracy, precision, recall, and F1-score. The
results show that both algorithms are effective, with SVM slightly outperforming Logistic
Regression. This study not only highlights the practical application of ML for cybersecurity
but also compares the strengths of two widely used classification algorithms.
1.1 ABOUT THE TOPIC
With the rise of digital communication, email remains a critical tool for both personal and
professional use. However, this convenience comes with the constant threat of spam emails,
which can include everything from unwanted advertisements to phishing attacks and malware.
As the volume and complexity of spam increase, traditional rule-based filtering systems have
proven insufficient. This has led to the adoption of Machine Learning (ML) techniques to
develop intelligent systems that can automatically identify and filter spam with high accuracy.
Email spam detection is a classic example of a binary classification problem in machine
learning, where messages are classified as either "spam" or "not spam" (ham). This case study
applies two widely used ML algorithms, Logistic Regression and Support Vector Machines
(SVM), to solve the classification task. Logistic Regression is a probabilistic model that
predicts the likelihood of an email being spam based on word patterns. In contrast, SVM is a
geometric model that finds the optimal boundary between the two classes by analyzing word
vectors in high-dimensional space, making it especially effective for sparse data like text.
The project involves cleaning and transforming raw text emails into numerical data using
techniques like stopword removal and TF-IDF vectorization, followed by training both models
and evaluating their performance. The results are measured using metrics such as accuracy,
precision, recall, and F1-score. Through this approach, the case study demonstrates how
machine learning models can significantly improve the efficiency of spam detection, reduce
false positives, and enhance email security. It also provides a comparison of the performance
and suitability of Logistic Regression and SVM in real-world spam detection systems.
1.2 PROBLEM STATEMENT
Email communication has become a fundamental part of personal, academic, and business
activities. However, the widespread use of email has also led to a significant rise in spam
emails, which include unsolicited advertisements, fraudulent messages, phishing scams, and
links to malicious websites. These spam messages not only disrupt user experience and
productivity but also pose serious security risks. Traditional rule-based spam filters rely on
predefined keywords or blacklisted addresses, which are often rigid, outdated, and easy for
spammers to bypass.
To address this challenge, machine learning offers a dynamic and intelligent solution by
learning patterns from previously labeled email data and automatically identifying new spam
messages. The core objective of this problem is to build a system that can efficiently and
accurately classify incoming emails as either spam or ham (not spam). Since this is a binary
classification problem, algorithms such as Logistic Regression and Support Vector Machines
(SVM) are appropriate for modeling the relationship between email content and its
classification.
The problem also requires addressing challenges like text preprocessing, handling imbalanced
datasets, and converting textual data into numerical features using TF-IDF vectorization. The
aim is to evaluate and compare the performance of Logistic Regression and SVM on the same
dataset using metrics such as accuracy, precision, recall, and F1-score. Solving this problem
has real-world impact, as it can improve spam filtering in email systems, protect users from
harmful content, and enhance the overall quality of digital communication.
1.3 OBJECTIVES
● To understand and apply machine learning techniques for binary classification
problems, specifically in the context of detecting spam emails.
● To build and train two classification models—Logistic Regression and Support Vector
Machine (SVM)—on a labeled dataset of emails or SMS messages.
● To preprocess and clean the email text data, including removing punctuation,
stopwords, numbers, and converting text to lowercase.
● To convert raw email content into numerical features using TF-IDF (Term
Frequency–Inverse Document Frequency) vectorization for effective model training.
● To evaluate and compare the performance of Logistic Regression and SVM models
using accuracy, precision, recall, F1-score, and confusion matrix.
● To identify the most suitable algorithm for spam detection based on model
performance and dataset characteristics.
● To demonstrate the real-world applicability of spam detection systems in enhancing
communication security and user experience.
● To gain practical experience in applying supervised learning techniques to real-world
text classification problems using Python and machine learning libraries.
1.4 DATA DESCRIPTION
The dataset used in this case study is the SMS Spam Collection Dataset, a widely used
benchmark dataset for spam detection tasks. It contains 5,572 real-world text messages, each
labeled as either “ham” (legitimate) or “spam” (unwanted or malicious). The dataset is
publicly available from sources such as the UCI Machine Learning Repository and Kaggle.
Column Name | Description
v1          | The label of the message: “ham” (non-spam) or “spam” (unwanted email)
v2          | The actual text content of the message (SMS or email body)
For ease of use, the columns are renamed during preprocessing:
● v1 → label
● v2 → text
The dataset is imbalanced, with approximately 87% ham messages and 13% spam messages,
which reflects real-world scenarios where spam is a minority class. During analysis, the label
column is also converted into numeric form:
● ham → 0
● spam → 1
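To illustrate why this imbalance matters, the following minimal sketch encodes labels as described above and computes the accuracy of a trivial baseline that always predicts ham. The label list is made up to match the approximate 87/13 split and is not taken from the dataset itself.

```python
from collections import Counter

# Toy labels for illustration only (not drawn from the real dataset).
labels = ["ham"] * 87 + ["spam"] * 13

# Encode labels numerically, as described above: ham -> 0, spam -> 1.
label_map = {"ham": 0, "spam": 1}
y = [label_map[label] for label in labels]

# Class distribution.
counts = Counter(y)
spam_share = counts[1] / len(y)

# A classifier that always predicts ham (0) is right on every ham message.
baseline_accuracy = counts[0] / len(y)

print(f"spam share: {spam_share:.2f}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.2f}")
```

Because such a baseline already reaches about 87% accuracy on data with this class balance, precision and recall on the spam class are more informative than accuracy alone.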
To prepare the text data for machine learning models, the following preprocessing steps are
applied:
● Conversion to lowercase
● Removal of punctuation, numbers, and stopwords
● Text vectorization using TF-IDF, transforming raw text into meaningful numeric
features
The cleaned and vectorized data is then used to train and test both Logistic Regression and
SVM classifiers for effective spam detection.
2. IMPLEMENTATION
The implementation of the email spam detection system involves a complete machine learning
pipeline, from data loading to model evaluation. The project is implemented using Python in a
Jupyter Notebook environment with libraries such as pandas, nltk, scikit-learn, matplotlib, and
seaborn.
1. Importing Libraries
Essential Python libraries are imported for:
● Data handling: pandas, numpy
● Text preprocessing: nltk
● Machine learning: sklearn
● Visualization: matplotlib, seaborn
2. Loading and Preparing the Dataset
● The SMS Spam Collection Dataset is loaded using pandas.
● The columns are renamed for clarity (v1 to label, v2 to text).
● Labels are mapped: "ham" → 0 and "spam" → 1.
3. Text Preprocessing
The text messages are cleaned using the following steps:
● Convert to lowercase
● Remove numbers, punctuation, and extra spaces
● Remove stopwords using NLTK
● Apply stemming or lemmatization if needed
A new column clean_text is added to hold the processed version of each message.
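The cleaning steps can be sketched on a single message as follows. This is a simplified illustration: the small STOPWORDS set is a hypothetical stand-in for NLTK's English stopword list, and the sample message is made up.

```python
import re
import string

# Small hypothetical stopword list; the case study uses NLTK's English list.
STOPWORDS = {"a", "an", "the", "is", "to", "you", "your", "now", "for"}

def clean_text(text: str) -> str:
    """Lowercase, drop digits and punctuation, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() also collapses any extra spaces left by the removals above.
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_text("WINNER!! You have won a 1000 cash prize. Call now!"))
```

With this toy stopword list, the sample message is reduced to "winner have won cash prize call".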
4. Feature Extraction (TF-IDF)
● The cleaned text is transformed into numerical vectors using TF-IDF (Term
Frequency–Inverse Document Frequency).
● This step converts words into weighted numerical features based on their
importance in the message and across the dataset.
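To make the weighting concrete, here is a minimal hand-rolled TF-IDF sketch. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is the default behaviour of scikit-learn's TfidfVectorizer; the three-message corpus is made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned messages (illustrative only).
docs = ["free prize call", "call me later", "free free entry"]

n_docs = len(docs)
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: number of messages containing each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency (assumed variant, see lead-in).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one message."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first message, "prize" receives a higher weight than "call" or "free" because it appears in fewer messages across the corpus.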
5. Train-Test Split
● The dataset is split into training and testing sets using train_test_split() from
scikit-learn.
● Typically, 80% of the data is used for training and 20% for testing.
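The idea behind the split can be sketched with the standard library alone. The helper split_train_test below is hypothetical and only a simplified stand-in for scikit-learn's train_test_split(), not its implementation.

```python
import random

def split_train_test(items, test_size=0.2, seed=42):
    """Shuffle a copy of items and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

messages = [f"message {i}" for i in range(10)]
train, test = split_train_test(messages)
print(len(train), len(test))
```

Fixing the seed plays the same role as the random_state argument: repeated runs produce the same partition, so results stay comparable.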
6. Model Training
Two classification models are trained:
● Logistic Regression: A statistical method that outputs a probability score for each
message being spam.
● Support Vector Machine (SVM): A robust classifier that finds the optimal
hyperplane to separate spam from ham, especially effective in high-dimensional
feature space.
7. Prediction and Evaluation
● Both models make predictions on the test data.
● Performance is evaluated using:
● Accuracy
● Precision
● Recall
● F1-score
● Confusion Matrix
● Visualizations (confusion matrices, bar charts) are created to clearly show model
performance.
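The relationship between these metrics and the confusion matrix can be shown with a short sketch. The function name classification_metrics and the toy labels are illustrative; spam (1) is treated as the positive class, and the matrix layout (rows = true class, columns = predicted class) follows the convention used by scikit-learn's confusion_matrix.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the spam class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam caught
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # ham passed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham flagged as spam
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1,
            "confusion_matrix": [[tn, fp], [fn, tp]]}

# Toy labels (illustrative): 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here one ham message is wrongly flagged and one spam message is missed, so accuracy is 0.8 while precision and recall for spam are both 0.75.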
8. Model Comparison
● Results from both models are compared to determine which performs better.
● In this case study, SVM performs slightly better than Logistic Regression on the
imbalanced, high-dimensional TF-IDF data.
Flow Chart
2.1 METHODS
This case study employs a structured machine learning pipeline to detect spam messages
using two classification algorithms—Logistic Regression and Support Vector Machines
(SVM). The methodology includes data collection, preprocessing, feature extraction, model
training, and evaluation.
1. Data Collection
● Used the SMS Spam Collection Dataset (5,572 messages labeled as
spam or ham)
● Imported using pandas
2. Data Preprocessing
● Renamed columns for clarity (v1 → label, v2 → text)
● Converted labels: ham → 0, spam → 1
● Applied text cleaning:
● Converted to lowercase
● Removed punctuation, numbers, and extra spaces
● Removed stopwords using NLTK
● (Optional) Applied stemming or lemmatization
3. Feature Extraction
● Used TF-IDF Vectorizer to convert cleaned text into numerical vectors
● Captures importance of words based on their frequency and rarity
4. Train-Test Split
● Split dataset into 80% training and 20% testing
● Used train_test_split() from scikit-learn
5. Model Training
● Trained Logistic Regression model for binary classification
● Trained SVM (Support Vector Machine) with linear kernel for
high-dimensional data
6. Model Evaluation
● Predicted labels on test set using both models
● Evaluated performance using:
● Accuracy
● Precision
● Recall
● F1-score
● Confusion Matrix
7. Comparison & Visualization
● Compared results of both models
● Visualized performance using bar plots and heatmaps
● Observed SVM performing slightly better in precision and recall
Logistic Regression (LR)
● Used for Binary Classification: Predicts whether an email is spam (1) or ham (0).
● Learns relationships between TF-IDF word features and the target label.
● Computes a weighted sum of input features:
$z = \theta^{T} x + b$
● Applies the sigmoid function to output a probability:
$P(\text{spam}) = \dfrac{1}{1 + e^{-z}}$
● If probability > 0.5 → Predicts spam, else ham.
● Trains using gradient descent to minimize log loss (cross-entropy).
● Best for interpretable models and fast training on large datasets.
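A minimal sketch of the prediction step described above is given below. The weights theta, the bias b, and the two feature vectors are hypothetical values chosen for illustration, not parameters learned in this study.

```python
import math

def predict_spam_probability(x, theta, b):
    """Return P(spam) = sigmoid(theta^T x + b) for one feature vector."""
    z = sum(t * xi for t, xi in zip(theta, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for three TF-IDF features: "free", "prize", "meeting".
theta = [2.0, 1.5, -3.0]
b = -0.5

x_spammy = [0.6, 0.8, 0.0]   # weight on "free" and "prize"
x_hammy = [0.0, 0.0, 1.0]    # weight on "meeting" only

for name, x in [("spammy", x_spammy), ("hammy", x_hammy)]:
    p = predict_spam_probability(x, theta, b)
    label = "spam" if p > 0.5 else "ham"
    print(f"{name}: P(spam) = {p:.3f} -> {label}")
```

Positive weights push a message toward spam and negative weights toward ham, which is what makes the fitted coefficients easy to interpret.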
Support Vector Machine (SVM)
● Also used for binary classification: separates spam vs ham.
● Works by finding the optimal hyperplane that maximizes the margin between classes.
● Uses a linear kernel because TF-IDF vectors are high-dimensional and sparse.
● Classifies using the decision function:
$f(x) = w^{T} x + b$
● Decision rule:
● If 𝑓(𝑥) ≥ 0 → spam
● If 𝑓(𝑥) < 0 → ham
● Works well with sparse, high-dimensional data like emails.
● Tends to resist overfitting and to generalize well in text classification.
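The decision rule can be sketched in a few lines. The hyperplane parameters w and b and the two-feature inputs are hypothetical values for illustration only.

```python
def decision_function(x, w, b):
    """Signed score f(x) = w^T x + b; its sign gives the predicted side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Apply the decision rule: f(x) >= 0 -> spam, otherwise ham."""
    return "spam" if decision_function(x, w, b) >= 0 else "ham"

# Hypothetical hyperplane over two TF-IDF features.
w = [1.0, -1.0]
b = -0.2

print(classify([0.9, 0.1], w, b))  # f = 0.9 - 0.1 - 0.2 = 0.6
print(classify([0.1, 0.7], w, b))  # f = 0.1 - 0.7 - 0.2 = -0.8
```

Unlike the logistic regression output, the score f(x) is not a probability; only its sign and its distance from zero carry meaning.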
2.2 ALGORITHM
The following steps outline the algorithm used to detect whether an email is spam or not spam
(ham) using Logistic Regression and Support Vector Machine (SVM):
Step-by-Step Algorithm
Step 1: Load the Dataset
● Import the dataset using pandas.
● Rename columns (v1 → label, v2 → text).
Step 2: Preprocess the Text Data
● Convert text to lowercase.
● Remove punctuation, numbers, and special characters.
● Remove stopwords using nltk.
● (Optional) Apply stemming or lemmatization.
Step 3: Encode the Labels
● Convert:
● "ham" → 0
● "spam" → 1
Step 4: Text Vectorization
● Apply TF-IDF Vectorizer to convert cleaned text into numeric feature vectors.
Step 5: Split the Dataset
● Use train_test_split() to divide data:
● Training Set (80%)
● Testing Set (20%)
Step 6: Train Models
● Logistic Regression Model:
● Train on training data.
● SVM Model:
● Train with a linear kernel.
Step 7: Make Predictions
● Predict spam/ham labels on the test data using both models.
Step 8: Evaluate Model Performance
● Use metrics:
● Accuracy
● Precision
● Recall
● F1-Score
● Confusion Matrix
Step 9: Compare the Models
● Compare Logistic Regression and SVM performance
● Identify the better model for spam detection.
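Steps 4 to 7 can also be chained into a single object. The sketch below is one possible arrangement, not the listing used in this report: it assumes scikit-learn is installed, uses make_pipeline so that the TF-IDF vocabulary is learned from the training text only, and runs on a tiny made-up corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus (illustrative only): 1 = spam, 0 = ham.
texts = [
    "win free cash prize now", "free entry claim your prize",
    "urgent call to claim free reward", "cash prize winner call now",
    "are we meeting for lunch today", "see you at home tonight",
    "can you send me the notes", "thanks for the lunch yesterday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One pipeline per model: vectorizer followed by classifier.
models = {
    "Logistic Regression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "SVM": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
}

new_messages = ["claim your free cash prize", "lunch at home today"]
predictions = {}
for name, model in models.items():
    model.fit(texts, labels)  # the vectorizer is fitted on training text only
    predictions[name] = model.predict(new_messages).tolist()
    print(name, predictions[name])
```

Keeping the vectorizer inside the pipeline means new messages are transformed with the vocabulary learned during training, which avoids leaking information from test data into the features.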
1. Logistic Regression Algorithm (for Spam Detection)
Logistic Regression is used to model the probability that a given message is spam (1) or
ham (0). It fits a sigmoid curve to the input features and makes decisions based on the
probability threshold (usually 0.5).
Steps:
1. Input: Preprocessed email text converted into TF-IDF vectors.
2. Initialize weights (θ) and bias term.
3. Compute weighted sum:
$z = \theta^{T} x + b$
4. Apply Sigmoid function:
$P(y = 1 \mid x) = \dfrac{1}{1 + e^{-z}}$
5. Classify message:
○ If P > 0.5, classify as spam (1)
○ Else, classify as ham (0)
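6. Use cross-entropy loss and gradient descent to update weights:
$J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
where $\hat{y}_i = P(y = 1 \mid x_i)$ is the predicted probability for message i and m is the
number of training messages.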
7. Train until convergence, then test on unseen data.
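The training loop in steps 2 to 7 can be sketched in plain Python. This is a simplified batch gradient descent on toy data; the learning rate, epoch count, and feature values are arbitrary illustrative choices, and library implementations typically use more sophisticated solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(X, y, theta, b):
    """Mean cross-entropy J(theta) over the training set."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)) + b)
        total += yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return -total / len(y)

def train(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the log loss."""
    theta, b = [0.0] * len(X[0]), 0.0
    m = len(y)
    for _ in range(epochs):
        grad_theta, grad_b = [0.0] * len(theta), 0.0
        for xi, yi in zip(X, y):
            # Gradient of the loss w.r.t. z is (prediction - label).
            error = sigmoid(sum(t * v for t, v in zip(theta, xi)) + b) - yi
            grad_theta = [g + error * v for g, v in zip(grad_theta, xi)]
            grad_b += error
        theta = [t - lr * g / m for t, g in zip(theta, grad_theta)]
        b -= lr * grad_b / m
    return theta, b

# Toy features (illustrative): [weight of "free", weight of "meeting"].
X = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.7]]
y = [1, 1, 0, 0]

loss_before = log_loss(X, y, [0.0, 0.0], 0.0)
theta, b = train(X, y)
loss_after = log_loss(X, y, theta, b)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

With all weights at zero every prediction is 0.5, so the initial loss equals ln 2; training lowers it by giving the "free" feature a positive weight and the "meeting" feature a negative one.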
2. SVM Algorithm (for Spam Detection)
Support Vector Machine (SVM) attempts to find the optimal hyperplane that best
separates spam and ham messages in high-dimensional space created by the TF-IDF
features.
Steps:
1. Input: TF-IDF feature vectors of emails.
2. Transform problem into optimization:
● Maximize the margin between spam and ham classes.
3. Compute decision function:
𝑇
𝑓(𝑥) = 𝑤 𝑥 + 𝑏
4. Classify message:
● If f(x) ≥ 0, classify as spam (1)
● Else, classify as ham (0)
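5. Optimization objective:
$\min_{w, b} \ \dfrac{1}{2} \lVert w \rVert^{2} \quad \text{subject to} \quad y_i (w^{T} x_i + b) \ge 1 \ \text{for all } i$
where, for this formulation, the labels are encoded as $y_i \in \{-1, +1\}$ (spam = +1,
ham = -1).
6. Uses a linear kernel, which applies no nonlinear feature mapping; this suits
high-dimensional sparse text.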
7. Trained model finds the best boundary to classify new emails.
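The margin constraint can be checked numerically for a candidate hyperplane. The sketch below uses made-up two-dimensional points and a hypothetical hyperplane; note that the constraint is written for labels in {-1, +1} rather than the 0/1 encoding used elsewhere in this report.

```python
import math

def margins(X, y, w, b):
    """Functional margins y_i * (w^T x_i + b), with labels in {-1, +1}."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)]

# Toy 2-D points (illustrative): +1 = spam, -1 = ham.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]

# Hypothetical candidate hyperplane x1 = 1, i.e. w = (1, 0), b = -1.
w, b = [1.0, 0.0], -1.0

m = margins(X, y, w, b)
feasible = all(v >= 1 for v in m)                      # every constraint holds
margin_width = 2 / math.sqrt(sum(wj * wj for wj in w))  # 2 / ||w||

print("functional margins:", m)
print("satisfies constraints:", feasible)
print("geometric margin width:", margin_width)
```

Points whose functional margin equals exactly 1 lie on the margin boundary and act as support vectors; minimizing the norm of w while keeping every margin at least 1 is what widens the gap between the two classes.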
2.3 SOURCE CODE
# Step 1: Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords_set = set(stopwords.words('english'))
# Step 2: Load the Dataset
# Dataset: [Link]
df = pd.read_csv("[Link]", encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']
# Convert labels to binary
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()
# Step 3: Data Cleaning and Preprocessing
def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stopwords_set])
    return text
df['clean_text'] = df['text'].apply(clean_text)
df[['text', 'clean_text']].head()
# Step 4: Feature Extraction with TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['clean_text']).toarray()
y = df['label_num'].values
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Logistic Regression Model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_test)
print(" Logistic Regression Evaluation:")
print(classification_report(y_test, lr_preds))
print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, lr_preds), annot=True, fmt='d', cmap='Blues')
plt.title("Logistic Regression Confusion Matrix")
plt.show()
# Step 6: Support Vector Machine (SVM) Model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
svm_preds = svm_model.predict(X_test)
print(" SVM Evaluation:")
print(classification_report(y_test, svm_preds))
print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, svm_preds), annot=True, fmt='d', cmap='Greens')
plt.title("SVM Confusion Matrix")
plt.show()
# Step 7: Accuracy Comparison
lr_acc = accuracy_score(y_test, lr_preds)
svm_acc = accuracy_score(y_test, svm_preds)
print(f"Logistic Regression Accuracy: {lr_acc:.4f}")
print(f"SVM Accuracy: {svm_acc:.4f}")
models = ['Logistic Regression', 'SVM']
scores = [lr_acc, svm_acc]
plt.bar(models, scores, color=['blue', 'green'])
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.ylim(0.9, 1.0)
plt.show()
2.4 OUTPUT
Loading the dataset “[Link]” from Kaggle
A preview of the dataset (first 5 rows):
Cleaned Dataset
The text is cleaned by removing extra spaces, punctuation, and symbols, converting it to
lowercase, and removing stopwords.
After cleaning:
Training the Model
Logistic Regression Model
Support Vector Machine (SVM) Model
Comparison of Logistic Regression and SVM Models
The bar chart shows that both models are highly accurate, with SVM slightly
outperforming Logistic Regression.
Final Conclusion:
● SVM gives slightly better performance, especially in spam classification
(precision/recall).
● TF-IDF with either classifier works well for spam filtering.
3. CONCLUSION
In this case study, we successfully implemented and evaluated two machine learning
algorithms—Logistic Regression and Support Vector Machine (SVM)—for the task of email
spam detection. The objective was to classify messages as either spam or ham using textual
features derived from the content of the emails. By applying text preprocessing techniques and
using TF-IDF vectorization, we transformed raw messages into meaningful numerical data that
could be fed into classification models.
The models were trained and tested using a labeled dataset of real SMS/email messages. Both
classifiers achieved high accuracy, precision, and recall, demonstrating the effectiveness of
machine learning in solving real-world classification problems. While Logistic Regression
provided quick and interpretable results, the SVM model showed slightly better performance
in handling the high-dimensional and sparse nature of text data. The comparative analysis
highlighted that both models can be effectively used in spam detection, with SVM offering a
slight edge in precision and recall.
Overall, this study demonstrates the power and practicality of supervised learning methods in
the domain of cybersecurity and information filtering. It not only reinforces the importance of
preprocessing and feature engineering in text classification but also showcases how combining
traditional ML techniques with proper evaluation metrics can lead to reliable, scalable spam
detection systems. Future improvements could include the use of ensemble methods or deep
learning models like LSTM or BERT for even higher accuracy and adaptability.
4. BIBLIOGRAPHY
1. Mildenberger, T. (2015). Simon Rogers and Mark Girolami: A first course in machine
learning: CRC Press, Boca Raton, 2012, xx+ 285 pp., US $69.95, GB 36.99,€ 51.80,
ISBN 978-143982414-6.
2. Lantz, B. (2015). Machine learning with R (Vol. 452). Birmingham: Packt Publishing.
3. Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning
(Vol. 4, No. 4, p. 738). New York: Springer.
4. Jurafsky, D., & Martin, J. H. Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics, and Speech Recognition.
5. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.
6. Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python:
analyzing text with the natural language toolkit. O'Reilly Media, Inc.
7. Sarker, S. K., Bhattacharjee, R., Sufian, M. A., Ahamed, M. S., Talha, M. A., Tasnim,
F., ... & Adrita, S. T. (2025, February). Email Spam Detection Using Logistic
Regression and Explainable AI. In 2025 International Conference on Electrical,
Computer and Communication Engineering (ECCE) (pp. 1-6). IEEE.
8. Olatunji, S. O. (2019). Improved email spam detection model based on support vector
machines. Neural Computing and Applications, 31(3), 691-699.
9. Amayri, O., & Bouguila, N. (2010). A study of spam filtering using support vector
machines. Artificial Intelligence Review, 34(1), 73-108.
10. Fatima, R., Fareed, M. M. S., Ullah, S., Ahmad, G., & Mahmood, S. (2024). An
optimized approach for detection and classification of spam email’s using ensemble
methods. Wireless Personal Communications, 139(1), 347-373.