Plagiarism Case Study

Uploaded by

Amrita P

EMAIL SPAM DETECTION USING LOGISTIC REGRESSION AND SVM

Submitted by:

AMRITA P : Reg No. 65424825001

2025
MSc COMPUTER SCIENCE
CASE STUDY REPORT

Name: .....………………………………………………………….

Reg. No: ……………………… Course: ………………………..

Semester: ……………………………Year: ……………………...

Subject: ……………………………………………………………

Date: ……………………………………………………………….

Teacher in Charge H.O. D External

CONTENTS

1. INTRODUCTION

1.1. ABOUT THE TOPIC

1.2. PROBLEM STATEMENT

1.3. OBJECTIVES

1.4. DATA DESCRIPTION

2. IMPLEMENTATION

2.1. METHODS

2.2. ALGORITHM

2.3. SOURCE CODE

2.4. OUTPUT

3. CONCLUSION

4. BIBLIOGRAPHY

1. INTRODUCTION

In the digital era, the exponential growth of online communication has brought with it the
challenge of handling unsolicited and potentially harmful messages, commonly known as
spam. Machine learning (ML) has become a powerful tool to address this issue by enabling
systems to automatically learn patterns from historical data and make intelligent decisions
without human intervention. In this case study, we explore how machine learning techniques
can be used to detect and classify spam emails, thereby improving the security and efficiency
of email communication.

The problem of spam detection falls under the category of supervised classification, a type of
machine learning where the model learns from labeled data, in this case messages labeled as
"spam" or "ham" (not spam). While the term “regression” usually refers to predicting
continuous values, logistic regression is a form of regression used for binary classification
problems: it estimates the probability that a given input belongs to a particular category. We
also implement Support Vector Machines (SVM), a classification algorithm that constructs a
maximum-margin decision boundary (hyperplane) to separate spam from ham. SVMs are
especially effective on text data, where the feature space is high-dimensional.

This case study demonstrates the application of Logistic Regression and SVM for spam email
detection using a public dataset of SMS/email messages. We begin by preprocessing the text
data (removing noise, converting to lowercase, eliminating stopwords) and then transform it
into numerical features using TF-IDF vectorization. Both models are trained and evaluated
using performance metrics such as accuracy, precision, recall, and F1-score. The results show
that both algorithms are effective, with SVM slightly outperforming Logistic Regression. This
study not only highlights the practical application of ML for cybersecurity but also compares
the strengths of two widely used classification algorithms.
1.1 ABOUT THE TOPIC

With the rise of digital communication, email remains a critical tool for both personal and
professional use. However, this convenience comes with the constant threat of spam emails,
which can include everything from unwanted advertisements to phishing attacks and malware.
As the volume and complexity of spam increase, traditional rule-based filtering systems have
proven insufficient. This has led to the adoption of Machine Learning (ML) techniques to
develop intelligent systems that can automatically identify and filter spam with high accuracy.

Email spam detection is a classic example of a binary classification problem in machine
learning, where messages are classified as either "spam" or "not spam" (ham). This case study
applies two widely used ML algorithms, Logistic Regression and Support Vector Machines
(SVM), to solve the classification task. Logistic Regression is a probabilistic model that
predicts the likelihood of an email being spam based on word patterns. In contrast, SVM is a
geometric model that finds the maximum-margin boundary between the two classes in the
high-dimensional space of word vectors, which makes it well suited to sparse data like text.

The project involves cleaning and transforming raw text emails into numerical data using
techniques like stopword removal and TF-IDF vectorization, followed by training both models
and evaluating their performance. The results are measured using metrics such as accuracy,
precision, recall, and F1-score. Through this approach, the case study demonstrates how
machine learning models can significantly improve the efficiency of spam detection, reduce
false positives, and enhance email security. It also provides a comparison of the performance
and suitability of Logistic Regression and SVM in real-world spam detection systems.
1.2 PROBLEM STATEMENT

Email communication has become a fundamental part of personal, academic, and business
activities. However, the widespread use of email has also led to a significant rise in spam
emails, which include unsolicited advertisements, fraudulent messages, phishing scams, and
links to malicious websites. These spam messages not only disrupt user experience and
productivity but also pose serious security risks. Traditional rule-based spam filters rely on
predefined keywords or blacklisted addresses, which are often rigid, outdated, and easy for
spammers to bypass.

To address this challenge, machine learning offers a dynamic and intelligent solution by
learning patterns from previously labeled email data and automatically identifying new spam
messages. The core objective of this problem is to build a system that can efficiently and
accurately classify incoming emails as either spam or ham (not spam). Since this is a binary
classification problem, algorithms such as Logistic Regression and Support Vector Machines
(SVM) are appropriate for modeling the relationship between email content and its
classification.

The problem also requires addressing challenges like text preprocessing, handling imbalanced
datasets, and converting textual data into numerical features using TF-IDF vectorization. The
aim is to evaluate and compare the performance of Logistic Regression and SVM on the same
dataset using metrics such as accuracy, precision, recall, and F1-score. Solving this problem
has real-world impact, as it can improve spam filtering in email systems, protect users from
harmful content, and enhance the overall quality of digital communication.
1.3 OBJECTIVES

● To understand and apply machine learning techniques for binary classification
problems, specifically in the context of detecting spam emails.
● To build and train two classification models—Logistic Regression and Support Vector
Machine (SVM)—on a labeled dataset of emails or SMS messages.
● To preprocess and clean the email text data, including removing punctuation,
stopwords, numbers, and converting text to lowercase.
● To convert raw email content into numerical features using TF-IDF (Term
Frequency–Inverse Document Frequency) vectorization for effective model training.
● To evaluate and compare the performance of Logistic Regression and SVM models
using accuracy, precision, recall, F1-score, and confusion matrix.
● To identify the most suitable algorithm for spam detection based on model
performance and dataset characteristics.
● To demonstrate the real-world applicability of spam detection systems in enhancing
communication security and user experience.
● To gain practical experience in applying supervised learning techniques to real-world
text classification problems using Python and machine learning libraries.
1.4 DATA DESCRIPTION

The dataset used in this case study is the SMS Spam Collection Dataset, a widely-used
benchmark dataset for spam detection tasks. It contains 5,572 real-world text messages, each
labeled as either “ham” (legitimate) or “spam” (unwanted or malicious). The dataset is
publicly available from sources like UCI Machine Learning Repository and Kaggle.

Column Name    Description
v1             The label of the message: “ham” (non-spam) or “spam” (unwanted message)
v2             The actual text content of the message (SMS or email body)

For ease of use, the columns are renamed during preprocessing:

● v1 → label
● v2 → text

The dataset is imbalanced, with approximately 87% ham messages and 13% spam messages,
which reflects real-world scenarios where spam is a minority class. During analysis, the label
column is also converted into numeric form:

● ham → 0
● spam → 1

To prepare the text data for machine learning models, the following preprocessing steps are
applied:

● Conversion to lowercase
● Removal of punctuation, numbers, and stopwords
● Text vectorization using TF-IDF, transforming raw text into meaningful numeric
features

The cleaned and vectorized data is then used to train and test both Logistic Regression and
SVM classifiers for effective spam detection.
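Because roughly 87% of the messages are ham, accuracy on its own can look strong even for a model that never detects spam. The short sketch below works through that arithmetic. It assumes the approximate 87% / 13% split quoted above, so the counts are rounded illustrations rather than exact dataset statistics.

```python
# Illustrative arithmetic only: the counts are derived from the approximate
# 87% / 13% split quoted in the text, not read from the dataset itself.
total_messages = 5572
spam_share = 0.13

approx_spam = round(total_messages * spam_share)
approx_ham = total_messages - approx_spam

# A trivial "classifier" that labels every message as ham (0).
baseline_accuracy = approx_ham / total_messages
baseline_spam_recall = 0 / approx_spam  # it never catches any spam

print(f"approx. ham: {approx_ham}, approx. spam: {approx_spam}")
print(f"always-ham accuracy: {baseline_accuracy:.3f}")
print(f"always-ham spam recall: {baseline_spam_recall:.3f}")
```

This is one reason the study reports precision, recall, and F1-score alongside accuracy.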
2. IMPLEMENTATION

The implementation of the email spam detection system involves a complete machine learning
pipeline, from data loading to model evaluation. The project is implemented using Python in a
Jupyter Notebook environment with libraries such as pandas, nltk, scikit-learn, matplotlib, and
seaborn.

1. Importing Libraries

Essential Python libraries are imported for:

● Data handling: pandas, numpy

● Text preprocessing: nltk
● Machine learning: sklearn
● Visualization: matplotlib, seaborn

2. Loading and Preparing the Dataset

● The SMS Spam Collection Dataset is loaded using pandas.

● The columns are renamed for clarity (v1 to label, v2 to text).
● Labels are mapped: "ham" → 0 and "spam" → 1.

3. Text Preprocessing

The text messages are cleaned using the following steps:

● Convert to lowercase
● Remove numbers, punctuation, and extra spaces
● Remove stopwords using the NLTK stopwords corpus
● Apply stemming or lemmatization if needed

A new column clean_text is added to hold the processed version of each message.

4. Feature Extraction (TF-IDF)

● The cleaned text is transformed into numerical vectors using TF-IDF (Term
Frequency–Inverse Document Frequency).
● This step converts words into weighted numerical features based on their
importance in the message and across the dataset.
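To make the weighting concrete, here is a minimal hand-computed sketch of the textbook TF-IDF formula on a made-up three-message corpus. The corpus and helper names are hypothetical. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its numbers will differ from this simplified version.

```python
import math
from collections import Counter

# Hypothetical three-message corpus, already cleaned and lowercased.
docs = [
    "free prize call now",
    "call me later",
    "free entry win prize now",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Textbook TF-IDF: (count / message length) * ln(N / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = tfidf(tokenized[1])  # weights for "call me later"
for word, weight in sorted(weights.items()):
    print(f"{word:>6}: {weight:.3f}")
```

Words that occur in only one message ("me", "later") receive a higher weight than "call", which appears in two of the three messages.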

5. Train-Test Split

● The dataset is split into training and testing sets using train_test_split() from
scikit-learn.
● Typically, 80% of the data is used for training and 20% for testing.

6. Model Training

Two classification models are trained:

● Logistic Regression: A statistical method that outputs a probability score for each
message being spam.
● Support Vector Machine (SVM): A robust classifier that finds the optimal
hyperplane to separate spam from ham, especially effective in high-dimensional
feature space.

7. Prediction and Evaluation

● Both models make predictions on the test data.

● Performance is evaluated using:
   ○ Accuracy
   ○ Precision
   ○ Recall
   ○ F1-score
   ○ Confusion Matrix

● Visualizations (confusion matrices, bar charts) are created to clearly show model
performance.
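The metrics listed above all follow from the four cells of the confusion matrix. The sketch below computes them for the spam class from hypothetical counts; the numbers are made up for illustration and are not results from this study.

```python
# Hypothetical confusion-matrix counts for the spam class (label 1);
# these are made-up numbers, not results from this study.
tp, fp, fn, tn = 130, 5, 20, 960

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1-score:  {f1:.3f}")
```

In this made-up case accuracy is high while recall is noticeably lower, which is the kind of gap that accuracy alone would hide on an imbalanced dataset.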

8. Model Comparison

● Results from both models are compared to determine which performs better.
● In this study, SVM performs slightly better; linear SVMs are often a strong choice
for high-dimensional text vectors.

[Figure: Flow Chart of the implementation steps (image not reproduced)]

2.1 METHODS

This case study employs a structured machine learning pipeline to detect spam messages
using two classification algorithms—Logistic Regression and Support Vector Machines
(SVM). The methodology includes data collection, preprocessing, feature extraction, model
training, and evaluation.

1. Data Collection

● Used the SMS Spam Collection Dataset (5,572 messages labeled as
spam or ham)
● Imported using pandas

2. Data Preprocessing

● Renamed columns for clarity (v1 → label, v2 → text)

● Converted labels: ham → 0, spam → 1
● Applied text cleaning:
   ○ Converted to lowercase
   ○ Removed punctuation, numbers, and extra spaces
   ○ Removed stopwords using NLTK
   ○ (Optional) Applied stemming or lemmatization

3. Feature Extraction

● Used TF-IDF Vectorizer to convert cleaned text into numerical vectors

● Captures importance of words based on their frequency and rarity

4. Train-Test Split

● Split dataset into 80% training and 20% testing

● Used train_test_split() from scikit-learn
5. Model Training

● Trained Logistic Regression model for binary classification

● Trained SVM (Support Vector Machine) with linear kernel for
high-dimensional data

6. Model Evaluation

● Predicted labels on test set using both models

● Evaluated performance using:
   ○ Accuracy
   ○ Precision
   ○ Recall
   ○ F1-score
   ○ Confusion Matrix

7. Comparison & Visualization

● Compared results of both models

● Visualized performance using bar plots and heatmaps
● Observed SVM performing slightly better in precision and recall

Logistic Regression (LR)

● Used for Binary Classification: Predicts whether an email is spam (1) or ham (0).
● Learns relationships between TF-IDF word features and the target label.
● Computes a weighted sum of input features:
   $z = \theta^{T} x + b$
● Applies the sigmoid function to output a probability:
   $P(\text{spam}) = \dfrac{1}{1 + e^{-z}}$

● If probability > 0.5 → Predicts spam, else ham.

● Trains by minimizing log loss (cross-entropy); gradient descent is the textbook method, while scikit-learn's LogisticRegression uses the L-BFGS solver by default.

● Best for interpretable models and fast training on large datasets.
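The prediction step described above can be sketched in a few lines of plain Python. The weights, bias, and feature values below are hypothetical and chosen only for illustration; a trained model would learn them from data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_spam(theta, x, b, threshold=0.5):
    """Return (probability, label) for one feature vector."""
    z = sum(t * xi for t, xi in zip(theta, x)) + b  # z = theta^T x + b
    p = sigmoid(z)
    return p, int(p > threshold)


# Hypothetical weights for three TF-IDF features, e.g. "free", "call", "meeting".
theta = [2.0, 0.5, -1.5]
b = -0.5

p, label = predict_spam(theta, [0.8, 0.3, 0.0], b)
print(f"P(spam) = {p:.3f}, predicted label = {label}")
```

A positive weight pushes the probability toward spam and a negative weight pushes it toward ham, which is what makes the model easy to interpret.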

Support Vector Machine (SVM)

● Also used for binary classification: separates spam vs ham.

● Works by finding the optimal hyperplane that maximizes the margin between classes.
● Uses a linear kernel because TF-IDF vectors are high-dimensional and sparse.
● Classifies using the decision function:
   $f(x) = w^{T} x + b$
● Decision rule:
   ○ If $f(x) \ge 0$ → spam
   ○ If $f(x) < 0$ → ham

● Works well with sparse, high-dimensional data like emails.

● Margin maximization acts as a form of regularization, which tends to limit overfitting and support good generalization in text classification.
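As a small illustration of the decision rule, the sketch below evaluates the linear decision function for two made-up feature vectors. The weight vector and bias are hypothetical; in practice they come from training.

```python
def decision_function(w, x, b):
    """Linear SVM score f(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Apply the decision rule: f(x) >= 0 -> spam (1), otherwise ham (0)."""
    return 1 if decision_function(w, x, b) >= 0 else 0


# Hypothetical weights; a real model learns these from training data.
w = [1.2, -0.7]
b = -0.1
messages = {"likely spam": [0.9, 0.1], "likely ham": [0.1, 0.8]}

for name, x in messages.items():
    score = decision_function(w, x, b)
    print(f"{name}: f(x) = {score:.2f}, label = {classify(w, x, b)}")
```

The sign of the score gives the class, and its magnitude indicates how far the message lies from the separating hyperplane.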
2.2 ALGORITHM

The following steps outline the algorithm used to detect whether an email is spam or not spam
(ham) using Logistic Regression and Support Vector Machine (SVM):

Step-by-Step Algorithm

Step 1: Load the Dataset

● Import the dataset using pandas.

● Rename columns (v1 → label, v2 → text).

Step 2: Preprocess the Text Data

● Convert text to lowercase.

● Remove punctuation, numbers, and special characters.
● Remove stopwords using nltk.
● (Optional) Apply stemming or lemmatization.

Step 3: Encode the Labels

● Convert:
   ○ "ham" → 0
   ○ "spam" → 1

Step 4: Text Vectorization

● Apply TF-IDF Vectorizer to convert cleaned text into numeric feature vectors.

Step 5: Split the Dataset

● Use train_test_split() to divide data:
   ○ Training Set (80%)
   ○ Testing Set (20%)
Step 6: Train Models

● Logistic Regression Model:
   ○ Train on training data.
● SVM Model:
   ○ Train with a linear kernel.

Step 7: Make Predictions

● Predict spam/ham labels on the test data using both models.

Step 8: Evaluate Model Performance

● Use metrics:
   ○ Accuracy
   ○ Precision
   ○ Recall
   ○ F1-Score
   ○ Confusion Matrix

Step 9: Compare the Models

● Compare Logistic Regression and SVM performance

● Identify the better model for spam detection.

1. Logistic Regression Algorithm (for Spam Detection)

Logistic Regression is used to model the probability that a given message is spam (1) or
ham (0). It fits a sigmoid curve to the input features and makes decisions based on the
probability threshold (usually 0.5).

Steps:

1. Input: Preprocessed email text converted into TF-IDF vectors.

2. Initialize weights (θ) and bias term.
3. Compute weighted sum:
   $z = \theta^{T} x + b$
4. Apply Sigmoid function:
   $P(y = 1 \mid x) = \dfrac{1}{1 + e^{-z}}$

5. Classify message:

○ If P > 0.5, classify as spam (1)
○ Else, classify as ham (0)
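6. Use cross-entropy loss and gradient descent to update weights:

   $J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

   where $\hat{y}_i = P(y = 1 \mid x_i)$ is the predicted probability for message $i$ and $m$ is the number of training messages.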

7. Train until convergence, then test on unseen data.
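The training loop in steps 2 to 7 can be sketched from scratch as batch gradient descent on the cross-entropy loss. This is a teaching sketch on a made-up four-message dataset with two hypothetical features; the learning rate and epoch count are arbitrary choices, and it is not the solver scikit-learn uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(theta, b, x):
    """Weighted sum z = theta^T x + b."""
    return sum(t * v for t, v in zip(theta, x)) + b


def log_loss(theta, b, X, y):
    """Mean cross-entropy loss J(theta) over the training set."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(score(theta, b, xi))
        total += yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return -total / len(y)


def train(X, y, lr=0.5, epochs=500):
    """Batch gradient descent; the gradient of J is the mean of (p - y) * x."""
    theta = [0.0] * len(X[0])
    b = 0.0
    m = len(y)
    for _ in range(epochs):
        grad_theta = [0.0] * len(theta)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(theta, b, xi)) - yi
            for j, v in enumerate(xi):
                grad_theta[j] += err * v
            grad_b += err
        theta = [t - lr * g / m for t, g in zip(theta, grad_theta)]
        b -= lr * grad_b / m
    return theta, b


# Toy data: feature 0 ~ TF-IDF weight of "free", feature 1 ~ weight of "meeting".
X = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.7]]
y = [1, 1, 0, 0]

initial_loss = log_loss([0.0, 0.0], 0.0, X, y)
theta, b = train(X, y)
final_loss = log_loss(theta, b, X, y)
print(f"loss before training: {initial_loss:.4f}, after: {final_loss:.4f}")
print(f"learned weights: {theta}, bias: {b:.3f}")
```

The learned weight for the "free" feature ends up positive and the one for "meeting" negative, matching the intuition that each word pushes the prediction toward spam or ham.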

2. SVM Algorithm (for Spam Detection)

Support Vector Machine (SVM) attempts to find the optimal hyperplane that best
separates spam and ham messages in high-dimensional space created by the TF-IDF
features.

Steps:

1. Input: TF-IDF feature vectors of emails.

2. Transform problem into optimization:
● Maximize the margin between spam and ham classes.
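3. Compute decision function:
   $f(x) = w^{T} x + b$
4. Classify message:
   ● If f(x) ≥ 0, classify as spam (1)
   ● Else, classify as ham (0)
5. Optimization objective (with labels recoded as $y_i \in \{-1, +1\}$):
   $\min_{w, b} \ \tfrac{1}{2} \lVert w \rVert^{2} \quad \text{subject to} \quad y_i (w^{T} x_i + b) \ge 1 \ \text{for all } i$

6. Uses a linear kernel, so no nonlinear feature mapping is applied; this suits high-dimensional sparse text.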
7. Trained model finds the best boundary to classify new emails.
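The constrained objective above assumes the classes are perfectly separable. Practical implementations, including scikit-learn's SVC, solve a soft-margin variant with a penalty parameter C. The sketch below minimizes that soft-margin objective, written with the hinge loss, using plain sub-gradient descent on made-up data. It is an illustration under those assumptions and not how SVC solves the problem internally; the learning rate, epoch count, and data are arbitrary.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def hinge_objective(w, b, X, y, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * dot(w, w)
    slack = sum(max(0.0, 1 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return margin_term + C * slack


def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Sub-gradient descent on the soft-margin objective (labels in {-1, +1})."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of 0.5 * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:  # point violates the margin
                grad_w = [g - C * yi * v for g, v in zip(grad_w, xi)]
                grad_b -= C * yi
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data with labels recoded: spam -> +1, ham -> -1.
X = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.7]]
y = [1, 1, -1, -1]

start = hinge_objective([0.0, 0.0], 0.0, X, y)
w, b = train_linear_svm(X, y)
end = hinge_objective(w, b, X, y)
print(f"objective before: {start:.3f}, after: {end:.3f}")
print(f"w = {w}, b = {b:.3f}")
```

Only points that violate the margin contribute to the update, which mirrors the role of support vectors in the trained model.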
2.3 SOURCE CODE

# Step 1: Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import string
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Download the NLTK stopword list (needed once per environment)
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords_set = set(stopwords.words('english'))

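# Step 2: Load the Dataset
# Dataset: SMS Spam Collection (Kaggle CSV with columns v1 and v2)
# NOTE: the file name was lost in extraction; "spam.csv" is an assumption.
df = pd.read_csv("spam.csv", encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']

# Convert labels to binary
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()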

# Step 3: Data Cleaning and Preprocessing
def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stopwords_set])
    return text

df['clean_text'] = df['text'].apply(clean_text)
df[['text', 'clean_text']].head()

# Step 4: Feature Extraction with TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['clean_text']).toarray()
y = df['label_num'].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Logistic Regression Model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_test)

print(" Logistic Regression Evaluation:")
print(classification_report(y_test, lr_preds))
print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, lr_preds), annot=True, fmt='d', cmap='Blues')
plt.title("Logistic Regression Confusion Matrix")
plt.show()

# Step 6: Support Vector Machine (SVM) Model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
svm_preds = svm_model.predict(X_test)

print(" SVM Evaluation:")
print(classification_report(y_test, svm_preds))
print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, svm_preds), annot=True, fmt='d', cmap='Greens')
plt.title("SVM Confusion Matrix")
plt.show()

# Step 7: Accuracy Comparison
lr_acc = accuracy_score(y_test, lr_preds)
svm_acc = accuracy_score(y_test, svm_preds)

print(f"Logistic Regression Accuracy: {lr_acc:.4f}")
print(f"SVM Accuracy: {svm_acc:.4f}")

# Bar chart comparing the two accuracies
models = ['Logistic Regression', 'SVM']
scores = [lr_acc, svm_acc]
plt.bar(models, scores, color=['blue', 'green'])
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.ylim(0.9, 1.0)
plt.show()
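One design point worth noting: the listing fits TF-IDF on all messages before splitting, so vocabulary and IDF statistics from the test messages influence the features. A common alternative, sketched below as a suggestion rather than the method used in this study, is to split first and wrap the vectorizer and classifier in a scikit-learn pipeline, with a stratified split to preserve the ham/spam ratio. The tiny corpus is made up so the sketch runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus so the sketch is self-contained (not the real dataset).
ham = [
    "are we still meeting for lunch today",
    "please send me the report by friday",
    "i will call you after the class",
    "can you pick up some milk on the way home",
    "see you at the library tomorrow",
    "thanks for the notes from the lecture",
    "the train is late so i will be there soon",
    "happy birthday hope you have a great day",
    "do not forget the meeting at ten",
    "let me know when you get home",
]
spam = [
    "win a free prize call now",
    "free entry claim your cash reward today",
    "urgent you have won a guaranteed prize",
    "call now to claim your free holiday",
    "congratulations you won cash text win to claim",
    "free ringtone offer reply now",
    "claim your free bonus credit now",
    "you have been selected for a cash prize",
    "limited offer win a free phone today",
    "urgent reply now to claim your reward",
]
texts = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

# Split the raw text first; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
}
results = {}
for name, classifier in candidates.items():
    # The pipeline fits TF-IDF on the training messages only.
    model = make_pipeline(TfidfVectorizer(max_features=3000), classifier)
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy on the toy test set = {results[name]:.2f}")
```

On the real dataset the same pattern would apply with the cleaned message column in place of the toy lists; accuracies on this toy corpus carry no meaning.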
2.4 OUTPUT

Loading the Dataset (SMS Spam Collection CSV from Kaggle)

A preview of the dataset (first 5 rows):

Cleaned Dataset

The text is cleaned by removing extra spaces, converting to lowercase, and removing
stopwords, punctuation, and symbols.
After cleaning:

Training the Model

Logistic Regression Model

Support Vector Machine (SVM) Model

Comparison on Logistic Regression and SVM Model

The bar chart shows that both models are highly accurate, with SVM slightly
outperforming Logistic Regression.

Final Conclusion:

● SVM gives slightly better performance, especially in spam classification
(precision/recall).

● TF-IDF with either classifier works well for spam filtering.

3. CONCLUSION

In this case study, we successfully implemented and evaluated two machine learning
algorithms—Logistic Regression and Support Vector Machine (SVM)—for the task of email
spam detection. The objective was to classify messages as either spam or ham using textual
features derived from the content of the emails. By applying text preprocessing techniques and
using TF-IDF vectorization, we transformed raw messages into meaningful numerical data that
could be fed into classification models.

The models were trained and tested using a labeled dataset of real SMS/email messages. Both
classifiers achieved high accuracy, precision, and recall, demonstrating the effectiveness of
machine learning in solving real-world classification problems. While Logistic Regression
provided quick and interpretable results, the SVM model showed slightly better performance
in handling the high-dimensional and sparse nature of text data. The comparative analysis
highlighted that both models can be effectively used in spam detection, with SVM offering a
slight edge in precision and recall.

Overall, this study demonstrates the power and practicality of supervised learning methods in
the domain of cybersecurity and information filtering. It not only reinforces the importance of
preprocessing and feature engineering in text classification but also showcases how combining
traditional ML techniques with proper evaluation metrics can lead to reliable, scalable spam
detection systems. Future improvements could include the use of ensemble methods or deep
learning models like LSTM or BERT for even higher accuracy and adaptability.
4. BIBLIOGRAPHY

1. Mildenberger, T. (2015). Simon Rogers and Mark Girolami: A first course in machine
learning: CRC Press, Boca Raton, 2012, xx+ 285 pp., US $69.95, GB 36.99,€ 51.80,
ISBN 978-143982414-6.
2. Lantz, B. (2015). Machine learning with R (Vol. 452). Birmingham: Packt publishing.
3. Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning
(Vol. 4, No. 4, p. 738). New York: springer.
4. Jurafsky, D., & Martin, J. H. Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics, and Speech Recognition.
5. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
6. Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python:
analyzing text with the natural language toolkit. O'Reilly Media, Inc.
7. Sarker, S. K., Bhattacharjee, R., Sufian, M. A., Ahamed, M. S., Talha, M. A., Tasnim,
F., ... & Adrita, S. T. (2025, February). Email Spam Detection Using Logistic
Regression and Explainable AI. In 2025 International Conference on Electrical,
Computer and Communication Engineering (ECCE) (pp. 1-6). IEEE.
8. Olatunji, S. O. (2019). Improved email spam detection model based on support vector
machines. Neural Computing and Applications, 31(3), 691-699.
9. Amayri, O., & Bouguila, N. (2010). A study of spam filtering using support vector
machines. Artificial Intelligence Review, 34(1), 73-108.
10. Fatima, R., Fareed, M. M. S., Ullah, S., Ahmad, G., & Mahmood, S. (2024). An
optimized approach for detection and classification of spam email’s using ensemble
methods. Wireless Personal Communications, 139(1), 347-373.

Research Article On The Forensic
No ratings yet
Research Article On The Forensic
14 pages
Evaluation and Comparison of Machine Learning Models For Ham and Spam Email Classification
No ratings yet
Evaluation and Comparison of Machine Learning Models For Ham and Spam Email Classification
13 pages
Final Report (Saie)
No ratings yet
Final Report (Saie)
38 pages
Vaibhav Tiwari Final Project
No ratings yet
Vaibhav Tiwari Final Project
32 pages
Final PPT
No ratings yet
Final PPT
18 pages
Aryan Blackbook 1
No ratings yet
Aryan Blackbook 1
29 pages
Zoom
No ratings yet
Zoom
20 pages
EMAIL+SPAM+DETECTION Final Fishries++ (2658+to+2664) - 1
No ratings yet
EMAIL+SPAM+DETECTION Final Fishries++ (2658+to+2664) - 1
7 pages
Pruthviraj Micor Foml
No ratings yet
Pruthviraj Micor Foml
26 pages
Vishal FOML Micro Project Vishal & Milan
No ratings yet
Vishal FOML Micro Project Vishal & Milan
26 pages
Spam Detection in Email Using Machine Le
No ratings yet
Spam Detection in Email Using Machine Le
8 pages
Email
No ratings yet
Email
27 pages
Machine Learning for Email Spam Detection
No ratings yet
Machine Learning for Email Spam Detection
9 pages
Mini Project Final 10,42,52
No ratings yet
Mini Project Final 10,42,52
39 pages
Spam Detection for CS Students
No ratings yet
Spam Detection for CS Students
29 pages
Email Spam Detection PPT Github
No ratings yet
Email Spam Detection PPT Github
11 pages
Email Spam Detection Project Report
No ratings yet
Email Spam Detection Project Report
19 pages
Spam Email Classifier - Ramsanjay
No ratings yet
Spam Email Classifier - Ramsanjay
2 pages
Email Classification with Machine Learning
No ratings yet
Email Classification with Machine Learning
22 pages
E-Mail Spam Detection
No ratings yet
E-Mail Spam Detection
8 pages
EmailSpam
No ratings yet
EmailSpam
14 pages
Spam Mail Detection Using Machine Learning
No ratings yet
Spam Mail Detection Using Machine Learning
5 pages
DSP Report Taashif 22347 Aman 22035 Vivek 22373 Emailspamdetection
No ratings yet
DSP Report Taashif 22347 Aman 22035 Vivek 22373 Emailspamdetection
3 pages
IJRPR8167
No ratings yet
IJRPR8167
7 pages
PPT
0% (1)
PPT
15 pages
$RB0DCAN
No ratings yet
$RB0DCAN
10 pages
Data Science Report
No ratings yet
Data Science Report
33 pages
Spam Detection via ML & NLP
No ratings yet
Spam Detection via ML & NLP
44 pages
Deep Learning for Email Spam Detection
No ratings yet
Deep Learning for Email Spam Detection
4 pages
Spam Detection Synopsis
No ratings yet
Spam Detection Synopsis
8 pages
Investigating Evasive Techniques in Sms Spam Filtering A Comparative Analysis of Machine Learning Models Ijariie26436
No ratings yet
Investigating Evasive Techniques in Sms Spam Filtering A Comparative Analysis of Machine Learning Models Ijariie26436
10 pages
Spam Email Classifier
No ratings yet
Spam Email Classifier
17 pages
Title: Abstract
No ratings yet
Title: Abstract
2 pages
Ai Project
No ratings yet
Ai Project
8 pages
1822 B Deleted Merged Cropped
No ratings yet
1822 B Deleted Merged Cropped
40 pages
Project 2
No ratings yet
Project 2
10 pages
ML Lab
No ratings yet
ML Lab
13 pages
Project Report Emaildetection 4 44
No ratings yet
Project Report Emaildetection 4 44
41 pages
Spam Detection with Logistic Regression
No ratings yet
Spam Detection with Logistic Regression
4 pages
1822 B Deleted
No ratings yet
1822 B Deleted
38 pages
Presentation 3
No ratings yet
Presentation 3
13 pages
Spam Email Detection Using Machine Learning
No ratings yet
Spam Email Detection Using Machine Learning
8 pages
02 JCCE2202192 Online
No ratings yet
02 JCCE2202192 Online
5 pages
Email Spam Detection with ML
No ratings yet
Email Spam Detection with ML
5 pages
Automated Spam Detection Using ML
No ratings yet
Automated Spam Detection Using ML
4 pages
Detecting Spam in Emails
No ratings yet
Detecting Spam in Emails
12 pages
BT-3435 Ali
No ratings yet
BT-3435 Ali
49 pages
Final Report Spam Classifier
100% (1)
Final Report Spam Classifier
24 pages
44 Decision Tree Model For Email Classification
No ratings yet
44 Decision Tree Model For Email Classification
4 pages
Emai Spam Detection Using Machine Learning and Python - IJRPR3714
No ratings yet
Emai Spam Detection Using Machine Learning and Python - IJRPR3714
6 pages
$RVJ44FQ
No ratings yet
$RVJ44FQ
13 pages
Pending Proj
No ratings yet
Pending Proj
37 pages
Aiproject 2
No ratings yet
Aiproject 2
4 pages
Published Paper
No ratings yet
Published Paper
9 pages
Email Spam Final
No ratings yet
Email Spam Final
32 pages
Email Report
No ratings yet
Email Report
15 pages
Email Spam Classification
No ratings yet
Email Spam Classification
17 pages
Fin Irjmets1697888326
No ratings yet
Fin Irjmets1697888326
4 pages
B.Sc. Project: Email Spam Filter
No ratings yet
B.Sc. Project: Email Spam Filter
35 pages
Interleave Division Multiple Access (Idma)
No ratings yet
Interleave Division Multiple Access (Idma)
37 pages
Automata Theory and Formal Languages
No ratings yet
Automata Theory and Formal Languages
29 pages
On Tables of Random Numbers: Theoretical Computer Science
No ratings yet
On Tables of Random Numbers: Theoretical Computer Science
9 pages
Untitled Document
No ratings yet
Untitled Document
16 pages
Weaponizing Middleboxes For TCP Reflected Amplification
No ratings yet
Weaponizing Middleboxes For TCP Reflected Amplification
17 pages
Ayuda para Ensayos de Biología
100% (2)
Ayuda para Ensayos de Biología
7 pages
Ensayo Sobre Albert Einstein
100% (1)
Ensayo Sobre Albert Einstein
4 pages
17 Recruiting Advanced Analytics
No ratings yet
17 Recruiting Advanced Analytics
29 pages
Company Information Sheet Template
0% (1)
Company Information Sheet Template
3 pages
Grade 4 Computer Literacy Overview
No ratings yet
Grade 4 Computer Literacy Overview
5 pages
Online Exam Guide for Candidates
No ratings yet
Online Exam Guide for Candidates
5 pages
PRP Operating Instructions
No ratings yet
PRP Operating Instructions
48 pages
ITNE2005R Lab Tutorial 2 Administrative Access - Task 2
No ratings yet
ITNE2005R Lab Tutorial 2 Administrative Access - Task 2
12 pages
Linux 1
No ratings yet
Linux 1
12 pages
Research Article Secure Data Transmission Using Quantum Cryptography in Fog Computing
No ratings yet
Research Article Secure Data Transmission Using Quantum Cryptography in Fog Computing
8 pages
Infinity Open Enrollment - Infinity Testing Open Enrollment What Is Infinity - Infinity Is Cloud - Studocu
No ratings yet
Infinity Open Enrollment - Infinity Testing Open Enrollment What Is Infinity - Infinity Is Cloud - Studocu
1 page
Cyber Security 4
No ratings yet
Cyber Security 4
10 pages
Anshul Gupta: Software Engineer Resume
No ratings yet
Anshul Gupta: Software Engineer Resume
1 page
Push Pull July 1997 Issue of Webmaster Magazine
No ratings yet
Push Pull July 1997 Issue of Webmaster Magazine
6 pages
21csc302j CN Syllabus
0% (1)
21csc302j CN Syllabus
2 pages
PowerPoint Quick Reference Guide
No ratings yet
PowerPoint Quick Reference Guide
5 pages
Full Stack Developer Masterclass New PDF
No ratings yet
Full Stack Developer Masterclass New PDF
10 pages
Introduction to Cloud Security Overview
No ratings yet
Introduction to Cloud Security Overview
10 pages
Query Based Reports in Maximo: Overview of Maximo Ad-Hoc Reporting Functionality
No ratings yet
Query Based Reports in Maximo: Overview of Maximo Ad-Hoc Reporting Functionality
40 pages
Computer Operator Syllabus
100% (3)
Computer Operator Syllabus
7 pages
Install SQL Server 2022 Developer Edition and SSMS
No ratings yet
Install SQL Server 2022 Developer Edition and SSMS
18 pages
Sistem Informasi E-Marketplace Penyewaan Dan Penjualan Perlengkapan Kostum Ceremonial Di Kota Sampit Berbasis Web
No ratings yet
Sistem Informasi E-Marketplace Penyewaan Dan Penjualan Perlengkapan Kostum Ceremonial Di Kota Sampit Berbasis Web
5 pages
Lec-15-Network Security Fundamentals & Cryptographic Techniques
No ratings yet
Lec-15-Network Security Fundamentals & Cryptographic Techniques
28 pages
Needham-Schroeder Protocol Vulnerabilities
No ratings yet
Needham-Schroeder Protocol Vulnerabilities
4 pages
Azure Fundamentals Exam Guide
No ratings yet
Azure Fundamentals Exam Guide
8 pages
Selectra Communiation
100% (1)
Selectra Communiation
30 pages
WWW Prolific Com TW US ShowProduct Aspx Pcid 41 Showlevel 00
No ratings yet
WWW Prolific Com TW US ShowProduct Aspx Pcid 41 Showlevel 00
2 pages
Computer Office Application Syllabus
No ratings yet
Computer Office Application Syllabus
14 pages
Fatic
No ratings yet
Fatic
20 pages

Email spam detection is a classic example of a binary classification problem in machine

● To understand and apply machine learning techniques for binary classification

Column Name    Description
v1             The label of the message: "ham" (non-spam) or "spam" (unwanted
v2             The actual text content of the message (SMS or email body)

For ease of use, the columns are renamed during preprocessing:

Essential Python libraries are imported for:

● Data handling: pandas, numpy

2. Loading and Preparing the Dataset

● The SMS Spam Collection Dataset is loaded using pandas.

The text messages are cleaned using the following steps:

4. Feature Extraction (TF-IDF)

Two classification models are trained:

7. Prediction and Evaluation

● Both models make predictions on the test data.

1. Data Collection

● Used the SMS Spam Collection Dataset (5,572 messages labeled as

2. Data Preprocessing

● Renamed columns for clarity (v1 → label, v2 → text)

3. Feature Extraction

● Used TF-IDF Vectorizer to convert cleaned text into numerical vectors

4. Train-Test Split

● Split dataset into 80% training and 20% testing

● Trained Logistic Regression model for binary classification

6. Model Evaluation

● Predicted labels on test set using both models

7. Comparison & Visualization

● Compared results of both models
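The feature extraction and train-test split steps listed above can be illustrated with a small standard-library sketch. It uses the textbook TF-IDF formula (term frequency times log(N / document frequency)) with L2 normalisation, which is an assumption: scikit-learn's TfidfVectorizer applies a smoothed variant, so the numbers will differ. The helper names `tfidf_vectors` and `split_80_20` and the toy messages are hypothetical and not taken from the report.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) of textbook TF-IDF weights, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of messages that contain each term.
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [
            (counts[term] / len(toks)) * math.log(n_docs / doc_freq[term])
            if toks else 0.0
            for term in vocab
        ]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm for w in row] if norm else row)
    return vocab, rows


def split_80_20(items, seed=42):
    """Shuffle a copy of items and split it into 80% / 20% parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy messages invented for illustration (not from the dataset).
messages = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting for lunch",
    "call me when you are free",
    "see you at lunch",
]
vocab, X = tfidf_vectors(messages)
train_rows, test_rows = split_80_20(X)
```

Terms that occur in fewer messages receive a larger weight, which is why rare, spam-specific words tend to stand out in the resulting vectors.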

Logistic Regression (LR)

● If probability > 0.5 → Predicts spam, else ham.

● Trains using gradient descent to minimize log loss (cross-entropy).
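The two points above (a 0.5 probability threshold and gradient descent on the log loss) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the function names, learning rate, epoch count and toy feature counts are all hypothetical, and scikit-learn's LogisticRegression uses a different solver (L-BFGS by default) with regularisation.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(X, y, w, b):
    """Mean cross-entropy between labels y (0/1) and predicted probabilities."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b, threshold=0.5):
    """Label a message as spam when its probability exceeds the threshold."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return "spam" if p > threshold else "ham"


# Toy features invented for illustration: [count of "free", count of "meeting"].
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham
w, b = train_logistic_regression(X, y)
```

Calling `predict([2, 0], w, b)` on this toy data labels the message as spam, and the loss after training is lower than the loss at the all-zero starting weights.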

Support Vector Machine (SVM)

● Also used for binary classification: separates spam vs ham.

● Works well with sparse, high-dimensional data like emails.
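As a rough illustration of how an SVM separates the two classes, the sketch below trains a linear SVM by subgradient descent on the regularised hinge loss. The linear kernel is an assumption (the kernel used in the report is not visible in this section), and the function names, hyperparameters and toy data are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on lam/2 * ||w||^2 + mean hinge loss.

    Labels must be +1 (spam) or -1 (ham).
    """
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(x, w, b):
    """Classify by the side of the separating hyperplane the point falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return "spam" if score >= 0 else "ham"


# Toy features invented for illustration: [count of "free", count of "meeting"].
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

Only points that fall inside the margin contribute to each update, which is one reason this family of models copes well with sparse, high-dimensional inputs such as TF-IDF vectors.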

Step 1: Load the Dataset

● Import the dataset using pandas.

Step 2: Preprocess the Text Data

● Convert text to lowercase.

Step 3: Encode the Labels

Step 4: Text Vectorization

Step 5: Split the Dataset

● Use train_test_split() to divide data:

● Logistic Regression Model:

Step 7: Make Predictions

● Predict spam/ham labels on the test data using both models.

Step 8: Evaluate Model Performance

Step 9: Compare the Models

● Compare Logistic Regression and SVM performance

1. Logistic Regression Algorithm (for Spam Detection)

1. Input: Preprocessed email text converted into TF-IDF vectors.

5. Classify message:

7. Train until convergence, then test on unseen data.

2. SVM Algorithm (for Spam Detection)

1. Input: TF-IDF feature vectors of emails.

# Step 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from nltk.corpus import stopwords

# Step 2: Load the Dataset
df = pd.read_csv("[Link]", encoding='latin-1')[['v1', 'v2']]

# Convert labels to binary

● SVM gives slightly better performance, especially in spam classification

● TF-IDF with either classifier works well for spam filtering.