Email Classification Project
Presented by: Deepti Mishra
ID: 2022AAPS0333H
BITS Pilani, Hyderabad Campus
Intern at IDS Infotech
Introduction
The objective of this project is to detect spam emails using
machine learning.
In this project, I have:
Used a dataset of labeled emails with two columns: Text and
Spam. The Spam column contains 1 for spam emails and 0
for ham (non-spam) emails.
Implemented text processing techniques to transform email
content into numerical features.
Applied a Logistic Regression model to classify emails as spam
or non-spam.
Evaluated the performance of the model and tested it on new
email inputs.
Source of Dataset
Dataset Source:
The dataset used for this project is
sourced from Kaggle, a well-known
platform for data science and machine
learning datasets.
Dataset Details:
Our dataset contains two columns:
Text: The content of the email.
Spam: A binary label indicating
whether the email is spam (1) or non-
spam (0).
Feature Extraction
Text Data Processing:
Emails are processed using the TfidfVectorizer to convert text
data into numerical features.
This method helps in quantifying the importance of each word in
an email relative to the entire dataset.
TF-IDF (Term Frequency-Inverse Document Frequency):
Term Frequency (TF): Measures how frequently a term occurs in a
document.
Inverse Document Frequency (IDF): Measures how rare a term is across
all emails in the dataset. Terms that appear in many emails receive a
lower weight.
The product of TF and IDF gives a score representing the
importance of each word in an email.
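To make the TF and IDF definitions concrete, here is a minimal sketch that computes the textbook score, (term count / email length) * ln(N / document frequency), in plain Python. The three toy emails and the helper name tf_idf are illustrative assumptions, not part of the project. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its numbers differ from this simplified version.

```python
import math
from collections import Counter

# Toy corpus (hypothetical emails, not taken from the project's dataset).
docs = [
    "win a free prize now",
    "meeting agenda for monday",
    "free prize inside claim now",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of emails that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, tokens):
    """Textbook TF-IDF: (count / email length) * ln(N / document frequency)."""
    if doc_freq[term] == 0:
        return 0.0  # term never seen in the corpus
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


# "win" occurs in one email, "free" in two, so "win" scores higher.
print(tf_idf("win", tokenized[0]))
print(tf_idf("free", tokenized[0]))
```

A word that appears in only one email gets a higher score than a word of equal frequency that appears in several emails, which is the behaviour the IDF term is meant to produce.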
Transformation:
The text data is transformed into a TF-IDF matrix, which is then
used as input features for the machine learning model.
This matrix representation captures the significance of words in
emails, allowing the model to distinguish between spam and non-
spam emails effectively.
Data Preprocessing
Handling Missing Values:
Replace any missing values in the dataset with empty strings.
This prevents errors when the text is later vectorized.
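One possible way to do this, assuming the data is held in a pandas DataFrame, is sketched below. The three rows are made up for illustration. Only the column names Text and Spam come from the dataset description.

```python
import pandas as pd

# Hypothetical rows; the column names follow the dataset description.
df = pd.DataFrame({
    "Text": ["Win a free prize now", None, "Meeting at 10 am"],
    "Spam": [1, 0, 0],
})

# Replace missing email text with an empty string so that every
# entry passed to the vectorizer is a string.
df["Text"] = df["Text"].fillna("")

print(df["Text"].tolist())
```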
Label Encoding:
Encode the labels in the Spam column:
1 for spam emails.
0 for non-spam (ham) emails.
Converts categorical labels into numerical format suitable for machine learning algorithms.
Data Splitting:
Split the dataset into training and testing sets:
Training Set: 80% of the data is used to train the model.
Testing Set: 20% of the data is used to evaluate the model.
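The 80/20 split can be sketched with scikit-learn's train_test_split. The ten placeholder emails and the random_state value are assumptions made only so the example is reproducible. The slides do not state which seed, if any, was used.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder emails with alternating labels (hypothetical data).
texts = [f"email {i}" for i in range(10)]
labels = [1, 0] * 5

# test_size=0.2 holds out 20% of the rows for evaluation.
# random_state is an assumed value that makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 8 2
```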
Text Vectorization:
Use TfidfVectorizer to convert the email text into numerical features:
Fit on Training Data: Learn the vocabulary and IDF from the training data.
Transform Training and Testing Data: Apply the learned vocabulary to convert text into TF-IDF
features.
Prepares data for effective machine learning model training.
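The fit-then-transform order described above might look like the following sketch. The toy sentences are assumptions and the vectorizer is left at its default settings, which may differ from the options used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test emails.
train_texts = ["win a free prize now", "meeting agenda for monday"]
test_texts = ["free lottery prize"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from training data only.
X_train_features = vectorizer.fit_transform(train_texts)

# transform reuses the learned vocabulary; words not seen during
# fitting (here "lottery") are ignored.
X_test_features = vectorizer.transform(test_texts)

print(X_train_features.shape, X_test_features.shape)
```

Fitting only on the training set keeps information from the test emails out of the vocabulary and IDF weights, so the test accuracy remains a fair estimate.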
Algorithm Choice
Algorithm Chosen: Logistic Regression
Why Logistic Regression?
Simple and Interpretable
Effective for Binary Classification
Probabilistic Interpretation
Works Well with Sparse Data
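As a rough illustration of these points, the sketch below trains a Logistic Regression model directly on a sparse TF-IDF matrix and reads off class probabilities for a new email. The four training emails are invented, and the model uses scikit-learn's default hyperparameters, which are an assumption rather than the project's confirmed settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled emails: 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "claim your free cash prize",
    "meeting agenda for monday",
    "lunch with the project team",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse TF-IDF matrix

# Logistic Regression accepts the sparse matrix as-is.
model = LogisticRegression()
model.fit(X, labels)

# predict_proba returns one probability per class, ordered as model.classes_.
new_email = vectorizer.transform(["claim your free prize"])
proba = model.predict_proba(new_email)[0]
print(dict(zip(model.classes_, proba)))
```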
Accuracy and Output
Conclusion
Key Findings:
The spam email detection model achieved an accuracy of 99.59%
on the training data and 98.34% on the test data. The small gap
between the two suggests that the model generalizes well to
unseen emails.
Logistic Regression combined with TF-IDF vectorization proved
effective at identifying and filtering out spam emails.
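For reference, accuracy is the fraction of emails whose predicted label matches the true label. The sketch below shows the calculation with scikit-learn's accuracy_score on made-up labels. The values are illustrative and are not the project's results.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions (1 = spam, 0 = ham).
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0]

# 4 of the 5 predictions match the true labels.
accuracy = accuracy_score(y_true, y_pred)
print(f"{accuracy:.2%}")  # 80.00%
```

Computing this once on the training predictions and once on the test predictions gives the two figures that are compared above.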
Significance:
Accurate spam detection is crucial for enhancing email
security, protecting users from phishing attacks, malware,
and unwanted solicitations.
By leveraging machine learning techniques, we contribute
to a safer and more reliable communication environment for
individuals and businesses alike.
Thank you very much!