Email Classification Project

Uploaded by

f20220333

Presented by: Deepti Mishra
ID: 2022AAPS0333H
BITS Pilani, Hyderabad Campus
Intern at IDS Infotech
Introduction
The objective of this project is to detect spam emails using machine learning.
In this project, I have:
Used a dataset of labeled emails with two columns: Text and Spam. The Spam column contains 1 for spam emails and 0 for ham (non-spam) emails.
Implemented text processing techniques to transform email content into numerical features.
Applied a Logistic Regression model to classify emails as spam or non-spam.
Evaluated the performance of the model and tested it on new email inputs.
Source of Dataset
Dataset Source:
The dataset used for this project is sourced from Kaggle, a well-known platform for data science and machine learning datasets.

Dataset Details:
The dataset contains two columns:
Text: The content of the email.
Spam: A binary label indicating whether the email is spam (1) or non-spam (0).
Feature Extraction
Text Data Processing:
Emails are processed using the TfidfVectorizer to convert text data into numerical features.
This method quantifies the importance of each word in an email relative to the entire dataset.
TF-IDF (Term Frequency-Inverse Document Frequency):
Term Frequency (TF): Measures how frequently a term occurs in a document.
Inverse Document Frequency (IDF): Measures how rare a term is across the entire dataset; terms that appear in many emails receive a lower weight.
The product of TF and IDF gives a score representing the importance of each word in an email.
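To make the scoring concrete, here is a minimal sketch of the textbook form, score = TF x ln(N / DF), where N is the number of emails and DF is the number of emails containing the term. The toy emails and the `tf_idf` helper are hypothetical and are not taken from the project. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

# Hypothetical toy corpus; these emails are illustrative, not from the project dataset.
emails = [
    "win a free prize now",
    "meeting agenda for monday",
    "free free offer win now",
]

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(emails)
for email, email_scores in zip(emails, scores):
    print(email, {term: round(value, 3) for term, value in email_scores.items()})
```

In this sketch, "meeting" occurs in only one email and so gets a higher IDF than "free", which occurs in two. A term present in every email would score zero.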
Transformation:
The text data is transformed into a TF-IDF matrix, with one row per email and one column per vocabulary term, which is then used as input features for the machine learning model.
This matrix representation captures the significance of words in emails, which helps the model distinguish between spam and non-spam emails.
Data Preprocessing
Handling Missing Values:
Replace any missing values in the dataset with empty strings.
This ensures that every email has a valid text value before vectorization.
Label Encoding:
Encode the labels in the Spam column:
1 for spam emails.
0 for non-spam (ham) emails.
This gives the labels a numerical format suitable for machine learning algorithms.
Data Splitting:
Split the dataset into training and testing sets:
Training Set: 80% of the data is used to train the model.
Testing Set: 20% of the data is used to evaluate the model.
Text Vectorization:
Use TfidfVectorizer to convert the email text into numerical features:
Fit on Training Data: Learn the vocabulary and IDF weights from the training data only.
Transform Training and Testing Data: Apply the learned vocabulary and weights to convert text into TF-IDF features.
This prepares the data for model training, and fitting on the training set alone keeps test-set information out of the features.
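The preprocessing steps above could be sketched as follows with pandas and scikit-learn. This is an illustration under stated assumptions, not the project's actual code: the small DataFrame is a hypothetical stand-in for the Kaggle file, and `random_state=42` and the default TfidfVectorizer settings are choices made here for reproducibility.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset, using the same Text and Spam columns.
df = pd.DataFrame({
    "Text": [
        "win a free prize now", "meeting at noon", None,
        "claim your free reward", "project update attached",
        "lunch tomorrow", "free cash offer", "see you at the review",
        "urgent verify your account", "notes from the call",
    ],
    "Spam": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Handling missing values: replace missing text with an empty string.
df["Text"] = df["Text"].fillna("")

# Labels: 1 for spam, 0 for ham, stored as integers.
X = df["Text"]
y = df["Spam"].astype(int)

# Data splitting: 80% training, 20% testing (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Text vectorization: fit on the training text only, then transform both sets.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Both matrices share the same number of columns because the test set is transformed with the vocabulary learned from the training set.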
Algorithm Choice
Algorithm Chosen: Logistic Regression

Why Logistic Regression?

Simple and Interpretable
Effective for Binary Classification
Probabilistic Interpretation
Works Well with Sparse Data
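The probabilistic interpretation can be illustrated with a short sketch: logistic regression passes a weighted sum of the features through the sigmoid function to get a spam probability between 0 and 1. The weights, bias, and feature values below are hypothetical; a trained model would learn the weights from the data. Features are stored as a dictionary of non-zero entries only, which mirrors how sparse TF-IDF rows are handled.

```python
import math

def predict_spam_probability(features, weights, bias):
    """Logistic regression: sigmoid of the bias plus the weighted feature sum."""
    z = bias + sum(
        weights.get(term, 0.0) * value for term, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push toward spam, negative toward ham.
weights = {"free": 2.0, "prize": 1.5, "meeting": -2.0}
bias = -0.5

# Hypothetical sparse TF-IDF features (only non-zero terms are listed).
spam_like = {"free": 0.6, "prize": 0.4}
ham_like = {"meeting": 0.7}

p_spam = predict_spam_probability(spam_like, weights, bias)
p_ham = predict_spam_probability(ham_like, weights, bias)

# Classify as spam when the probability is at least 0.5.
print(round(p_spam, 3), p_spam >= 0.5)
print(round(p_ham, 3), p_ham >= 0.5)
```

Each learned weight can be read directly as evidence for or against spam, which is what makes the model simple to interpret.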
Accuracy and Output
Conclusion
Key Findings:
Our spam email detection model achieved an accuracy of 99.59% on the training data and 98.34% on the test data, demonstrating robust performance in classifying emails as spam or non-spam.
The small gap between training and test accuracy suggests that the model generalizes well to unseen emails.
Logistic Regression combined with TF-IDF vectorization proved effective at identifying and filtering out spam emails.
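As a rough illustration of how training and test accuracy might be computed and how a new email might be classified, here is a minimal scikit-learn sketch. The toy emails, `random_state=42`, and the default model settings are assumptions made for this example. It will not reproduce the figures reported above, which come from the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "claim your free reward today",
    "free cash offer just for you",
    "urgent verify your account to win",
    "exclusive offer claim your prize",
    "meeting agenda for monday",
    "project update attached",
    "lunch tomorrow at noon",
    "notes from the review call",
    "please review the attached report",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Fit the vectorizer on training text only, then transform both sets.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Accuracy is the fraction of emails whose predicted label matches the true label.
train_accuracy = accuracy_score(y_train, model.predict(X_train_tfidf))
test_accuracy = accuracy_score(y_test, model.predict(X_test_tfidf))
print(f"Training accuracy: {train_accuracy:.2%}")
print(f"Test accuracy: {test_accuracy:.2%}")

# Classify a new, unseen email with the same fitted vectorizer.
new_email = ["Congratulations, you have won a free prize"]
prediction = model.predict(vectorizer.transform(new_email))[0]
print("Spam" if prediction == 1 else "Ham")
```

With only ten emails the accuracy values here are not meaningful; the sketch only shows the mechanics of the evaluation.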

Significance:
Accurate spam detection is crucial for enhancing email
security, protecting users from phishing attacks, malware,
and unwanted solicitations.
By leveraging machine learning techniques, we contribute
to a safer and more reliable communication environment for
individuals and businesses alike.
Thank you very much!
