
BADMUS HIKMAT ADESEWA

125/18/2/0046
CYBER SECURITY
DEPARTMENT OF COMPUTER SCIENCES
SUPERVISED BY: Dr. ABIMBOLA R.O

30th MAY 2022


PROJECT TITLE:
DEVELOPMENT OF AN EMAIL SPAM DETECTION
MODEL FOR IDENTIFYING MALICIOUS ACTIVITIES
USING MACHINE LEARNING TECHNIQUES
PROJECT OUTLINE
 Background of the study
Statement of the problem
Aims and Objectives
Significance of the study
Justification of the study
Gap analysis
Methodology
Theoretical concept
Architecture
References
BACKGROUND OF THE STUDY
In recent years, the Internet has become an integral part of everyday life, making many human activities
faster and more convenient. As Internet usage grows, the number of email users increases day by day. This
growth has been accompanied by a rise in malicious activity, notably unsolicited bulk email messages known
as spam (Thashina et al., 2022). Email is an electronic messaging framework that transmits messages from one
user to another, and it is a major platform for user communication. Many platforms allow users to share
information across the world, but among them email is the simplest, cheapest and most rapid. Because of this
simplicity, however, email is vulnerable to different kinds of attacks, the most common and dangerous of
which is spam (Naeem et al., 2022).
STATEMENT OF THE PROBLEM
Hossam et al. (2019) reported that organizations and individuals continue to experience
malicious activities such as identity theft, credential theft and phishing as a result of email
spam. Email spam is a major concern in today's evolving Information Technology landscape and
poses a high level of risk to many sectors, including the academic, financial and industrial
sectors. The aftermath of these malicious activities includes disruption of the affected sector's
operations and loss of customer credentials and data, which an attacker could then use for
malevolent purposes. The proposed system aims to mitigate these malicious acts by detecting
email spam.
AIMS AND OBJECTIVES
The aim of this study is to develop a system that detects spam emails in order to prevent
malicious activities. The specific objectives are to:

i. Collect a dataset

ii. Formulate a machine learning model

iii. Simulate the machine learning model in (ii)

iv. Validate the simulated model and implement a prototype


SIGNIFICANCE OF THE STUDY
This study is intended to protect organizations, individuals and sectors such as the economic and
financial sectors from malicious activities related to email spam, including identity theft,
credential theft and phishing, through the development of an email spam detection system.
The system would detect spam emails, which may carry harmful content.
JUSTIFICATION OF THE STUDY
Research by Suparna et al. (2021) and Hanif et al. (2022) has shown that email users, whether
individuals or organizations, suffer losses from email-based attacks such as identity theft and
phishing. These attacks expose the important data, information and credentials of email users to
a high level of risk: an attacker could steal users' credentials and use them for malevolent
purposes, such as destroying data or information or tarnishing the image of the user or
organization. The proposed system would help mitigate these attacks by scanning incoming mail
and classifying each message as spam or ham (legitimate email), helping to keep email users safe
from such attacks. This study would also help reduce the number of individuals and organizations
attacked through spam emails.
GAP ANALYSIS
Authors & Year: (Yuliya et al., 2021)
Dataset: 4,360 non-spam and 1,368 spam samples from the Kaggle dataset
Classification Approach: Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbor (K-NN) and Decision Trees (DT)
Merits: Presented a comparative analysis of different ML algorithms
Gaps: For extracting important features, better DL-based feature learning algorithms can be used
Results: Accuracy 0.99, Precision 0.97, Recall 0.99, F-measure 0.98

Authors & Year: (Tushaar et al., 2020)
Dataset: Spam Assassin and Phishing corpus
Classification Approach: Random Forest (RF) classifier
Merits: Conducted a comprehensive comparison using different state-of-the-art machine learning classifiers to aid UBE filtering and classification
Gaps: Absence of an effective strategy for dealing with security assaults on UBE filters, inability of present UBE filters to deal with concept drift, and lack of effective UBE filters that make use of graphical characteristics
Results: Accuracy of 98.4% on the ham-spam dataset (AUPRC of 99.8%) and 99.4% on the ham-phishing dataset (AUPRC of 99.9%)

Authors & Year: (Bilge & Bahriye, 2020)
Dataset: Enron, CSDMC2010, and Turkish Email
Classification Approach: Artificial bee colony algorithm with logistic regression classification
Merits: Proposed a novel spam detection method
Gaps: In comparison to other methods, it has a high computational complexity (except for MLPs, CNNs, and DBB-RDNN-ReL)
Results: A 99.25% success rate on the Turkish Email dataset and a 98.70% success rate on the CSDMC2010 dataset

Authors & Year: (Mahmoud et al., 2021)
Dataset: Records from the UCI machine learning repository; the dataset contains 1367 spam e-mails and 4361 legitimate e-mails
Classification Approach: Support Vector Machine, Artificial Neural Network, J48, and Naïve Bayes
Merits: Demonstrates and reviews the performance evaluation of the most popular and effective machine learning techniques and algorithms for email spam classification and filtering
Gaps: Unexpected rise in volume of false positives and false negatives was not properly attended to
Results: Naïve Bayes, J48, Support Vector Machine and ANN methods have accuracies of 92.8%, 91.8%, 93.91% and 91.05% respectively
METHODOLOGY
1. DATASET COLLECTION
The dataset used is the Enron Email Dataset, collected from the Kaggle website. It contains approximately 500,000
emails generated by employees of the Enron Corporation and was obtained by the Federal Energy Regulatory Commission during
its investigation of Enron's collapse. Further details on the dataset are available at
https://www.kaggle.com/datasets/wcukierski/enron-email-dataset.
a. Data Normalization
The dataset is raw text, so it must be normalized before use. Normalization reduces the number of distinct words in the
dataset by reducing each term to its simplest form. This is achieved using NLTK, a Python library.
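The idea of reducing distinct words can be illustrated with a minimal sketch. It uses only the Python standard library and a deliberately crude suffix-stripping rule as a stand-in for a real stemmer; it is not NLTK's algorithm, and the `SUFFIXES` list and `normalize` function are hypothetical names chosen for illustration.

```python
import re

# Hypothetical suffix list for illustration only; a real stemmer
# (such as those shipped with NLTK) applies far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(text):
    """Lowercase the text, keep alphabetic tokens, and strip one common suffix."""
    tokens = re.findall(r"[a-z]+", text.lower())
    normalized = []
    for token in tokens:
        for suffix in SUFFIXES:
            # Strip only when a stem of at least three letters remains.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        normalized.append(token)
    return normalized

tokens = normalize("Winning PRIZES! Claimed prizes, claim now")
print(tokens)       # crude stems such as "priz" and "claim"
print(len(set(tokens)))  # fewer distinct terms than in the raw text
```

In this toy input, "Claimed" and "claim" collapse to one term and both spellings of "prizes" collapse to another, which is the reduction in vocabulary size that normalization aims for.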

b. Feature Extraction
Since the dataset is text, it must be converted into numerical vectors, because most machine learning algorithms
operate on numerical data rather than text. This conversion would be accomplished using the bag-of-words
technique.
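As a rough illustration of bag-of-words feature extraction, the sketch below builds a vocabulary from a toy two-document corpus and turns each document into a vector of word counts. It uses only the standard library; the function names and the sample documents are assumptions made for this example, not part of the proposed system.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct token in the corpus."""
    return sorted({token for doc in documents for token in doc.split()})

def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

# Toy corpus (hypothetical messages, already lowercased).
docs = ["free prize claim free", "meeting agenda attached"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so word order is discarded and only frequencies are kept.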
2. FORMULATION OF MACHINE LEARNING MODEL
A machine learning model would be formulated using supervised machine learning algorithms
such as Support Vector Machine, Naïve Bayes and Decision Tree.
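To make one of the named algorithms concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. The tiny training set and the function names are hypothetical; the actual study would presumably rely on a library implementation rather than hand-written code.

```python
import math
from collections import Counter

def train_naive_bayes(documents, labels):
    """Collect class frequencies and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary

def predict(tokens, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Smoothed likelihood of the token given the class.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
train_docs = [
    ["free", "prize", "claim"],
    ["win", "free", "cash"],
    ["meeting", "agenda", "attached"],
    ["project", "report", "attached"],
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_docs, train_labels)

print(predict(["free", "cash", "prize"], model))
print(predict(["meeting", "report"], model))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps unseen words from forcing a probability of zero.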
3. SIMULATION OF THE MACHINE LEARNING MODEL IN II
Simulation would be done in the Python programming language using libraries such as
scikit-learn (sklearn), pandas and NumPy. Here, the dataset is split into training
and testing sets, and each email is classified as spam or non-spam.
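The splitting step can be sketched as follows. This is a standard-library illustration under stated assumptions: the 80/20 ratio, the fixed seed, the `split_dataset` name and the placeholder emails are all hypothetical choices, and a library routine would normally be used in practice.

```python
import random

def split_dataset(samples, labels, test_ratio=0.2, seed=42):
    """Shuffle labelled samples reproducibly and split them into train and test lists."""
    paired = list(zip(samples, labels))
    random.Random(seed).shuffle(paired)  # seeded so the split is repeatable
    n_test = max(1, int(len(paired) * test_ratio))
    return paired[n_test:], paired[:n_test]

# Hypothetical placeholder data: ten messages with alternating labels.
samples = [f"email {i}" for i in range(10)]
labels = ["spam" if i % 2 else "ham" for i in range(10)]

train, test = split_dataset(samples, labels)
print(len(train), len(test))
```

Keeping the test set separate from the training set means the evaluation reflects how the model behaves on emails it has not seen.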
4. VALIDATION OF THE SIMULATED MODEL AND PROTOTYPE IMPLEMENTATION
Here, the simulated model would be validated using evaluation metrics such as Accuracy,
Precision, Recall and F-measure, where TP, TN, FP and FN denote the numbers of true positives,
true negatives, false positives and false negatives.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × Precision × Recall / (Precision + Recall)
Finally, an executable prototype of the proposed system would be implemented.
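A minimal sketch of how these metrics could be computed from true and predicted labels is given below. It treats "spam" as the positive class and computes the F-measure as the harmonic mean of precision and recall; the function name and the example labels are assumptions for illustration, not results from the study.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical labels: 2 true positives, 1 false negative, 1 false positive, 4 true negatives.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In this example accuracy is 6/8 while precision and recall are both 2/3, which shows why accuracy alone can look better than the model's actual performance on the spam class.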
THEORETICAL CONCEPTS
1. NLTK: The Natural Language Toolkit, a Python-based set of tools and programs for natural language
processing.
2. BAG OF WORDS: The bag-of-words strategy is the most common and straightforward of all feature extraction
procedures; it generates a word-presence feature set from all of the words in an instance. Each document is
viewed as a collection, or bag, containing all of its words. This yields a vector that gives the frequency of
each word in a document, including repeated words (Barushka & Hajek, 2019).
3. SCIKIT-LEARN: A Python library (sklearn) of simple and efficient tools for predictive data analysis, accessible
to everybody and reusable in various contexts. It contains many efficient tools for machine learning and
statistical modeling, including classification, regression, clustering and dimensionality reduction.
4. PANDAS: A Python library used for analyzing data.
5. NUMPY: A library for the Python programming language that adds support for large, multi-dimensional
arrays and matrices, along with a large collection of high-level mathematical functions that operate
on these arrays.
6. PRECISION: The proportion of positive-class predictions that actually belong to the positive class.
7. RECALL: The proportion of all positive examples in the dataset that are correctly predicted as
positive.
8. F-MEASURE: A single score that balances precision and recall, computed as their harmonic mean.
ARCHITECTURE
REFERENCES
Naeem Ahmed, Rashid Amin, Hamza Aldabbas, Deepika Koundal, Bader Alouffi, T. S. (2022). Machine Learning Techniques for Spam
Detection in Email and IoT Platforms: Analysis and Research Challenges. 1–36. https://doi.org/10.1155/2022/1862888

Kaddoura, S., Chandrasekaran, G., Popescu, D. E., & Duraisamy, J. H. (2022). A systematic literature review on spam content detection
and classification. 1–28. https://doi.org/10.7717/peerj-cs.830

Jazzar, M., F. Yousef, R., & Eleyan, D. (2021). Evaluation of Machine Learning Techniques for Email Spam Classification. International
Journal of Education and Management Engineering, 11(4), 35–42. https://doi.org/10.5815/ijeme.2021.04.04

Wikipedia. (2017). Machine learning. In Machine Learning (Vol. 45, Issue 13, pp. 40–48). https://en.wikipedia.org/wiki/Machine_learning

Sultana, T., Sapnaz, K. A., Sana, F., & Najath, J. (2022). Email based Spam Detection. 1–9. https://doi.org/10.17577/IJERTV9IS060087
