125/18/2/0046
CYBER SECURITY
DEPARTMENT OF COMPUTER SCIENCES
SUPERVISED BY: Dr. ABIMBOLA R.O
i. Collection of Dataset
(Yuliya et al., 2021)
- Dataset: 4,360 non-spam and 1,368 spam samples from the Kaggle dataset
- Techniques: Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbor (K-NN), and Decision Trees (DT)
- Contribution: presented a comparative analysis of different ML algorithms
- Gap: for extracting important features, better DL-based feature learning algorithms can be used
- Results: Accuracy 0.99, Precision 0.97, Recall 0.99, F-measure 0.98

(Tushaar et al., 2020)
- Dataset: SpamAssassin and Phishing corpus
- Techniques: Random Forest (RF) classifier
- Contribution: conducted comprehensive comparison research using different state-of-the-art machine learning classifiers to aid UBE filtering and classification
- Gap: absence of an effective strategy for dealing with security assaults on UBE filters, inability of present UBE filters to deal with concept drift, and lack of effective UBE filters that make use of graphical characteristics
- Results: accuracy of 98.4% on the ham–spam dataset (AUPRC of 99.8%) and 99.4% on the ham–phishing dataset (AUPRC of 99.9%)
GAP ANALYSIS CONTD…
(Bilge & Bahriye, 2020)
- Dataset: Enron, CSDMC2010, and Turkish Email
- Techniques: artificial bee colony algorithm with a logistic regression classifier
- Contribution: proposed a novel spam detection method
- Gap: in comparison to other methods (except for MLPs, CNNs, and DBB-RDNN-ReL), it has a high computational complexity
- Results: a 99.25% success rate on the Turkish Email dataset and a 98.70% success rate on the CSDMC2010 dataset

(Mahmoud et al., 2021)
- Dataset: records from the UCI machine learning repository; the dataset contains 1,367 spam e-mails and 4,361 legitimate e-mails
- Techniques: Support Vector Machine (SVM), Artificial Neural Network (ANN), J48, and Naïve Bayes
- Contribution: demonstrates and reviews the performance evaluation of the most popular and effective machine learning techniques and algorithms
- Gap: the unexpected rise in the volume of false positives and false negatives was not properly attended to
- Results: SVM, ANN, J48, and Naïve Bayes achieve accuracies of 92.8%, 91.8%, 93.91%, and 91.05%, respectively
b. Feature Extraction
Since our dataset is in text form, it must be converted into numerical vectors, because most machine learning algorithms rely on numerical data rather than text. This will be accomplished using the Bag-of-Words technique.
METHODOLOGY
2. FORMULATION OF MACHINE LEARNING MODEL
A machine learning model is formulated using supervised machine learning algorithms
such as Support Vector Machine, Naïve Bayes, and Decision Tree.
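The three supervised learners named above could be instantiated in scikit-learn as follows; this is a sketch with default hyperparameters, which would be tuned in practice.

```python
# Instantiating the supervised learners used in the formulation step.
# Hyperparameter choices here are illustrative defaults, not the project's.
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

models = {
    "Support Vector Machine": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),   # suited to word-count features
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
```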
3. SIMULATION OF THE MACHINE LEARNING MODEL IN II
Simulation will be done using the Python programming language with libraries such as
scikit-learn (sklearn), pandas, and NumPy. Here, the dataset is split into training
and testing sets, and each e-mail is classified as spam or non-spam.
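The splitting and classification step described above can be sketched end-to-end as follows; the tiny corpus and its spam/non-spam labels are invented for illustration.

```python
# End-to-end sketch of the simulation step: vectorise a toy corpus,
# split it into training and testing sets, and classify spam vs. non-spam.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now", "claim your reward today",
    "meeting at noon tomorrow", "please review the report",
    "free cash offer inside", "lunch with the team friday",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = non-spam (hypothetical labels)

X = CountVectorizer().fit_transform(messages)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42, stratify=labels
)

model = MultinomialNB().fit(X_train, y_train)
print(model.predict(X_test))  # predicted classes for the held-out messages
```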
CONTD…..
4. VALIDATION OF THE SIMULATED MODEL AND PROTOTYPE IMPLEMENTATION
Here, the simulated model will be validated using evaluation metrics such as
Accuracy, Precision, Recall, and F-measure, where TP, TN, FP, and FN denote true
positives, true negatives, false positives, and false negatives.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × (Precision × Recall) / (Precision + Recall)
Finally, an executable model of the proposed system would be implemented.
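The four metrics above can be computed directly with sklearn.metrics; the true and predicted label vectors below are hypothetical, chosen so the counts (TP=3, TN=3, FP=1, FN=1) are easy to check by hand.

```python
# Validation-step sketch: Accuracy, Precision, Recall, and F-measure
# computed on hypothetical true/predicted labels (1 = spam, 0 = non-spam).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # TP=3, TN=3, FP=1, FN=1

print("Accuracy :", accuracy_score(y_true, y_pred))   # (3+3)/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # 3/(3+1) = 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 3/(3+1) = 0.75
print("F-measure:", f1_score(y_true, y_pred))
```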
THEORETICAL CONCEPTS
1. NLTK: Natural Language Toolkit. A Python-based set of tools and programs for performing natural
language processing.
2. BAG OF WORDS: The bag-of-words strategy is the most common and straightforward of all feature extraction
procedures; it generates a word-presence feature set from all of an instance's words. Each document is viewed
as a collection, or bag, that contains all of its words. From it we obtain a vector form that tells us the frequency of
each word in a document, including repeated words (Barushka & Hajek, 2019).
3. SCIKIT-LEARN: A Python library offering simple and efficient tools for predictive data analysis, accessible and
reusable in various contexts. The sklearn library contains many efficient tools for machine learning and statistical
modeling, including classification, regression, clustering, and dimensionality reduction.
4. PANDAS: A Python library used for analyzing and manipulating data.
5. NumPy: A library for the Python programming language that adds support for large, multi-dimensional
arrays and matrices, along with a large collection of high-level mathematical functions to operate
on these arrays.
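A brief sketch of how pandas and NumPy work together on a labelled e-mail table; the two rows of data are invented for illustration.

```python
# pandas for tabular inspection, NumPy for array operations.
# The e-mail texts and labels below are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text": ["free prize inside", "see you at the meeting"],
    "label": ["spam", "ham"],
})

print(df["label"].value_counts())  # class distribution of the dataset
labels = df["label"].to_numpy()    # pandas column as a NumPy array
print(np.unique(labels))           # distinct classes present
```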
THEORETICAL CONCEPTS CONTD…
6. PRECISION: Quantifies the number of positive class predictions that actually belong to the positive class.
7. RECALL: Quantifies the number of positive class predictions made out of all positive examples in the
dataset.
8. F-MEASURE: Provides a single score that balances the concerns of both precision and recall in one number.
ARCHITECTURE
REFERENCES
Naeem Ahmed, Rashid Amin, Hamza Aldabbas, Deepika Koundal, Bader Alouffi, T. S. (2022). Machine Learning Techniques for Spam
Detection in Email and IoT Platforms : Analysis and Research Challenges. 1–36. https://doi.org/10.1155/2022/1862888
Kaddoura, S., Chandrasekaran, G., Popescu, D. E., & Duraisamy, J. H. (2022). A systematic literature review on spam content detection
and classification. 1–28. https://doi.org/10.7717/peerj-cs.830
Jazzar, M., F. Yousef, R., & Eleyan, D. (2021). Evaluation of Machine Learning Techniques for Email Spam Classification. International
Journal of Education and Management Engineering, 11(4), 35–42. https://doi.org/10.5815/ijeme.2021.04.04
Wikipedia. (2017). Machine learning. https://en.wikipedia.org/wiki/Machine_learning
Sultana, T., Sapnaz, K. A., Sana, F., & Najath, J. (2022). Email based Spam Detection. 1–9. https://doi.org/10.17577/IJERTV9IS060087