Wachemo University

College of Post Graduate Studies


School of Computing and Informatics
Department of Information Technology

Review of a Journal Article


on
Classification of Phishing Email Using Word Embedding and Machine
Learning Techniques

Submitted to: Sofonias Yitagesu (Ph.D.)

Name: Wadola Habte


ID: WCU/E/1401174

Hossana, Ethiopia
February 2023
1. Bibliographic Reference of the Article
The authors of the article: Somesha M. and Alwyn R. Pais
Title of the article: Classification of Phishing Email Using Word Embedding and Machine
Learning Techniques
Date of publication: 26 April 2022
Journal: Journal of Cyber Security and Mobility
Volume, Issue: 11, 3
Number of pages: 45
DOI: https://doi.org/10.13052/jcsm2245-1439.1131
2. Summary of the Article
The article under review is a study in the email security domain that classifies emails as
phishing (spam) or legitimate (ham). Phishing is a basic component of many cyberattacks and
causes significant financial harm to businesses and corporations. Email classification is a
crucial remedy for this problem, and artificial intelligence, particularly machine learning, is a
key technique for detecting fraudulent emails [1], [2]. Research on this problem is therefore
important. The authors reviewed related work and identified gaps that it did not address. The
research gaps identified were:
• Most of the studies in the reviewed literature used only open-source datasets available in
open-access repositories
• Some of the proposed methods require high computation time and memory because of
their complex architecture and large feature sets
• High rate of misclassification on test data
• Use of very small datasets
• Higher false negative rates
• Lower performance in some studies compared to others
• Some studies used only one-hot encoding for vector generation
• Use of imbalanced datasets and overfitting

To address the gaps identified in the earlier investigations, the authors proposed a machine
learning technique combined with word embedding methods and used a real-time dataset in
addition to the datasets available in open-access repositories. According to the study, real-time
data had to be collected because the behavior and methods of attackers change over time and the
existing open-source datasets cannot capture that change. The proposed approach combined five
machine learning classifiers with six word embedding techniques and evaluated the performance
of each combination. The machine learning algorithms used are Random Forest (RF), Decision
Tree (DT), Support Vector Machine (SVM), XGBoost, and Logistic Regression (LR). The word
embedding techniques used are Word2Vec-CBOW, Word2Vec-SkipGram, FastText-CBOW,
FastText-SkipGram, TF-IDF, and Count Vectorization. Among all classifiers, the RF algorithm
achieved the best accuracy in combination with FastText-CBOW, TF-IDF and Count Vectorizer
(CV) for the three datasets used. The result obtained is competitive and better than that of most
previously conducted studies.
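
To make the two simplest vectorization methods named above concrete, the following is a minimal sketch of Count Vectorization and one common TF-IDF variant in plain Python. The toy subject lines, the function names, the whitespace tokenizer and the unsmoothed formula idf = log(N / df) are assumptions made for illustration; they are not taken from the reviewed article, whose implementation details are not reproduced in this review.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def count_vectors(docs):
    """Return (vocabulary, term-count vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tfidf_vectors(docs):
    """Weight each raw count by idf = log(N / df), an assumed textbook variant."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each vocabulary term
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(vec, idf)] for vec in counts]

# Hypothetical subject lines, for illustration only
subjects = [
    "verify your account now",
    "meeting agenda for monday",
    "your account statement",
]
vocab, cv = count_vectors(subjects)
_, tfidf = tfidf_vectors(subjects)
```

In this sketch a term that occurs in every document receives a weight of zero, which is the usual motivation for TF-IDF over raw counts. Library implementations typically add smoothing and normalization, so their values would differ from the ones produced here.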

3. Assessment of the quality of data and methods


The research article has made contributions by:
• creating real-time phishing and legitimate email datasets on which the accuracy achieved
is close to that obtained on the publicly available datasets
• proposing a novel phishing email detection approach using machine learning
combined with word embedding that uses only four email header features and achieves
competitive accuracy

The datasets for the study were prepared from two sources: open-source legitimate and phishing
email datasets, and real-time data collection. The real-time data were collected from family
members, friends, research scholars and students of the institution. The prepared data were then
separated into three datasets, each containing a number of unique spam and ham emails. Four
email header fields, namely From, Return-Path, Subject and Message-ID, are used as input
features for the model.
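
As an illustration of how the four header fields might be pulled out of a raw message, the sketch below uses Python's standard email package. The function name, the sample message and the choice to join the four values into one string are assumptions for illustration only; the review does not describe how the authors preprocessed the headers.

```python
from email import message_from_string

# The four header fields the reviewed study uses as input features
HEADER_FIELDS = ("From", "Return-Path", "Subject", "Message-ID")

def extract_header_text(raw_email):
    """Join the four header values into one string for later vectorization."""
    msg = message_from_string(raw_email)
    # Missing headers become empty strings so every email yields four fields
    values = [msg.get(field, "") or "" for field in HEADER_FIELDS]
    return " ".join(v.strip() for v in values)

# Hypothetical raw message, for illustration only
raw = (
    "From: Support <support@example.com>\n"
    "Return-Path: <bounce@example.net>\n"
    "Subject: Verify your account\n"
    "Message-ID: <123@example.com>\n"
    "\n"
    "Body text is ignored here.\n"
)
features = extract_header_text(raw)
```

The resulting string could then be passed to any of the vectorization methods listed in the summary. The message body is never read, which matches the study's reliance on header fields alone.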

4. Assessment of research experimental or theoretical results

The authors used several evaluation metrics during the experiment to test the performance
of the proposed model. The metrics used are Precision, Recall, Specificity, F-score,
Accuracy, and Matthews Correlation Coefficient (MCC). The experiment was carried out three
times for each of the five machine learning classifiers combined with the six word embedding
methods, and the evaluation metrics were recorded. The RF algorithm with FastText-CBOW,
TF-IDF and CV outperformed all the other combinations. The evaluation metrics used are
suitable for measuring the performance of machine learning models for binary
classification [3]. The results were also compared with previous works, and the model
outperformed most of them. However, the work of Y. Fang et al. [4], published in 2019, has
better accuracy and a better false positive rate. Thus, although the authors claim that their
approach outperformed all previous works and conclude that the RF algorithm is the best choice
for phishing email classification, the improved recurrent CNN approach proposed by Y. Fang
et al. has better accuracy.
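
For reference, the six metrics named above can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions with phishing as the positive class. The function name and the counts are hypothetical and are not results from the reviewed article.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute six binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur and
    both labels are predicted at least once.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts, for illustration only
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

The false positive rate discussed in the comparison with Y. Fang et al. is 1 minus specificity, and the false negative rate is 1 minus recall, so both follow directly from the values returned here.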

5. Critical evaluation of the article

The title of the study is appropriate and indicates what the study sets out to do. The
abstract is also well written because it summarizes the problem addressed, the method used,
the data sources and the results obtained from the experiment. The introduction is also good
since it provides background information on the topic and summarizes the main contributions of
the study. The literature reviewed in the study spans 2009 to 2021, which indicates that some of
the sources were already outdated at the time the study was conducted. The conclusion
summarizes the findings of the study and gives recommendations for future work.

References
[1] Somesha M. and A. R. Pais, “Classification of Phishing Email Using Word Embedding and
Machine Learning Techniques,” J. Cyber Secur. Mobil., vol. 11, no. 3, May 2022,
doi: 10.13052/jcsm2245-1439.1131.
[2] W. Li, L. Ke, W. Meng, and J. Han, “An empirical study of supervised email classification in
Internet of Things: Practical performance and key influencing factors,” Int. J. Intell. Syst., vol.
37, no. 1, pp. 287–304, Jan. 2022, doi: 10.1002/int.22625.
[3] A. D’Agostino, “The Explanation You Need on Binary Classification Metrics,” Medium, Aug.
22, 2022.
https://towardsdatascience.com/the-explanation-you-need-on-binary-classification-metrics-321d280b590f
(accessed Feb. 25, 2023).
[4] Y. Fang, C. Zhang, C. Huang, L. Liu, and Y. Yang, “Phishing Email Detection Using
Improved RCNN Model With Multilevel Vectors and Attention Mechanism,” IEEE Access,
vol. 7, pp. 56329–56340, 2019, doi: 10.1109/ACCESS.2019.2913705.
