
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/367117336

EFFICIENT PREDICTION OF PHISHING WEBSITES USING MULTILAYER PERCEPTRON (MLP)

Article in Journal of Theoretical and Applied Information Technology · January 2023


3 authors:

Ammar Odeh Abdalraouf Almahdi Alarbi


Princess Sumaya University for Technology Karabuk University
83 PUBLICATIONS 812 CITATIONS 3 PUBLICATIONS 10 CITATIONS


Eman Abdelfattah
University of Bridgeport
35 PUBLICATIONS 502 CITATIONS


All content following this page was uploaded by Abdalraouf Almahdi Alarbi on 13 January 2023.



Journal of Theoretical and Applied Information Technology
31st August 2020. Vol.98. No 16
© 2005 – ongoing JATIT & LLS

ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195

EFFICIENT PREDICTION OF PHISHING WEBSITES USING MULTILAYER PERCEPTRON (MLP)
AMMAR ODEH 1, ABDALRAOUF ALARBI 2, ISMAIL KESHTA 3, EMAN ABDELFATTAH 4
1 Princess Sumaya University for Technology, P.O. Box 1438, Al-Jubaiha, Amman, 11941, Jordan
2 Department of Computer Engineering, Karabuk University, 78050, Karabuk, Turkey
3 College of Applied Sciences, AlMaarefa University, Riyadh, Kingdom of Saudi Arabia
4 School of Theoretical & Applied Science, Ramapo College of New Jersey, NJ, USA
1 a.odeh@psut.edu.jo, 2 abdalraoufa.m.alarbi@ogrenci.karabuk.edu.tr, 3 imohamed@mcst.edu.sa, 4 eabdelfa@ramapo.edu

ABSTRACT

Maximizing user protection from phishing websites is a primary objective in network design. Intelligent phishing detection models can help designers achieve this objective. Our proposed model aims to reduce computational time and increase security against phishing websites by applying an intelligent detection model. In this paper, we employ a Multilayer Perceptron (MLP) to achieve the highest accuracy and to find the optimal training ratio for maximizing internet security. The simulation results show that selecting the most significant features minimizes the computational time. The optimal training percentage is 70%, as it minimizes the time complexity and increases the model accuracy.
Keywords: MLP, Activation function, Semantic attack, Phishing

1. INTRODUCTION

Cyber-attacks are classified into two classes: syntactic attacks and semantic attacks. Syntactic attacks use malicious programs, such as worms, viruses, spyware, or adware, to harm computer networks or computer software [1]. In semantic attacks, the attackers use a computer system to fool the victim: the attack pretends to do one thing while doing something else, yet the computing system works exactly as intended [2, 3].

Semantic attacks circumvent technological protections by deliberately exploiting system attributes, such as system or machine applications, to trick the victim instead of targeting him/her directly [4]. Table 1 shows families of different semantic attacks such as Phishing, File Masquerading, Application Masquerading, Web Pop-Up, Advertisement, Social Networking, Removable Media, and Wireless [5].

Phishing is a kind of intrusion that acquires sensitive user information such as usernames, passwords, and other confidential information. Phishers use a variety of channels to fool users, for example, email, fake links, or phone calls [6].

Phishing is an attack by an individual or a group that uses social engineering strategies to solicit personally identifiable information from unsuspecting customers. Phishing emails are built to look as if they were sent from a legitimate institution or a familiar person. Often these emails try to lure recipients into clicking a link that takes them to a fraudulent site that seems credible [7].

A PhishLabs report identified phishing sites in 2019 that targeted 1,263 different brands belonging to 773 parent organizations. The top five targeted industries (Figure 1) comprised 83.9 percent of the total amount of phishing. United States organizations remained the most popular target for phishing scams in 2019, accounting for 84 percent of the total malware amount [8, 9].

Contemporary browsers such as Firefox typically use blacklists, i.e., comprehensive lists of fake URLs, to counter phishing attacks [10, 11]. When a link is submitted via the browser, the system scans the list for the URL and blocks the website if the entry exists. These approaches can be ineffective, as phishers may use false addresses to get past some filters. Studies show steady growth in both phishing activities and the associated costs [12, 13].
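The blacklist lookup described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of any particular browser: the `normalize` and `is_blocked` helpers and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to lowercase host plus path so trivial variants match."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    return host + path


# Hypothetical blacklist entries; a real list would be far larger.
BLACKLIST = {
    normalize(u)
    for u in [
        "http://paypa1-login.example.com/signin",
        "http://secure-bank.example.net/verify",
    ]
}


def is_blocked(url: str) -> bool:
    """Return True if the normalized URL appears in the blacklist."""
    return normalize(url) in BLACKLIST


# A listed URL is blocked even with different letter case or a trailing slash.
listed = is_blocked("HTTP://PayPa1-Login.example.com/signin/")
# A slightly altered host is not in the list, so the lookup misses it.
altered = is_blocked("http://paypa1-login2.example.com/signin")
print(listed, altered)
```

The second lookup shows the weakness noted in the text: an address that is not yet in the list passes the filter, which is one motivation for learning-based detection.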


Cyber-attacks cost companies more than $5 million between 2013 and 2017 [14].

Phishing attacks are classified into four main categories, as shown in Figure 2. In credential harvesting, the attacker sends a seemingly trusted link that leads to spoofed login pages. In extortion, the attacker asks victims for money, for example as a donation. In malware attacks, a hidden file is downloaded as soon as the victim clicks the link. In spear-phishing, the attacker targets high-level employees to push them to perform some tasks manually [15, 16].

2. LITERATURE REVIEW

Researchers have conducted a lot of work on website security. Some addressed routing security [17, 18], while others worked on intrusion detection, intrusion prevention, and smart grid security [19].

Pawan Prakash et al. proposed two methods to identify phishing websites. The first method uses five heuristics to enumerate combinations of known phishing websites in order to discover new phishing websites. The second method uses matching algorithms to find new phishing websites [20].

Samuel Marchal et al. analyzed website URLs and extracted features from them. Based on several queries to the Google and Yahoo search engines, the authors determined the keywords for each website. Then, the keywords and the extracted features were used in a machine learning classification algorithm to find the phishing websites in a real dataset [21]. In [22], the authors introduced models using machine learning and data mining algorithms to detect phishing websites.

The authors in [23] used an artificial neural network to spot phishing websites. The proposed work used 17 input neurons matching 17 characteristics in the dataset, one hidden layer, and two output neurons to decide whether or not the website is phishing. The dataset was divided into 80 percent for training and 20 percent for testing. The model achieved 92.48 percent accuracy.

The authors in [24] introduced a model relying on machine learning techniques called PLIFER. This model requires the age of the URL domain (?). In addition, ten features are extracted and a Random Forests (RFs) model is used to identify the phishing website. This model correctly identified 96% of phishing emails. Classification models are also used to identify phishing utilizing labeled datasets. Different classification methods use different kinds of features, such as URL-based and text-based features.

The authors of [25] proposed a hybrid ensemble feature selection (HEFS) framework to identify phishing websites relying on machine learning algorithms. A cumulative distribution gradient technique is used to extract the primary feature set. Then, the second set of features is extracted using a method called data perturbation ensemble. A Random Forests (RFs) model, an ensemble learner, is subsequently used to identify phishing websites. The results indicate that HEFS identified phishing features with a precision of up to 94.6 percent [25].

2.1 Preliminaries

This section provides a brief description of the phishing dataset used for the experimental comparison, as well as background on the search algorithm, the heat map, and the multilayer perceptron (MLP) algorithm used in this study.

2.2 Dataset

The dataset was collected from the PhishTank archive [26], the MillerSmiles archive [27], and Google search operators. The website phishing dataset consists of 30 features. These features were classified into four categories: address bar features, abnormal features, HTML and JavaScript features, and domain features.

2.3 Search algorithm (CfsSubsetEval)

Correlation-based Feature Subset Selection for machine learning evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy among them. The heat map is a visual representation of values in which the feature values in the graph are depicted as colors [28].

2.4 A Multilayer Perceptron (MLP)

An MLP is a feed-forward artificial neural network (ANN). An MLP consists of a large number of highly interconnected neurons working concurrently to accomplish certain tasks. An MLP mainly contains input and output layers and one or more hidden (intermediate) layers. Each node contains an activation function (e.g., sigmoid, RBF). The core mechanism of the MLP network consists of signals flowing sequentially through multiple layers from the input to the output layer [29].

The training phase of the MLP consists of three steps. In the first step, an input pattern X from the dataset is presented, and the output is generated and compared with the desired output. In the second step, the error is back-propagated


based on the error signal between the network's output and the desired output. The last step adjusts the synaptic weights. This process is repeated for the next input vector until all instances in the training set are processed [30].

3. THE PROPOSED SYSTEM

In this work, an intelligent neural network model for efficient phishing website detection on the Internet is presented using a classification algorithm. A web phishing dataset is used to evaluate the performance of the intelligent algorithm in terms of classification accuracy.

Figure 3 shows the block diagram of the proposed system. In the first step, the data are read and the needed features and their categories are recognized. Then, the dataset is cleaned and prepared in the proper format so that the file can be read in MATLAB and Python.

The second step is processing, which consists of three functions performed on the phishing website dataset. The first function is Rank(), which sorts the features from the most significant to the least significant according to their correlation with the class attribute. Based on the ranking function, the significance of each feature is calculated, and the features are sorted in descending order. For ranking, the MATLAB built-in procedure called the independent significance features test (IndFeat()) is used [31, 32]. Then, the attribute evaluator Correlation-based Feature Selection (CfsSubsetEval()) [33] is applied with a specific search method. Finally, the intersection of the features output by IndFeat() and CfsSubsetEval() is taken to obtain the best features for determining whether the URL is phishing or not.

In step 4, an MLP classifier is applied to the selected N features. Based on the training dataset, the machine learning model builds the optimal knowledge base: the intelligent model learns the correlation between the N features and the expected output. After that, the testing dataset is passed through the intelligent system. Then, the intelligent model is evaluated by measuring different performance metrics such as classification accuracy and computational speed.

4. EXPERIMENTAL WORK

The proposed model is set up based on the experimental parameters shown in Table 2. Table 2 lists the values of the important parameters such as the learning rate, number of epochs (number of passes through all instances in the dataset), number of hidden layers, batch size, and momentum.

This experiment was conducted on the Phishing Websites dataset; the dataset contains 30 attributes (one of them is a label). MATLAB is used to rank the features from the most significant to the least significant, and Python is used to draw the heat map shown in Figure 4. Also, the WEKA simulator v3.6 is used in the MLP classification process.

5. DISCUSSION OF RESULTS

To evaluate the performance of the intelligent classification algorithm MLP, the confusion matrix is used [34, 35]. The confusion matrix summarizes how the classifier has performed on the input dataset. Different performance metrics, such as recall, precision, accuracy, and F-measure, can be derived from this matrix. The confusion matrix consists of four possible outcomes, as shown in Table 3: false positive (FP), true positive (TP), false negative (FN), and true negative (TN) [36].

False Positives (FP) occur when the actual class of the test sample is negative and is wrongly marked as positive. True Negatives (TN) occur when the actual class of the test sample is negative and is marked correctly as negative. False Negatives (FN) occur when the actual class of the test sample is positive and is wrongly marked as negative. True Positives (TP) occur when the actual class of the test sample is positive and is marked correctly as positive.

Figure 6 shows the output of the experiments at different training ratios (50%, 60%, 70%, and 80%). Based on the output of the confusion matrix, the accuracy and F-measure are calculated.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{3} \]

6. CONCLUSION

This paper presents an intelligent model for detecting phishing websites on the Internet. It provides a comparative study among four training percentages using MLP classifiers. The main contribution of the proposed system is to build a


real-time intelligent classifier. In addition, the proposed intelligent system reduces the computational time by applying feature selection in the processing phase. The aim is to determine the most appropriate percentage of the training set using the MLP classification model for detecting phishing websites. It is observed that as the training percentage increases, the training time and computational complexity increase as well.

For future work, we intend to evaluate the performance of other machine learning classifiers and compare them to find the one that best improves URL security.

REFERENCES:

[1] Vysakh S Mohan, R Vinayakumar, KP Soman, and Prabaharan Poornachandran, "SPOOF net: syntactic patterns for identification of ominous online factors," in 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 258-263.
[2] Mohsen Rakhshandehroo and Mohammad Rajabdorri, "Time Series Analysis of Electricity Price and Demand to Find Cyber-attacks using Stationary Analysis," arXiv preprint arXiv:1907.11651, 2019.
[3] BB Gupta and Pooja Chaudhary, "Cross-Site Scripting Attacks: Classification, Attack, and Countermeasures," 2020.
[4] Yan Hu, Yuyan Sun, Youcheng Wang, and Zhiliang Wang, "An Enhanced Multi-Stage Semantic Attack Against Industrial Control Systems," IEEE Access, vol. 7, pp. 156871-156882, 2019.
[5] Matthijs Vos, "Characterizing infrastructure of DDoS attacks based on DDoSDB fingerprints," University of Twente, 2019.
[6] Surbhi Gupta, Abhishek Singhal, and Akanksha Kapoor, "A literature survey on social engineering attacks: Phishing attack," in 2016 international conference on computing, communication and automation (ICCCA), 2016, pp. 537-540.
[7] Brij B Gupta, Nalin AG Arachchilage, and Kostas E Psannis, "Defending against phishing attacks: taxonomy of methods, current issues and future directions," Telecommunication Systems, vol. 67, pp. 247-267, 2018.
[8] Michael Fiermonte, "The Threat of Social Engineering to Networked Systems," Utica College, 2019.
[9] CISM Max Alexander and CISSP CRISC, "Protect, Detect and Correct Methodology to Mitigate Incidents: Insider Threats," 2018.
[10] Peng Peng, Limin Yang, Linhai Song, and Gang Wang, "Opening the blackbox of virustotal: Analyzing online phishing scan engines," in Proceedings of the Internet Measurement Conference, 2019, pp. 478-485.
[11] Routhu Srinivasa Rao and Alwyn Roshan Pais, "Detection of phishing websites using an efficient feature-based machine learning framework," Neural Computing and Applications, vol. 31, pp. 3851-3873, 2019.
[12] Silas Formunyuy Verkijika, "“If you know what to do, will you take action to avoid mobile phishing attacks”: Self-efficacy, anticipated regret, and gender," Computers in Human Behavior, vol. 101, pp. 286-296, 2019.
[13] Liaqat Ali, "Cyber Crimes-A Constant Threat For The Business Sectors And Its Growth (A Study Of The Online Banking Sectors In GCC)," The Journal of Developing Areas, vol. 53, 2019.
[14] Sachin Kumar, "Cyber attacks & Its Security Predictions in 2020," CYBERNOMICS, vol. 1, pp. 39-43, 2019.
[15] Jason Thomas, "Individual cyber security: Empowering employees to resist spear phishing to prevent identity theft and ransomware attacks," International Journal of Business Management, vol. 12, pp. 1-23, 2018.
[16] Meir Jonathan Dahan, Lior Drihem, Amnon Perlmutter, and TAM Ofir, "System and method to detect and prevent phishing attacks," ed: Google Patents, 2017.
[17] Abdul Basit and Naveed Ahmed, "Path diversity for inter-domain routing security," in 2017 14th international Bhurban conference on applied sciences and technology (IBCAST), 2017, pp. 384-391.
[18] Yehuda Binder, "System and method for routing-based internet security," ed: Google Patents, 2015.
[19] Abdulrahaman Okino Otuoze, Mohd Wazir Mustafa, and Raja Masood Larik, "Smart grids security challenges: Classification by sources of threats," Journal of Electrical Systems and Information Technology, vol. 5, pp. 468-483, 2018.


[20] Pawan Prakash, Manish Kumar, Ramana Rao Kompella, and Minaxi Gupta, "Phishnet: predictive blacklisting to detect phishing attacks," in 2010 Proceedings IEEE INFOCOM, 2010, pp. 1-5.
[21] Samuel Marchal, Jérôme François, Radu State, and Thomas Engel, "Phishstorm: Detecting phishing with streaming analytics," IEEE Transactions on Network and Service Management, vol. 11, pp. 458-471, 2014.
[22] Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol. 41, pp. 5948-5959, 2014.
[23] Rami M Mohammad, Fadi Thabtah, and Lee McCluskey, "Predicting phishing websites based on self-structuring neural network," Neural Computing and Applications, vol. 25, pp. 443-458, 2014.
[24] Solomon Ogbomon Uwagbole, William J Buchanan, and Lu Fan, "Applied machine learning predictive analytics to SQL injection attack detection and prevention," in 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), 2017, pp. 1087-1090.
[25] Kang Leng Chiew, Choon Lin Tan, KokSheik Wong, Kelvin SC Yong, and Wei King Tiong, "A new hybrid ensemble feature selection framework for machine learning-based phishing detection system," Information Sciences, vol. 484, pp. 153-166, 2019.
[26] P PhishTank, "Join the fight against phishing," ed, 2016.
[27] Ravi Kiran Varma Penmatsa and Padmaprabha Kakarlapudi, "Web phishing detection: feature selection using rough sets and ant colony optimisation," International Journal of Intelligent Systems Design and Computing, vol. 2, pp. 102-113, 2018.
[28] K Selvakuberan, M Indradevi, and R Rajaram, "Combined Feature Selection and classification–A novel approach for the categorization of web pages," Journal of Information and Computing Science, vol. 3, pp. 083-089, 2008.
[29] Ali Asghar Heidari, Hossam Faris, Seyedali Mirjalili, Ibrahim Aljarah, and Majdi Mafarja, "Ant lion optimizer: theory, literature review, and application in multi-layer perceptron neural networks," in Nature-Inspired Optimizers, ed: Springer, 2020, pp. 23-46.
[30] Sankhadeep Chatterjee, Sarbartha Sarkar, Sirshendu Hore, Nilanjan Dey, Amira S Ashour, and Valentina E Balas, "Particle swarm optimization trained neural network for structural failure prediction of multistoried RC buildings," Neural Computing and Applications, vol. 28, pp. 2005-2016, 2017.
[31] MATLAB and Statistics Toolbox Release 2012b, MathWorks, Natick, Mass, USA, 2012.
[32] S. H. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, San Francisco, Calif, USA, 1998.
[33] http://www.cs.waikato.ac.nz/ml/weka/. (2020).
[34] Alem Abdelkader, Dahmani Youcef, and Allel Hadjali, "On the use of belief functions to improve high performance intrusion detection system," in 2016 12th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), 2016, pp. 266-270.
[35] Wahiba Ben Abdessalem Karaa, Amira S Ashour, Dhekra Ben Sassi, Payel Roy, Noreen Kausar, and Nilanjan Dey, "Medline text mining: an enhancement genetic algorithm based approach for document clustering," in Applications of Intelligent Optimization in Biology and Medicine, ed: Springer, 2016, pp. 267-287.
[36] Paulo Cavalin and Luiz Oliveira, "Confusion Matrix-Based Building of Hierarchical Classification," in Iberoamerican Congress on Pattern Recognition, 2018, pp. 271-278.


Table 1. Families of Semantic Attacks

Semantic Attack            Tools
Brute-Force Attack         An end-all method to crack a difficult password.
Dictionary Attack          The attacker uses a dictionary in an attempt to guess the password.
Denial-Of-Service Attack   The attack focuses on the interruption of a network service.
Backdoor                   Any secret method of bypassing normal authentication or security controls.
Eavesdropping              Listening to a private conversation.
Spoofing                   Falsifying data.
Privilege Escalation       The attacker is able to fool the system into giving them access to restricted data.
Phishing                   The attacker uses email, websites, or URLs to obtain usernames, passwords, and credit card details directly from users.
Clickjacking               The attacker tricks a user into clicking on a button.
File Masquerading          The attacker gives a malicious file a name close to one that could be trusted.

Figure 1. Top five targeted industries


Figure 2. Phishing Attacks categories

Figure 3. Block diagram of the proposed system.


The scheme operates in five stages, which are as follows:

1. Read the dataset.
2. Preprocessing.
3. Processing.
   a) Select attributes (calculate the significance level of each feature, sort in descending order).
      i. Rank
      ii. Attribute evaluator
      iii. Search method
4. Machine learning.
5. Performance evaluation.
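The processing stage (rank, attribute evaluator, search method, then intersection) can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the MATLAB IndFeat() or WEKA CfsSubsetEval() code used in the paper: ranking uses the absolute Pearson correlation with the class, the subset evaluator uses the standard correlation-based merit with a greedy forward search, and the feature names and toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def rank_features(columns, labels):
    """Sort feature names by |correlation with the class|, descending."""
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def cfs_merit(subset, columns, labels):
    """Merit grows with feature-class correlation, shrinks with redundancy."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def cfs_forward_select(columns, labels):
    """Greedy forward search: add the feature that most improves the merit."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        cand = max(sorted(remaining),
                   key=lambda f: cfs_merit(selected + [f], columns, labels))
        merit = cfs_merit(selected + [cand], columns, labels)
        if merit <= best:
            break
        selected.append(cand)
        best = merit
        remaining.remove(cand)
    return selected


# Toy data with hypothetical feature names; not the phishing dataset.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
columns = {
    "has_ip_address": [1, 1, 1, 1, -1, -1, -1, -1],  # perfectly predictive
    "long_url": [1, 1, 1, -1, -1, -1, -1, 1],        # partly predictive
    "noise": [1, -1, 1, -1, 1, -1, 1, -1],           # uncorrelated
}

top_ranked = set(rank_features(columns, labels)[:2])   # i. rank, keep top 2
cfs_subset = set(cfs_forward_select(columns, labels))  # ii-iii. evaluator + search
selected = top_ranked & cfs_subset                     # intersection of both
print(sorted(selected))
```

Taking the intersection keeps only features that both selectors agree on, which is why the final feature set can be smaller than either individual output.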

Table 2. Experimental parameters.

Parameter Value
Learning rate for MLP 0.3
Number of epochs for MLP 500
Number of hidden layers for MLP 1
Number of hidden neurons for MLP 1
Batch Size 100
Momentum 0.2
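To make the parameters in Table 2 concrete, the following is a minimal sketch of a one-hidden-layer MLP with sigmoid units trained by gradient descent with momentum. It is not the WEKA implementation used in the experiments: the `TinyMLP` class is hypothetical, weights are updated after every sample rather than in batches of 100, and the toy data (logical OR) stands in for the phishing dataset.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One-hidden-layer MLP with sigmoid units, trained by SGD with momentum."""

    def __init__(self, n_in, n_hidden=1, lr=0.3, momentum=0.2, seed=0):
        rng = random.Random(seed)
        self.lr, self.momentum = lr, momentum
        # Each row holds the weights of one neuron; the last entry is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        self.v1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.v2 = [0.0] * (n_hidden + 1)

    def forward(self, x):
        """Propagate the input through the hidden layer to the output."""
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x + [1.0])))
             for row in self.w1]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w2, h + [1.0])))
        return h, y

    def train_step(self, x, target):
        h, y = self.forward(x)
        # Output delta for squared error with a sigmoid output unit.
        d_out = (y - target) * y * (1.0 - y)
        # Hidden deltas are computed from the pre-update output weights.
        d_hid = [d_out * self.w2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for j, hj in enumerate(h + [1.0]):
            self.v2[j] = self.momentum * self.v2[j] - self.lr * d_out * hj
            self.w2[j] += self.v2[j]
        for j, row in enumerate(self.w1):
            for i, xi in enumerate(x + [1.0]):
                self.v1[j][i] = self.momentum * self.v1[j][i] - self.lr * d_hid[j] * xi
                row[i] += self.v1[j][i]

    def mse(self, data):
        return sum((self.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Toy, linearly separable data (logical OR); not the phishing dataset.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = TinyMLP(n_in=2, n_hidden=1, lr=0.3, momentum=0.2)
loss_before = net.mse(data)
for _ in range(500):  # 500 epochs, as in Table 2
    for x, t in data:
        net.train_step(x, t)
loss_after = net.mse(data)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The momentum term reuses a fraction of the previous weight change, which smooths successive updates; the learning rate scales how far each new gradient moves the weights.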

Figure 4. Heat map for features correlation


Figure 5. Structure of MLP


Table 3. Confusion matrix.

                           Predicted class
                           Positive    Negative
Actual class   Positive    TP          FN
               Negative    FP          TN
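The metrics derived from the confusion matrix can be computed as in the sketch below. The helper names and the example labels are hypothetical; the F-measure is taken here as the standard harmonic mean of precision and recall, which is an assumption since the paper does not state its formula.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # positive sample correctly marked positive
            else:
                fp += 1  # negative sample wrongly marked positive
        else:
            if a == positive:
                fn += 1  # positive sample wrongly marked negative
            else:
                tn += 1  # negative sample correctly marked negative
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F-measure from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}


# Made-up labels: 1 = phishing (positive), 0 = legitimate (negative).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
result = metrics(*counts)
print(counts, result)
```

In this example the counts are TP = 3, FP = 1, FN = 1, TN = 5, so precision and recall are both 3/4 and accuracy is 8/10.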

Figure 6. The output of Confusion matrix in different training Ratios


Figure 7. Percentages of the training data versus the accuracies and F-measures

Table 4. Comparison with other algorithms using 70% training dataset

Paper            Machine Learning Algorithm       Accuracy
[18]             NN                               94.07%
[19]             multi-label rule-based           94.8%
[20]             NN                               84%
[21]             FFNN                             87%
[22]             feed forward NN                  97.40%
[23]             logistic regression classifier   98.40%
[24]             Naïve Bayesian classifier        90%
[25]             HNB and J48                      96.25%
Proposed Model   MLP                              99.1%
