3 authors:
Eman Abdelfattah
University of Bridgeport
All content following this page was uploaded by Abdalraouf Almahdi Alarbi on 13 January 2023.
ABSTRACT
Maximizing user protection from phishing websites is a primary objective in the design of computer
networks. Intelligent phishing detection models can help designers achieve this objective. Our
proposed model aims to reduce the computational time and increase security against phishing
websites by applying an intelligent detection model. In this paper, we employ a Multilayer Perceptron
(MLP) to achieve the highest accuracy and to find the optimal training ratio for maximizing internet
security. The simulation results show that selecting the most significant features minimizes the
computational time. The optimal training percentage is 70%, as it minimizes the time complexity and
increases the model accuracy.
Keywords: MLP, Activation function, semantic attack, Phishing
Journal of Theoretical and Applied Information Technology
31st August 2020. Vol.98. No 16
© 2005 – ongoing JATIT & LLS
Cyber-attacks cost companies more than $5 million between 2013 and 2017 [14].

Phishing attacks are classified into four main categories, as shown in Figure 2. In credential
harvesting, the attacker sends a seemingly trusted link that leads to spoofed login pages. In
extortion, the attacker asks victims for money, for example in the form of a donation. Malware is a
hidden file that is downloaded as soon as the victim clicks the link. In spear-phishing, the attacker
targets high-level employees to coerce them into performing some tasks manually [15, 16].

2. LITERATURE REVIEW

Researchers have done considerable work on website security. Some have addressed routing
security [17, 18], while others have worked on intrusion detection, intrusion prevention, and smart
grid security [19].

Pawan Prakash et al. proposed two methods to identify phishing websites. The first method uses five
heuristics to enumerate combinations of known phishing websites in order to discover new phishing
websites. The second method uses matching algorithms to discover new phishing websites [20].

Samuel Marchal et al. analyzed website URLs and extracted features from them. Based on several
queries to the Google and Yahoo search engines, the authors determined keywords for each website.
The keywords and the extracted features were then used in a machine learning classification
algorithm to detect phishing websites in a real dataset [21]. In [22], the authors introduced models
that use machine learning and data mining algorithms to detect phishing websites.

The authors in [23] used an artificial neural network to detect phishing websites. The proposed
network used 17 input neurons, matching the 17 characteristics in the dataset, one hidden layer, and
two output neurons to decide whether or not a website is phishing. The dataset was split into 80
percent for training and 20 percent for testing. The model achieved 92.48 percent accuracy.

The authors in [24] introduced a model relying on machine learning techniques called PLIFER. This
model requires the age of the URL domain. In addition, ten features are extracted and a Random
Forests (RFs) model is used to identify phishing websites. The model correctly identified 96% of
phishing emails. Classification models are also used to identify phishing using labeled datasets.
Different classification methods use different features, such as URL-based and text-based features.

The authors of [25] proposed a hybrid ensemble feature selection (HEFS) framework to identify
phishing websites using machine learning algorithms. A cumulative distribution gradient technique
is used to extract the primary feature set. The second set of features is then extracted using a
method called data perturbation ensemble. A Random Forests (RFs) model, an ensemble learner, is
subsequently used to identify phishing websites. The results indicate that HEFS identified phishing
features with a precision of up to 94.6 percent [25].

2.1 Preliminaries
This section briefly describes the phishing dataset used for the experimental comparison and gives
background on the search algorithm, the heat map, and the multilayer perceptron (MLP) algorithm
used in this study.

2.2 Dataset
The dataset was collected from the PhishTank archive [26], the MillerSmiles archive [27], and Google
search operators. The website phishing dataset consists of 30 features, classified into four
categories: address bar features, abnormal features, HTML and JavaScript features, and domain
features.

2.3 Search algorithm (CfsSubsetEval)
Correlation-based Feature Subset Selection evaluates the worth of a subset of attributes by
considering the individual predictive ability of each feature along with the degree of redundancy
among them. The heat map is a visual presentation of values in which the feature values are
represented as colors [28].

2.4 A Multilayer Perceptron (MLP)
An MLP is a feed-forward artificial neural network (ANN). It consists of a large number of highly
interconnected neurons working concurrently to accomplish a task. An MLP contains an input layer,
an output layer, and one or more hidden (intermediate) layers. Each node applies an activation
function (for example, sigmoid or RBF). The core mechanism of the MLP is that signals flow
sequentially through the layers, from the input layer to the output layer [29].

The training phase of an MLP consists of three steps. In the first step, an input pattern X from the
dataset is presented, and the output is generated and compared with the desired output. The second
step is back-propagation
based on the error signal between the network's output and the desired output. The last step is
updating the synaptic weights. This process is repeated for the next input vector until all
instances in the training set have been processed [30].

3. THE PROPOSED SYSTEM
In this work, an intelligent neural network model for efficient detection of phishing websites on
the Internet is presented, based on a classification algorithm. A web phishing dataset is used to
evaluate the performance of the intelligent algorithm in terms of classification accuracy.

Figure 3 shows the block diagram of the proposed system. In the first step, the data are read and
the needed features and their categories are identified. The dataset is then cleaned and prepared
in a format that can be read in MATLAB and Python.

The second step is processing, which consists of three functions performed on the phishing website
dataset. The first function, Rank(), sorts the features from the most significant to the least
significant according to their correlation with the class attribute. The ranking function computes
the significance of each feature, and the features are then sorted in descending order. For
ranking, the MATLAB built-in procedure called independent significance features test (IndFeat())
is used [31, 32]. Next, the attribute evaluator Correlation-based Feature Selection
(CfsSubsetEval()) [33], based on a specific search method, is applied. Finally, the intersection of
the features output by IndFeat() and CfsSubsetEval() is taken, so that the best features are used
to determine whether a URL is phishing or not.

In step 4, an MLP classifier is applied to the selected N features. Based on the training dataset,
the machine learning model builds its knowledge base: the intelligent model learns the correlation
between the N features and the expected output. The testing dataset is then passed through the
intelligent system, and the model is evaluated by measuring performance metrics such as
classification accuracy and computational speed.

4. EXPERIMENTAL WORK
Table 2 lists the values of the important parameters, such as the learning rate, number of epochs
(number of passes through all instances in the dataset), number of hidden layers, batch size, and
momentum.

This experiment was conducted on the Phishing Websites dataset; the dataset contains 30 attributes
(one of them is a label). MATLAB is used to rank the features from the most significant to the
least significant, and Python is used to draw the heat map shown in Figure 4. The WEKA simulator
v3.6 is used for the MLP classification process.

5. DISCUSSION OF RESULTS
To evaluate the performance of the intelligent classification algorithm MLP, the confusion matrix
is used [34, 35]. The confusion matrix summarizes how the classifier has performed on the input
dataset. Performance metrics such as recall, precision, accuracy, and F-measure can be derived
from this matrix. The confusion matrix consists of four possible outcomes, as shown in Table 3:
false positive (FP), true positive (TP), false negative (FN), and true negative (TN) [36].

A false positive (FP) occurs when the actual class of the test sample is negative and it is wrongly
marked as positive. A true negative (TN) occurs when the actual class of the test sample is
negative and it is correctly marked as negative. A false negative (FN) occurs when the actual class
of the test sample is positive and it is wrongly marked as negative. A true positive (TP) occurs
when the actual class of the test sample is positive and it is correctly marked as positive.

Figure 6 shows the results of the experiments for different training ratios (50%, 60%, 70%, and
80%). Based on the confusion matrix, the accuracy and F-measure are calculated.

Precision = TP / (TP + FP)    (1)
Recall = TP / (TP + FN)    (2)
Accuracy = (TP + TN) / (TP + FP + FN + TN)    (3)

6. CONCLUSION
real-time intelligent classifier. In addition, the proposed intelligent system reduces the
computational time by applying feature selection in the processing phase. The aim is to determine
the most appropriate percentage of the training set using the MLP classification model for
detecting phishing websites. It is observed that as the training percentage increases, the training
time and computational complexity increase as well.

For future work, we intend to evaluate the performance of other machine learning classifiers and
compare them to find the one that best improves URL security.

REFERENCES:

[1] Vysakh S Mohan, R Vinayakumar, KP Soman, and Prabaharan Poornachandran, "SPOOF net: syntactic patterns for identification of ominous online factors," in 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 258-263.
[2] Mohsen Rakhshandehroo and Mohammad Rajabdorri, "Time Series Analysis of Electricity Price and Demand to Find Cyber-attacks using Stationary Analysis," arXiv preprint arXiv:1907.11651, 2019.
[3] BB Gupta and Pooja Chaudhary, "Cross-Site Scripting Attacks: Classification, Attack, and Countermeasures," 2020.
[4] Yan Hu, Yuyan Sun, Youcheng Wang, and Zhiliang Wang, "An Enhanced Multi-Stage Semantic Attack Against Industrial Control Systems," IEEE Access, vol. 7, pp. 156871-156882, 2019.
[5] Matthijs Vos, "Characterizing infrastructure of DDoS attacks based on DDoSDB fingerprints," University of Twente, 2019.
[6] Surbhi Gupta, Abhishek Singhal, and Akanksha Kapoor, "A literature survey on social engineering attacks: Phishing attack," in 2016 international conference on computing, communication and automation (ICCCA), 2016, pp. 537-540.
[7] Brij B Gupta, Nalin AG Arachchilage, and Kostas E Psannis, "Defending against phishing attacks: taxonomy of methods, current issues and future directions," Telecommunication Systems, vol. 67, pp. 247-267, 2018.
[8] Michael Fiermonte, "The Threat of Social Engineering to Networked Systems," Utica College, 2019.
[9] Max Alexander, "Protect, Detect and Correct Methodology to Mitigate Incidents: Insider Threats," 2018.
[10] Peng Peng, Limin Yang, Linhai Song, and Gang Wang, "Opening the blackbox of virustotal: Analyzing online phishing scan engines," in Proceedings of the Internet Measurement Conference, 2019, pp. 478-485.
[11] Routhu Srinivasa Rao and Alwyn Roshan Pais, "Detection of phishing websites using an efficient feature-based machine learning framework," Neural Computing and Applications, vol. 31, pp. 3851-3873, 2019.
[12] Silas Formunyuy Verkijika, "“If you know what to do, will you take action to avoid mobile phishing attacks”: Self-efficacy, anticipated regret, and gender," Computers in Human Behavior, vol. 101, pp. 286-296, 2019.
[13] Liaqat Ali, "Cyber Crimes-A Constant Threat For The Business Sectors And Its Growth (A Study Of The Online Banking Sectors In GCC)," The Journal of Developing Areas, vol. 53, 2019.
[14] Sachin Kumar, "Cyber attacks & Its Security Predictions in 2020," CYBERNOMICS, vol. 1, pp. 39-43, 2019.
[15] Jason Thomas, "Individual cyber security: Empowering employees to resist spear phishing to prevent identity theft and ransomware attacks," International Journal of Business Management, vol. 12, pp. 1-23, 2018.
[16] Meir Jonathan Dahan, Lior Drihem, Amnon Perlmutter, and TAM Ofir, "System and method to detect and prevent phishing attacks," ed: Google Patents, 2017.
[17] Abdul Basit and Naveed Ahmed, "Path diversity for inter-domain routing security," in 2017 14th international Bhurban conference on applied sciences and technology (IBCAST), 2017, pp. 384-391.
[18] Yehuda Binder, "System and method for routing-based internet security," ed: Google Patents, 2015.
[19] Abdulrahaman Okino Otuoze, Mohd Wazir Mustafa, and Raja Masood Larik, "Smart grids security challenges: Classification by sources of threats," Journal of Electrical Systems and Information Technology, vol. 5, pp. 468-483, 2018.
[20] Pawan Prakash, Manish Kumar, Ramana Rao Kompella, and Minaxi Gupta, "Phishnet: predictive blacklisting to detect phishing attacks," in 2010 Proceedings IEEE INFOCOM, 2010, pp. 1-5.
[21] Samuel Marchal, Jérôme François, Radu State, and Thomas Engel, "Phishstorm: Detecting phishing with streaming analytics," IEEE Transactions on Network and Service Management, vol. 11, pp. 458-471, 2014.
[22] Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol. 41, pp. 5948-5959, 2014.
[23] Rami M Mohammad, Fadi Thabtah, and Lee McCluskey, "Predicting phishing websites based on self-structuring neural network," Neural Computing and Applications, vol. 25, pp. 443-458, 2014.
[24] Solomon Ogbomon Uwagbole, William J Buchanan, and Lu Fan, "Applied machine learning predictive analytics to SQL injection attack detection and prevention," in 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), 2017, pp. 1087-1090.
[25] Kang Leng Chiew, Choon Lin Tan, KokSheik Wong, Kelvin SC Yong, and Wei King Tiong, "A new hybrid ensemble feature selection framework for machine learning-based phishing detection system," Information Sciences, vol. 484, pp. 153-166, 2019.
[26] P PhishTank, "Join the fight against phishing," ed, 2016.
[27] Ravi Kiran Varma Penmatsa and Padmaprabha Kakarlapudi, "Web phishing detection: feature selection using rough sets and ant colony optimisation," International Journal of Intelligent Systems Design and Computing, vol. 2, pp. 102-113, 2018.
[28] K Selvakuberan, M Indradevi, and R Rajaram, "Combined Feature Selection and classification–A novel approach for the categorization of web pages," Journal of Information and Computing Science, vol. 3, pp. 083-089, 2008.
[29] Ali Asghar Heidari, Hossam Faris, Seyedali Mirjalili, Ibrahim Aljarah, and Majdi Mafarja, "Ant lion optimizer: theory, literature review, and application in multi-layer perceptron neural networks," in Nature-Inspired Optimizers, ed: Springer, 2020, pp. 23-46.
[30] Sankhadeep Chatterjee, Sarbartha Sarkar, Sirshendu Hore, Nilanjan Dey, Amira S Ashour, and Valentina E Balas, "Particle swarm optimization trained neural network for structural failure prediction of multistoried RC buildings," Neural Computing and Applications, vol. 28, pp. 2005-2016, 2017.
[31] MATLAB and Statistics Toolbox Release 2012b, MathWorks, Natick, Mass, USA, 2012.
[32] S. H. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, San Francisco, Calif, USA, 1998.
[33] http://www.cs.waikato.ac.nz/ml/weka/. (2020).
[34] Alem Abdelkader, Dahmani Youcef, and Allel Hadjali, "On the use of belief functions to improve high performance intrusion detection system," in 2016 12th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), 2016, pp. 266-270.
[35] Wahiba Ben Abdessalem Karaa, Amira S Ashour, Dhekra Ben Sassi, Payel Roy, Noreen Kausar, and Nilanjan Dey, "Medline text mining: an enhancement genetic algorithm based approach for document clustering," in Applications of Intelligent Optimization in Biology and Medicine, ed: Springer, 2016, pp. 267-287.
[36] Paulo Cavalin and Luiz Oliveira, "Confusion Matrix-Based Building of Hierarchical Classification," in Iberoamerican Congress on Pattern Recognition, 2018, pp. 271-278.
Table 2. Values of the important parameters

Parameter                            Value
Learning rate for MLP                0.3
Number of epochs for MLP             500
Number of hidden layers for MLP      1
Number of hidden neurons for MLP     1
Batch Size                           100
Momentum                             0.2
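
To make the role of these parameters concrete, the following is a minimal Python sketch of the three training steps (forward pass, back-propagation of the error signal, weight update) with one hidden layer. It is an illustration under stated assumptions, not the WEKA implementation used in the experiments: the sigmoid activation, the squared-error loss, the per-sample update (the batch size is not modeled), the classical momentum rule, and the toy data are all assumptions, and every function name is hypothetical.

```python
import math
import random

# Hyperparameters copied from the parameter table; everything else is assumed.
LEARNING_RATE = 0.3
MOMENTUM = 0.2
EPOCHS = 500
HIDDEN_NEURONS = 1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, rng):
    """Return small random weights (last entry of each row is the bias)
    together with zeroed momentum buffers of the same shape."""
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    v_hidden = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]
    v_out = [0.0] * (n_hidden + 1)
    return w_hidden, w_out, v_hidden, v_out


def forward(x, w_hidden, w_out):
    """Step 1: propagate input pattern x to the hidden layer and the output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x + [1.0])))
              for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden + [1.0])))
    return hidden, output


def train_step(x, target, w_hidden, w_out, v_hidden, v_out):
    hidden, output = forward(x, w_hidden, w_out)
    # Step 2: back-propagate the error signal (squared error, sigmoid slope).
    delta_out = (output - target) * output * (1.0 - output)
    delta_hidden = [delta_out * w_out[j] * h * (1.0 - h)
                    for j, h in enumerate(hidden)]
    # Step 3: update the synaptic weights using classical momentum.
    for j, h in enumerate(hidden + [1.0]):
        v_out[j] = MOMENTUM * v_out[j] - LEARNING_RATE * delta_out * h
        w_out[j] += v_out[j]
    for j, row in enumerate(w_hidden):
        for i, xi in enumerate(x + [1.0]):
            v_hidden[j][i] = (MOMENTUM * v_hidden[j][i]
                              - LEARNING_RATE * delta_hidden[j] * xi)
            row[i] += v_hidden[j][i]


def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2
               for x, t in data) / len(data)


def train(data, seed=0):
    """Train for EPOCHS passes; return the weights and the loss before/after."""
    rng = random.Random(seed)
    w_hidden, w_out, v_hidden, v_out = init_network(
        len(data[0][0]), HIDDEN_NEURONS, rng)
    loss_before = mean_squared_error(data, w_hidden, w_out)
    for _ in range(EPOCHS):
        for x, target in data:
            train_step(x, target, w_hidden, w_out, v_hidden, v_out)
    loss_after = mean_squared_error(data, w_hidden, w_out)
    return w_hidden, w_out, loss_before, loss_after


# Toy stand-in for the phishing data (an assumption, not the real dataset):
# two features in {-1, 1}; the label is 1 if either feature is 1.
TOY_DATA = [([-1.0, -1.0], 0.0), ([-1.0, 1.0], 1.0),
            ([1.0, -1.0], 1.0), ([1.0, 1.0], 1.0)]
```

Calling train(TOY_DATA) returns the trained weights together with the mean squared error before and after training, which gives a quick check that the three steps reduce the error on the toy data.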
Table 3. Confusion matrix

                    Predicted positive    Predicted negative
Actual positive     TP                    FN
Actual negative     FP                    TN
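
As an illustration of how the four outcomes feed the performance metrics, the sketch below tallies a confusion matrix from actual and predicted labels and derives precision, recall, and accuracy as in Equations (1) to (3). The F-measure is computed as the usual harmonic mean of precision and recall, which the text names but does not write out. The function names and the choice of 1 as the positive label are assumptions made for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally (TP, FP, FN, TN) from parallel sequences of class labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # actual positive, marked positive
            else:
                fp += 1  # actual negative, wrongly marked positive
        else:
            if a == positive:
                fn += 1  # actual positive, wrongly marked negative
            else:
                tn += 1  # actual negative, marked negative
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Derive precision, recall, accuracy and F-measure from the counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total
    # Harmonic mean of precision and recall (standard F1 definition).
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}
```

For example, metrics(*confusion_counts(actual, predicted)) turns two label lists into the four metrics in one call.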
Figure 7. Percentages of the training data versus the accuracies and F-measures
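
A comparison across training percentages such as the one in this figure could be organized as in the sketch below. It only shows the splitting and the loop over the ratios 50%, 60%, 70%, and 80%. The evaluate callback is a hypothetical placeholder for training and scoring a classifier, and the shuffling with a fixed seed is an assumption, since the text does not say how the splits were drawn.

```python
import random

TRAIN_PERCENTAGES = (50, 60, 70, 80)  # training ratios compared in the experiments


def split_by_percentage(samples, train_pct, seed=0):
    """Shuffle a copy of samples and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding of the cut point.
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]


def sweep_training_ratios(samples, evaluate, percentages=TRAIN_PERCENTAGES):
    """Return {train_pct: evaluate(train, test)} for each training percentage.

    evaluate is a hypothetical callback that would train a classifier on
    train and return its score (for example accuracy) on test.
    """
    return {pct: evaluate(*split_by_percentage(samples, pct))
            for pct in percentages}
```

Passing a callback that returns (len(train), len(test)) is a simple way to confirm the split sizes before plugging in a real classifier.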