Detecting Phishing Websites Using Machine Learning

2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) | 978-1-6654-6835-0/22/$31.00 ©2022 IEEE | DOI: 10.1109/HORA55278.2022.9799917
Safa Alrefaai, Ghina Özdemir, Afnan Mohamed
Computer Engineering, Istanbul Kultur University, Istanbul, Turkey
1700005186@stu.iku.edu.tr, 1800004497@stu.iku.edu.tr, 1900004466@stu.iku.edu.tr
Abstract—Phishing is a social-engineering cyber attack in which valuable data or personal information may be stolen through either emails or websites. Many methods are available to detect phishing, and new ones continue to be introduced in an attempt to increase detection accuracy and reduce the success of phishing websites in stealing information. Phishing is generally detected using Machine Learning methods with different kinds of algorithms. In this study, our aim is to use Machine Learning to detect phishing websites. We used a dataset from Kaggle consisting of 87 features and 11,430 URLs, half of which are phishing and half legitimate. We trained our models using Decision Tree (DT), Random Forest (RF), XGBoost, Multilayer Perceptrons, K-Nearest Neighbors, Naive Bayes, AdaBoost, and Gradient Boosting, and reached the highest accuracy of 96.4% using XGBoost.

Index Terms—Phishing websites, legitimate websites, features, machine learning, detection

I. INTRODUCTION

With technological advances, Internet-connected devices and their applications have gained huge popularity around the world. Although they are considered a new technology, IoT devices have attracted increased attention for their security issues, as have other Internet-connected computers. Many studies have tried to solve these issues, and machine learning algorithms are mostly preferred in their implementations [1-3]. Phishing is a well-known attack that deceives people into viewing harmful content and revealing personal information. Most phishing webpages imitate the interface and uniform resource locator (URL) of legitimate websites. An intelligent strategy to safeguard users from cyber-attacks is in high demand. We present a URL detection technique based on machine learning approaches in this research. To detect phishing websites using URLs, a recurrent neural network approach is used. As phishing attempts become more common, our study intends to improve cyber-attack detection rates by providing high detection accuracy and low false-negative and false-positive rates. The false-negative rate refers to phishing sites that are mistaken for valid sites, while the false-positive rate refers to legitimate sites that are mistaken for phishing sites.

Fig. 1 depicts a simple explanation of phishing. Phishing begins when a victim visits a website and then clicks on an external link, such as an advertisement or a pop-up link. If the user clicks on a phishing link, the user is taken to the phishing website. The attacker then takes the victim's information and uses it to gain access to other legitimate websites.

Fig. 1. Phishing Life Cycle.

To detect this type of phishing attack, many different detection processes have been suggested and implemented in the literature. The simplest approach is the use of signature-based/rule-based detection methods [4]. In this approach, the signature of the phishing attack is stored in a list. For phishing detection, this signature can be, for example, the URL address. Although this approach is efficient in terms of implementation and detection time, it performs poorly in detecting new attacks. Therefore, newer research has focused on a detection model that defines normal behavior and tries to discriminate abnormal cases from it. In this approach, traditional machine learning models or, in some cases, deep learning models are generally preferred [5-7]. Additionally, some authors emphasize that looking only at the name of the web page is not sufficient. Therefore, the content should be investigated using text-based tools in content-based approaches [8, 9]. However, this analysis first requires downloading the whole page, which increases detection time.

In this project, we will follow multiple steps to test whether websites are phishing or legitimate. First, we will collect the dataset from Kaggle. Then we will use the Python programming language to define the functions for each attribute in the dataset that
is made to test different features of the website. Then we will build models such as Support Vector Machine (SVM), Multilayer Perceptrons (MLP), Random Forest Tree (RFT), XGBoost, Naive Bayes (NB), K-Nearest Neighbors (KNN), AdaBoost, Gradient Boosting, and Decision Tree (DT). Next, we will train the models using the dataset from Kaggle. Finally, we will test the models using random URLs.

The rest of the paper is organized as follows. Section II reviews related work on classifying phishing URLs. Section III explains the data collection and methods in depth, whereas Section IV presents the test findings. Section V summarizes the conclusions of the current study and the aims for future development.

II. RELATED WORK

Phishing websites are websites that appear authentic but are meant to capture personal information from visitors. Because phishing has emerged as a significant issue in recent years, a great deal of effort has been put into developing methods to identify fraudulent websites. The following is an overview of some of these studies.

In 2020, Alsariera, Adeyemo, Balogun, and Alazzawi used four distinct meta-learner models to recognize phishing websites: AdaBoost-Extra Tree (ABET), Bagging-Extra Tree (BET), Rotation Forest-Extra Tree (RoFBET), and LogitBoost-Extra Tree (LBET) [10]. Compared with other ML-based models, these models perform better at detecting phishing attacks. Phishing website datasets that included the most recent features were used to fit and evaluate the models. The study achieved an accuracy of almost 98 percent, along with a low false-positive rate of 0.018 (using the ABET approach) and a low false-negative rate of 0.033 (using the LBET method). According to the findings, each of the four methods performed very well.

In 2021, Tang and Mahmoud presented a deep learning-based framework for identifying phishing websites, implemented as a browser plug-in (a Google Chrome extension). The plug-in gives the user real-time results regarding the phishing risk of a website and displays a warning if a phishing website is detected [11]. The approach, which consists of whitelist filtering, blacklist interception, and machine learning (ML) prediction, was devised to improve accuracy, reduce the number of false positives (FP), and shorten the computation time. The study incorporated seven datasets derived from four pre-existing data sources. Support Vector Machine (SVM), Logistic Regression, Random Forest, and RNN-GRU were the models used. The RNN-GRU model was found to have the highest accuracy, at 99.18 percent.

In 2019, Yang et al. described a multidimensional feature phishing detection (MFPD) methodology based on a rapid detection method applying deep learning, in a paper titled "Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning." The method was motivated by the wide variety of forms that phishing attacks can take [12]. The authors began by extracting character sequence features from the given URL for use in deep learning classification. At this stage, the method of dynamic category choice (DCDA) was applied to reduce the time needed for detection. The next step was to combine the statistical characteristics of URLs, the characteristics of webpage code, the characteristics of webpage text, and the results of the rapid deep learning classification into multidimensional features. This step allowed the precision of the detection to be verified. A dataset containing millions of phishing and benign URLs was used for testing. The method achieves an accuracy of 98.99 percent with a false-positive rate of only 0.59 percent. According to the results of the study, the performance was adequate in accuracy, false-positive rate, and speed.

In 2020, Ali and Malebary devised a method for improving the accuracy of phishing website identification using particle swarm optimization (PSO) [13]. During PSO-based selection of website features, the significance of each feature in distinguishing phishing from legitimate websites is taken into consideration. With the PSO method, between 7 and 57 percent of unnecessary features are eliminated from a training dataset. The following learning methods were considered: back-propagation neural network (BPNN), support vector machine (SVM), k-nearest neighbor (KNN), decision tree (C4.5), random forest (RF), and naive Bayes classifier (NB). Compared with the standalone machine learning models, the models that used the PSO approach produced better accuracy as well as improvements in TPR, TNR, FPR, and FNR, while using a significantly reduced number of website features.

In 2018, Pham, Nguyen, Tran, Huh, and Hong proposed a neuro-fuzzy framework called Fi-NFN. An anti-phishing model is built into this framework to protect users of fog networks from phishing attempts [14]. The model, developed specifically for fog networks, uses the detection capabilities of the blacklist/whitelist and web-structure approaches. The model is able to recognize phishing without the need for IF-THEN rules because, in
addition to recognizing phishing URLs in real time, it also contains designed input features that are able to recognize phishing websites. The input features consist of three URL features (PrimaryDomain, SubDomain, and PathDomain) and three web traffic features (PageRank, AlexaReputation, and GoogleIndex). The approach improved detection performance and phishing detection accuracy compared with existing approaches. According to the findings, the Fi-NFN approach is superior to other methods in accuracy, consistency, and efficiency.

In 2021, Abdullateef and other researchers [15] proposed Functional Tree (FT)-based meta-learning models for identifying phishing websites. The study carried out an empirical investigation of FT with the aim of improving phishing website identification. FT combines a decision tree with a linear function through constructive induction, which results in a decision tree with multivariate decision nodes and leaf nodes that use discriminant functions to generate predictions. The study aimed to implement the FT algorithm and its variants to detect benign and phishing websites, to apply Bagging, Boosting, and Rotation Forest meta-learners to improve FT performance, and to compare the proposed methods with existing phishing detection methods. The results demonstrated that FT and its variants are superior to other models in the majority of situations. Compared with other methods, FT has a lower false-positive rate while maintaining higher overall accuracy. The study showed that meta-learners enable the construction of phishing website detection models that are more accurate and consistent.

The effectiveness of the machine learning-based large-scale phishing detection method proposed by Liu, Geng, Jin, and Wang in 2021 was demonstrated through two types of experiments: comparative experiments on various features and detection models on CASE, and a one-year phishing discovery experiment in the real web environment [16]. CASE is an acronym for "Counterfeiting, Affiliation, Stealing, and Evaluation," and the framework encompasses both machine learning and deep learning strategies based on a constructed, complex dataset. CASE takes into account the spoofing nature of phishing and covers the corresponding feature space. It also ensures the discrimination and generalization of features and provides feature-level support for more effective phishing detection. In addition, the authors suggest a multistage detection model that combines accurate recognition with efficient filtering. The constructed dataset consists of websites obtained from a wide variety of sources, covering a range of languages, levels of content quality, and brands, with a significant number of both good and bad examples. According to the findings, the model improves detection outcomes by reducing execution time and improving overall performance.

Self-attention CNN is a deep learning strategy that couples a CNN with multi-head self-attention and has a high success rate in identifying phishing websites. It was presented by Xiao, Xiao, Zhang, and others in 2021 [17]. First, to ensure that the training dataset is representative of the real world, the authors generate phishing URLs with a GAN. Next, they build a new deep network that makes use of CNN and multi-head self-attention to identify phishing websites.

III. METHODOLOGY

A. Dataset

The dataset used in this project was collected from Kaggle. It consists of 11,430 URLs, of which 50% are legitimate and 50% are phishing, and it contains a total of 87 features. The features fall into 3 classes: 56 URL features, 24 features extracted from the page, and 7 external features. The detailed feature explanation is listed in table 1. Some feature values, such as the existence of certain symbols in the URL (for example, parentheses), are binary, where 0 represents legitimate and 1 represents phishing websites. Other features are represented as decimals; beyond a specific threshold, such a feature value indicates a phishing rather than a legitimate website. To make sure the dataset was in good shape, we checked for null values in order to delete them. There is one target feature in the dataset. This target feature is binary: 1 represents phishing and 0 represents legitimate websites. After checking and understanding the dataset, we divided it into training and test sets with a ratio of 75:25 percent. The next step is to train the models on the dataset, which is explained in part B.

B. Classifiers

This project used the above dataset to compare the performance of nine models: Support Vector Machine (SVM), Multilayer Perceptrons (MLP), Random Forest Tree (RFT), XGBoost, Naive Bayes (NB), K-Nearest Neighbors (KNN), AdaBoost, Gradient Boosting, and Decision Tree (DT).

The Support Vector Machine (SVM) algorithm is used to find a hyperplane in an N-dimensional space.
In this context, N refers to the total number of features. The hyperplanes that are found serve as decision boundaries that classify the data points. The dimension of the hyperplane depends on the number of features. Given that there are 87 different features, the situation at hand is rather challenging. The data points located closest to the hyperplane are known as support vectors. Because they determine the position of the hyperplane, they play a crucial role in the construction of the SVM. The objective of the SVM is to achieve a margin that is as large as possible between the data points and the hyperplane [21].

The goal of the SVM algorithm is to find the optimal line or decision boundary that divides the n-dimensional space into classes, so that new data points can easily be placed in the appropriate category in the future.

Multilayer Perceptrons (MLP) are one kind of neural network. The MLP has three types of layers: the input layer, the hidden layer, and the output layer. The input layer receives the signal that begins the training; the hidden layer, which can be made up of many layers, is the main engine of the MLP; and the output layer is in charge of the required task, which in this case is determining whether or not the website in question is a phishing site. The neurons of an MLP are trained with the back-propagation learning algorithm [22].

The number of units in the input layer of the MLP is equal to the number of features used to identify phishing websites, while the number of units in the output layer is equal to the number of classes. MLP is used to solve nonlinear problems, and it is suitable for our investigation because we have a large number of features.

RFT stands for Random Forest Tree, a machine learning algorithm that can be applied to classification and regression problems. It is a supervised learning technique based on ensemble learning. Ensemble learning refers to the process of combining multiple classifiers in order to solve difficult problems, which makes RFT effective at enhancing the overall performance of the model. Random Forest was also chosen because, in comparison to other algorithms, it requires significantly less training time. The RFT process has 5 steps. First, K data points are selected at random from the training set. Second, decision trees are built on the selected data points (subsets). Third, the number N of decision trees to be built is chosen. Fourth, steps 1 and 2 are repeated. Fifth, for new data points, the prediction of each decision tree is collected and each point is assigned to the category that wins the majority vote [23].

XGBoost is short for Extreme Gradient Boosting, a machine learning library that implements gradient-boosted decision trees (GBDT). XGBoost provides parallel tree boosting. It is most commonly applied to problems involving regression, classification, and ranking. We decided to use XGBoost because it is optimized for speed and performance.

The Naive Bayes method is a supervised learning approach to classification problems based on Bayes' theorem. It is mostly used for text and image classification with large training datasets. The Naive Bayes classifier is a straightforward classification method that has been shown to be both effective and efficient. It helps build fast machine learning models that can make accurate predictions quickly.

K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms based on the supervised learning technique. The KNN algorithm assumes that a new case is similar to the available cases and places the new case in the category that is most similar to the available categories.

AdaBoost is also known as Adaptive Boosting. It is a machine learning algorithm that uses the ensemble method, combining many weak classifiers into a stronger one. The AdaBoost algorithm can be broken down into three phases. First, it builds a model and assigns a weight to each of the available data points. After that, it gives incorrectly classified points a higher weight than the other points. Finally, in the following model, the points with a higher weight are given priority. AdaBoost keeps training models until an acceptable level of error has been achieved.

Gradient Boosting classifiers are a group of machine learning algorithms that create a robust predictive model by combining a large number of weak learning models. Decision trees are typically used as the weak learners in gradient boosting. Boosting algorithms play a significant role in managing the bias-variance trade-off. Unlike bagging algorithms, which only control for high variance in a model, boosting controls both aspects (bias and variance) and is therefore considered more effective.

Decision trees are a form of supervised learning that can be applied to both classification and regression problems; in most circumstances, however, they are best suited to classification. A decision tree is a tree-structured classifier in which the internal nodes represent the features of the dataset, the branches represent the decision rules, and each leaf node represents the outcome of the classification.

C. Evaluate the models

In this stage, we evaluate the models by feeding the predictor variables of the dataset to each model; the models then predict the target variable, and we compare the predictions with the real values. Fig. 2 shows the learning model of the proposed work.
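The evaluation step described in this subsection (predict the target variable, then compare the predictions with the real values) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the data are synthetic two-feature samples rather than the Kaggle dataset, the classifier is a small hand-written K-Nearest Neighbors model, and the helper names (make_synthetic_data, split_train_test, knn_predict, evaluate) are hypothetical and do not come from the paper. Only the 75:25 split and the 0/1 labeling follow the text.

```python
import random
from collections import Counter


def make_synthetic_data(n_per_class=200, seed=0):
    """Toy stand-in for the dataset (assumption): two numeric features per
    sample, label 0 = legitimate, label 1 = phishing. Not the Kaggle data."""
    rng = random.Random(seed)
    centers = {0: (0.0, 0.0), 1: (3.0, 3.0)}
    data = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    rng.shuffle(data)
    return data


def split_train_test(data, train_fraction=0.75):
    """Split already-shuffled data into train and test parts (75:25 by default)."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


def knn_predict(train, x, k=5):
    """Predict the majority label among the k training samples nearest to x."""
    def squared_distance(sample):
        features, _ = sample
        return sum((a - b) ** 2 for a, b in zip(features, x))

    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def evaluate(train, test, k=5):
    """Compare predictions with the real labels and count the four outcomes."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for features, actual in test:
        predicted = knn_predict(train, features, k)
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["TN" if actual == 0 else "FN"] += 1
    total = sum(counts.values())
    # Accuracy is the fraction of test samples that were predicted correctly.
    accuracy = (counts["TP"] + counts["TN"]) / total
    return counts, accuracy


data = make_synthetic_data()
train, test = split_train_test(data)
counts, accuracy = evaluate(train, test)
print(f"train={len(train)} test={len(test)} counts={counts} accuracy={accuracy:.3f}")
```

The same evaluate step could be repeated for each classifier under comparison; keeping the four outcome counts, rather than accuracy alone, also makes it possible to report false-positive and false-negative rates.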
TABLE I
COMPARISON OF METHODS

ML Model             Train Accuracy   Test Accuracy
Decision Tree        0.979            0.937
Random Forest        0.992            0.964
XGBoost              0.996            0.964
MLP                  0.825            0.811
KNN                  0.931            0.832
Naive Bayes          0.734            0.738
AdaBoost             0.954            0.953
Gradient Boosting    0.993            0.964
SVM                  0.600            0.596
Fig. 2. Learning model of the proposed system.
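The values in Table I can also be compared programmatically. The short Python sketch below transcribes the table and computes, for each model, the difference between train and test accuracy; a large positive difference is often read as a hint of overfitting, although that interpretation is an assumption of this sketch and not a claim made in the paper. The variable names are illustrative only.

```python
# Accuracies transcribed from Table I: model -> (train accuracy, test accuracy).
TABLE_I = {
    "Decision Tree": (0.979, 0.937),
    "Random Forest": (0.992, 0.964),
    "XGBoost": (0.996, 0.964),
    "MLP": (0.825, 0.811),
    "KNN": (0.931, 0.832),
    "Naive Bayes": (0.734, 0.738),
    "AdaBoost": (0.954, 0.953),
    "Gradient Boosting": (0.993, 0.964),
    "SVM": (0.600, 0.596),
}

# Train-minus-test gap per model, rounded to the precision of the table.
gaps = {model: round(train - test, 3) for model, (train, test) in TABLE_I.items()}

# Highest test accuracy and every model that reaches it.
best_test = max(test for _, test in TABLE_I.values())
best_models = sorted(m for m, (_, test) in TABLE_I.items() if test == best_test)

# Model with the largest train-test gap.
largest_gap_model = max(gaps, key=gaps.get)

# Print the models ranked by test accuracy, highest first.
ranked = sorted(TABLE_I.items(), key=lambda item: item[1][1], reverse=True)
for model, (train, test) in ranked:
    print(f"{model:<18} train={train:.3f} test={test:.3f} gap={gaps[model]:+.3f}")
print("highest test accuracy:", best_test, "reached by", best_models)
print("largest train-test gap:", largest_gap_model)
```

At the three-decimal precision reported in Table I, more than one model reaches the highest test accuracy, so finer-grained metrics such as recall and precision are useful for separating them.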
IV. EXPERIMENTAL RESULTS

Accuracy is perhaps the best-known way to validate a machine learning model for classification tasks. Its widespread use can be attributed, in part, to its simplicity: it is easy to understand and easy to put into practice. Accuracy is a helpful criterion for evaluating the performance of a model in simple scenarios, and it can be derived from Equations 1 and 2 below. Accuracy gives the proportion of predictions that are correct. To calculate it, we divide the number of correct predictions by the total number of predictions. The accuracy of the evaluated models is given in Table I.

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \tag{1}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

where TP: True Positives, FP: False Positives, TN: True Negatives, FN: False Negatives.

We applied the proposed phishing website prediction model with a variety of classification techniques for the purpose of experimental evaluation. We measured the accuracy of predicting whether a URL belongs to a phishing website using each algorithm. The XGBoost classifier proved to be the most effective, with a testing accuracy of 96.4 percent.

V. CONCLUSIONS AND FUTURE WORK

In this project, we used machine learning methods to decrease the success rate of phishing websites by detecting them. We selected our dataset from Kaggle; it contains 11,430 URLs with a 50:50 ratio of phishing to legitimate websites and 87 features. The dataset differs from other datasets in its high number of features, which helped increase the accuracy rate. We used 9 machine learning algorithms for training on our dataset: XGBoost, Gradient Boosting, Random Forest Tree, AdaBoost, Decision Tree, K-Nearest Neighbors, Multilayer Perceptrons, Naive Bayes, and Support Vector Machine. Among these algorithms, XGBoost had the highest test accuracy of 96.4%, with a recall of 96.3% and a precision of 96.5%.

Our results show that XGBoost is the best algorithm for detecting phishing websites among those we tested. Given this accuracy, we believe our approach compares well with previous projects. In future work, we intend to increase the accuracy rate by increasing the dataset size, so big data may be used for the implementation. However, the use of big data calls for deep learning approaches, as mentioned in [24], and decreasing the training time requires parallel execution of the code, especially with the use of GPU technologies [25].

REFERENCES

[1] Leon Reznik, "Computer Security with Artificial Intelligence, Machine Learning, and Data Science Combination," in Intelligent Security Systems: How Artificial Intelligence, Machine Learning and Data Science Work For and Against Computer Security, IEEE, 2022, pp. 1-56, doi: 10.1002/9781119771579.ch1.
[2] O. K. Sahingoz, U. Cekmez and A. Buldu, "Internet of Things (IoTs) Security: Intrusion Detection using Deep Learning," Journal of Web Engineering, vol. 20, iss. 6, pp. 1721-1760, 2021, doi: 10.13052/jwe1540-9589.2062.
[3] R. Yetis and O. K. Sahingoz, "Blockchain Based Secure Communication for IoT Devices in Smart Cities," 2019 7th International Istanbul Smart Grids and Cities Congress and Fair (ICSG), 2019, pp. 134-138, doi: 10.1109/SGCF.2019.8782285.
[4] A. Awasthi and N. Goel, "Generating Rules to Detect Phishing Websites Using URL Features," 2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON), 2021, pp. 1-9, doi: 10.1109/ODICON50556.2021.9429003.
[5] M. Korkmaz, O. K. Sahingoz and B. Diri, "Detection of Phishing Websites by Using Machine Learning-Based URL Analysis," 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020, pp. 1-7, doi: 10.1109/ICCCNT49239.2020.9225561.
[6] L. Tang and Q. H. Mahmoud, "A Deep Learning-Based Framework for Phishing Website Detection," in IEEE Access, vol. 10, pp. 1509-1521, 2022, doi: 10.1109/ACCESS.2021.3137636.
[7] M. Korkmaz, O. K. Sahingoz and B. Diri, "Feature Selections for the Classification of Webpages to Detect Phishing Attacks: A Survey," 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 2020, pp. 1-9, doi: 10.1109/HORA49412.2020.9152934.
[8] E. Kocyigit, M. Korkmaz, O. K. Sahingoz and B. Diri, "Real-Time Content-Based Cyber Threat Detection with Machine Learning," in A. Abraham, V. Piuri, N. Gandhi, P. Siarry, A. Kaklauskas and A. Madureira (eds), Intelligent Systems Design and Applications, ISDA 2020, Advances in Intelligent Systems and Computing, vol. 1351, Springer, Cham, 2021, https://doi.org/10.1007/978-3-030-71187-0_129.
[9] U. Ozker and O. K. Sahingoz, "Content Based Phishing Detection with Machine Learning," 2020 International Conference on Electrical Engineering (ICEE), 2020, pp. 1-6, doi: 10.1109/ICEE49691.2020.9249892.
[10] Y. A. Alsariera, V. E. Adeyemo, A. O. Balogun and A. K. Alazzawi, "AI Meta-Learners and Extra-Trees Algorithm for the Detection of Phishing Websites," in IEEE Access, vol. 8, pp. 142532-142542, 2020, doi: 10.1109/ACCESS.2020.3013699.
[11] L. Tang and Q. H. Mahmoud, "A Deep Learning-Based Framework for Phishing Website Detection," in IEEE Access, vol. 10, pp. 1509-1521, 2022, doi: 10.1109/ACCESS.2021.3137636.
[12] P. Yang, G. Zhao and P. Zeng, "Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning," in IEEE Access, vol. 7, pp. 15196-15209, 2019, doi: 10.1109/ACCESS.2019.2892066.
[13] W. Ali and S. Malebary, "Particle Swarm Optimization-Based Feature Weighting for Improving Intelligent Phishing Website Detection," in IEEE Access, vol. 8, pp. 116766-116780, 2020, doi: 10.1109/ACCESS.2020.3003569.
[14] C. Pham, L. A. T. Nguyen, N. H. Tran, E.-N. Huh and C. S. Hong, "Phishing-Aware: A Neuro-Fuzzy Approach for Anti-Phishing on Fog Networks," in IEEE Transactions on Network and Service Management, vol. 15, no. 3, pp. 1076-1089, Sept. 2018, doi: 10.1109/TNSM.2018.2831197.
[15] Abdullateef O. et al., "Improving the phishing website detection using empirical analysis of Function Tree and its variants," Heliyon, vol. 7, iss. 7, 2021, e07437, https://doi.org/10.1016/j.heliyon.2021.e07437.
[16] Dong-Jie Liu, Guang-Gang Geng, Xiao-Bo Jin and Wei Wang, "An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment," Computers & Security, vol. 110, 2021, 102421, ISSN 0167-4048, https://doi.org/10.1016/j.cose.2021.102421.
[17] Xi Xiao, Wentao Xiao, Dianyan Zhang, Bin Zhang, Guangwu Hu, Qing Li and Shutao Xia, "Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets," Computers & Security, vol. 108, 2021, 102372, ISSN 0167-4048.
[18] A. V. Ramana, K. L. Rao and R. S. Rao, "Stop-Phish: an intelligent phishing detection method using feature selection ensemble," Soc. Netw. Anal. Min., vol. 11, 110, 2021, https://doi.org/10.1007/s13278-021-00829-w.
[19] M. SatheeshKumar, K. G. Srinivasagan and G. UnniKrishnan, "A lightweight and proactive rule-based incremental construction approach to detect phishing scam," Inf. Technol. Manag., 2022, https://doi.org/10.1007/s10799-021-00351-7.
[20] A. Hannousse and S. Yahiouche, "Towards benchmark datasets for machine learning based website phishing detection: An experimental study," Engineering Applications of Artificial Intelligence, vol. 104, 2021, 104347, ISSN 0952-1976.
[21] R. Gandhi, "Support Vector Machine - introduction to machine learning algorithms," Medium, July 5, 2018. Retrieved May 18, 2022, from https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.
[22] "Multilayer Perceptron - an overview," ScienceDirect Topics, n.d. Retrieved May 18, 2022, from https://www.sciencedirect.com/topics/computer-science/multilayer-perceptron.
[23] "Machine learning random forest algorithm," javatpoint, n.d. Retrieved May 18, 2022, from https://www.javatpoint.com/machine-learning-random-forest-algorithm.
[24] S. C. Kalkan and O. K. Sahingoz, "Deep Learning Based Classification of Malaria from Slide Images," 2019 Scientific Meeting on Electrical-Electronics Biomedical Engineering and Computer Science (EBBT), 2019, pp. 1-4, doi: 10.1109/EBBT.2019.8741702.
[25] S. I. Baykal, D. Bulut and O. K. Sahingoz, "Comparing deep learning performance on BigData by using CPUs and GPUs," 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), 2018, pp. 1-6, doi: 10.1109/EBBT.2018.8391429.
