
Neural Computing and Applications (2019) 31:3851–3873
https://doi.org/10.1007/s00521-017-3305-0

ORIGINAL ARTICLE

Detection of phishing websites using an efficient feature-based machine learning framework

Routhu Srinivasa Rao (routhsrinivas.cs15fv13@nitk.edu.in) · Alwyn Roshan Pais (alwyn@nitk.ac.in)
Information Security Research Lab, National Institute of Technology Karnataka, Surathkal, India

Received: 19 July 2016 / Accepted: 27 December 2017 / Published online: 6 January 2018
© The Natural Computing Applications Forum 2018

Abstract
Phishing is a cyber-attack which targets naive online users, tricking them into revealing sensitive information such as a username, password, social security number or credit card number. Attackers fool Internet users by masking a webpage as a trustworthy or legitimate page to retrieve personal information. Many anti-phishing solutions, such as blacklist- or whitelist-, heuristic- and visual similarity-based methods, have been proposed to date, but online users are still getting trapped into revealing sensitive information on phishing websites. In this paper, we propose a novel classification model based on heuristic features extracted from the URL, the source code, and third-party services to overcome the disadvantages of existing anti-phishing techniques. Our model has been evaluated using eight different machine learning algorithms, out of which the Random Forest (RF) algorithm performed the best with an accuracy of 99.31%. The experiments were repeated with different (orthogonal and oblique) Random Forest classifiers to find the best classifier for phishing website detection. Principal component analysis Random Forest (PCA-RF) performed the best out of all oblique Random Forests (oRFs) with an accuracy of 99.55%. We have also tested our model with and without the third-party-based features to determine the effectiveness of third-party services in the classification of suspicious websites. We also compared our results with the baseline models (CANTINA and CANTINA+). Our proposed technique outperformed these methods and also detected zero-day phishing attacks.

Keywords Cyber-attack · Phishing · Anti-phishing · Heuristic technique · Machine learning algorithms · Random Forest · Oblique Random Forest

1 Introduction

In this cyber world, most people communicate with each other either through a computer or a digital device connected over the Internet. The number of people using e-banking, online shopping and other online services has been increasing due to the convenience, comfort, and assistance they offer. An attacker takes this situation as an opportunity to gain money or fame and steals the sensitive information needed to access online service websites. Phishing is one of the ways to steal sensitive information from users. It is carried out with a mimicked page of a legitimate site, directing the online user into providing sensitive information. The term phishing is derived from the concept of 'fishing' for victims' sensitive information [1]. The attacker sends a bait in the form of a mimicked webpage and waits for the outcome of sensitive information. The replacement of 'f' with the 'ph' phoneme is influenced by phone phreaking, a common technique to unlawfully explore telephone systems. The attacker is successful when he makes a victim trust the fake page and gains his/her credentials for the mimicked legitimate site.

The Anti-Phishing Working Group (APWG) is a non-profit organization which examines phishing attacks reported by its member companies, such as iThreat Cyber Group, Internet Identity (IID), MarkMonitor, Panda Security and Forcepoint. It analyzes the attacks and publishes reports periodically (quarterly and half-yearly). It also provides statistical information on malicious domains and phishing attacks taking place in the world. According to the latest APWG Q4 2016 report [2], 1,220,523 phishing attacks were recorded in 2016, confirmed to be the highest in any year since the APWG began monitoring in 2004. The statistics from 2010 to 2016 can be seen in Fig. 1.

According to [3], online users fall for phishing due to various factors such as:

1. Inadequate knowledge of computer systems.
2. Inadequate knowledge of security and security indicators. (In the current scenario, even the indicators are being spoofed by the phishers.)
3. Inadequate attention to warnings, proceeding further while underestimating the strength of existing tools. (abnormal behavior of toolbars)
4. Inadequate attention to visually deceptive text in the URL and website content. (For example, the text paypal is replaced with paypa1 in the URL in the address bar, or in the copyright or title of the phishing website.)

Phishing attacks take place in various forms such as email, websites and malware. To perform email phishing, attackers design fake emails which claim to arrive from a trusted company. They send fake emails to millions of online users assuming that at least thousands of legitimate users will fall for them.

In website phishing, the attacker builds a website which looks like a replica of the legitimate site and draws the online user to the website either through advertisements on other websites or through social networks such as Facebook and Twitter. Some attackers even manage to equip their phishing websites with security indicators such as a green padlock, an HTTPS connection etc. In Fig. 2, we can see that an attacker is able to obtain HTTPS and a green padlock for his phishing website. Hence, an HTTPS connection no longer guarantees the legitimacy of a website.

In malware phishing, the attacker inserts malicious software such as a Trojan horse into a compromised legitimate site without the knowledge of a victim. According to the APWG (2016) report [4], 20 million new malware samples were captured in the first quarter of 2016. The vast majority of recent malware is multifunctional, i.e., it steals information, makes the victim's system part of a botnet, or downloads and installs further malicious software without the client's notification [5].

Spear phishing [6] targets a specific group of people or a community belonging to an organization or a company. The attackers send emails which pretend to be sent by a colleague, manager or a higher official of the company requesting sensitive data related to the company. The main intention of general phishing is financial fraud, whereas spear phishing aims at the collection of sensitive information. Whaling [6] is a type of spear phishing where attackers target bigger fish such as executive officers or other high-profile targets in private business, government agencies or other organizations.

There are many anti-phishing techniques proposed in the literature to detect and prevent phishing attacks. We have categorized these anti-phishing techniques into four categories.

Fig. 1 Phishing attacks from 2010 to 2016
Fig. 2 Phishing website of PayPal with HTTPS protocol
• List-based techniques: Most modern browsers such as Chrome, Firefox and Internet Explorer follow list-based techniques to block phishing sites. There are two types of list-based techniques: whitelists and blacklists. A whitelist [7] contains a list of legitimate URLs which can be accessed by the browser. The browser downloads a website only if its URL is present in the whitelist. Due to this behavior, even legitimate websites which are not whitelisted are blocked, resulting in high false positives. A blacklist [8, 9] contains phishing or malicious URLs for which the browser blocks the download of the webpage. Due to this behavior, phishing sites which are not blacklisted are still downloaded by the browser, resulting in high false negatives. These non-blacklisted phishing sites are also called zero-day phishing sites [10]. A small change in the URL is sufficient to bypass list-based techniques, so frequent updates of these lists are mandatory to counter new phishing sites.

• Heuristic-based techniques [11–15]: These techniques use features extracted from the phishing website to detect a phishing attack. Some phishing sites do not exhibit the common features, resulting in a poor detection rate with this mechanism. As this approach does not rely on list-based comparison, it results in fewer false positives and fewer false negatives, but it has lower accuracy compared to list-based techniques since there is no guarantee that these features exist in all phishing websites. An attacker can bypass the heuristic features once he knows the algorithm or features used in detecting phishing sites, thereby reaching his goal of stealing sensitive information. This technique detects zero-day phishing attacks which the list-based techniques fail to detect.

• Visual similarity-based approaches [16–19]: The main objective of the phisher is to deceive the user by designing an exact image of the legitimate site such that the user does not get suspicious of the phishing site. Hence, these anti-phishing techniques compare the suspicious website image with a legitimate image database to obtain a similarity ratio, which is used for the classification of suspicious websites. The website is classified as phishing when the similarity score is greater than a certain threshold; otherwise it is treated as legitimate. The limitations of this technique are as follows: (1) image comparison of the suspicious website with the entire legitimate image database has high time complexity; (2) considerable space is needed to store the legitimate image database; (3) comparing an animated webpage with a phishing website leads to a low similarity percentage, which leads to a high false negative rate. This technique also fails when the background of the web page is slightly changed without deviating from the visual appearance of the legitimate site.

• Machine learning-based techniques: Nowadays, most researchers are concentrating on the use of machine learning (ML) algorithms applied to features extracted from websites to detect phishing attacks. These techniques are a combination of heuristic methods and machine learning algorithms, i.e., the dataset used by the machine learning algorithms is extracted through heuristic methods. Some of the machine learning algorithms are sequential minimal optimization (SMO), the J48 tree, Random Forest (RF), logistic regression (LR), the multilayer perceptron (MLP), the Bayesian network (BN), the support vector machine (SVM) and AdaBoostM1. As ML-based techniques are based on heuristic features, they are able to identify zero-day phishing attacks, which makes them advantageous over list-based techniques. These techniques work efficiently on large sets of data. According to [20], it is possible to achieve more than a 99% true positive rate and less than a 1% false positive rate. According to [21], ML-based classifiers are the classifiers which achieved more than 99% accuracy. This made us concentrate on using machine learning algorithms as classifiers. The other side of ML-based techniques is that their performance depends on the size of the training data, the feature set and the type of classifier.

Some of the existing techniques [22–24] use third-party-based features such as search engine, WHOIS or page ranking features to detect phishing attacks. These techniques fail when attackers use compromised domains for hosting their phishing pages. Figure 3 shows the original website of a compromised domain and a phishing page which is hosted on the compromised domain targeting the HM Revenue and Customs website. As per the statistics of the APWG 1H2014 report [25], the life spans of phishing sites have a median of less than 10 hours. This report also specifies that half of the phishing sites went offline in less than a day, but most phishing pages using compromised domains stay alive for more than a day on the Internet, as can be seen in Fig. 4. From the figure, we can also note that the phishing site is still alive after 9 days. Another APWG report [2] recorded a 65% increase in total phishing attacks compared to 2015. The statistics from Fig. 1 and the above APWG reports reveal that the existing techniques are not able to prevent phishing attacks. Hence, there is a need for designing better mechanisms for detecting phishing websites.

Fig. 3 Screenshot of compromised domain and a Phishing page
Fig. 4 Lifespan of phishing site in PhishTank

Heuristic techniques exploit the unique characteristics of phishing websites and consider them as a common parameter for detecting phishing websites. As heuristic techniques have the advantage of detecting zero-day attacks and have been used by major browsers and anti-virus software in real time [21], we have driven our approach toward the heuristic mechanism. This technique carries a risk of misclassifying legitimate sites as phishing sites (false positives), since the characteristics are not guaranteed to always exist in phishing sites. In this paper, we attempt to add new heuristic features (in addition to the existing features) with machine learning algorithms to reduce the false positives in detecting new phishing sites. We have also made an attempt to identify the machine learning algorithm that detects phishing sites with higher accuracy than the existing techniques. We have used eight machine learning algorithms (J48 decision tree, sequential minimal optimization (SMO), logistic regression (LR), multilayer perceptron (MLP), Bayesian network (BN), Random Forest (RF), support vector machine (SVM) and AdaBoostM1) to classify the websites as legitimate or phishing. Based on the experimental observations, Random Forest outperformed the others. The choice of these machine learning algorithms is based on the classifiers used in the recent literature [24, 26–29].

Oblique Random Forests (oRF) perform better than orthogonal Random Forests [30]. Hence, we compared RF with the novel oRF models proposed in [30–32] to identify the best Random Forest classifier. From the experimental comparison, we observed that principal component analysis Random Forest (PCA-RF) performed the best out of all oRFs. It should be noted that the orthogonal Random Forest is simply referred to as Random Forest throughout this paper. To the best of our knowledge, our work is the first to use oblique Random Forest (oRF) algorithms for the classification of phishing websites.

Our paper makes the following three research contributions. First, we propose five novel heuristic features (HF3, HF4, HF6, HF7, HF8) capturing key characteristics of phishing sites using the source code of a website. Second, we have experimentally shown that third-party services like the search engine, WHOIS database and Alexa page ranking complement the heuristic techniques to detect phishing. The results of our proposed technique illustrate that combining third-party-based features with the proposed features gives a classification accuracy of 99.3% with RF and 99.55% with PCA-RF as classifiers. Third, we propose a feature (HF7) which detects phishing pages even when the entire website is replaced with a single image. One major advantage of this method is that it does not require any legitimate image store for identifying single-image-based phishing, whereas traditional visual similarity anti-phishing techniques need an image data store to compare the logos or entire webpage screenshots.

The rest of the paper is organized as follows: Section 2 reviews a set of anti-phishing techniques based on the machine learning approach and also gives a brief overview of various oblique Random Forest algorithms (oRF). Section 3 explains the proposed work of our model. Implementation details are discussed in Sect. 4 and the evaluation of the various experiments is given in Sect. 5. The discussion is presented in Sect. 6, and limitations of the proposed approach are given in Sect. 7. Finally, the conclusion and future work are given in Sect. 8.

2 Related work

As our proposed method fits into the machine learning-based category, we describe some of the existing machine learning-based techniques that are used in the detection of phishing websites.

Pan and Ding [24] proposed an anti-phishing approach based on the extraction of website identity from DOM objects such as the title, keywords, request URLs, anchor URLs and main body. When a phishing site fakes a legitimate site, it claims a false identity, demonstrating abnormal behavior compared to the legitimate site. They experimented with ten features using a support vector machine classifier for the classification of the data. This approach fails when the attacker hosts a phishing page on a compromised domain.

The CANTINA [23] technique operates on the textual content present in the website. The term frequency-inverse document frequency (TF-IDF) algorithm is applied to the textual website content to extract words with high TF-IDF scores, which are fed to a search engine to identify a phishing site. It also includes additional heuristic features and a simple forward linear model for the classification of websites. As the performance of this approach depends on the textual content of the suspicious site, it fails when the textual content is replaced with images. It also has a performance problem with respect to time since it depends on a third-party service, the search engine.

Miyamoto et al. [28] evaluated the performance of machine learning algorithms in classifying websites as legitimate or phishing. The authors incorporated all the features of the CANTINA model to detect phishing websites. They employed nine machine learning algorithms for the classification of websites. Among all of them, the AdaBoostM1 algorithm performed the best with the lowest error rate of 14.15%. This technique has the same limitation as the CANTINA approach, i.e., it fails when the textual content is replaced with an image.

Xiang et al. [27] proposed a comprehensive feature-based approach which uses machine learning algorithms such as Bayesian networks for the classification of phishing websites. This technique is called CANTINA+ and is an extension to CANTINA including eight novel features such as HTML DOM features, third-party services and search
engine-based features. The authors divided the entire dataset into two parts: first, a unique testing phish dataset, which resulted in 92% detection accuracy when fed to the machine learning algorithm; second, a near-duplicate testing phish dataset, which resulted in 99% detection accuracy when fed to the machine learning algorithm. The authors used six learning algorithms to train the dataset and calculate the performance of the model. The Bayesian network algorithm performed the best in comparison with the other algorithms. This technique fails to detect phishing sites when the website's textual content is replaced by an image.

He et al. [26] proposed a heuristic-based technique which uses a combination of the CANTINA, Anomaly, and PILFER methods to detect phishing sites. The authors extracted twelve features from the suspicious website and fed them to a support vector machine for the classification of phishing and legitimate websites. As this technique adopted its features either from CANTINA or from [24], it faces the same limitation as the CANTINA technique. It also suffers from high time complexity in loading a webpage due to the use of third-party services like the search engine, page ranking etc.

Zhang et al. [29] proposed a model that uses the URL and website content to detect Chinese phishing e-business websites. The authors incorporated fifteen domain-specific features for detecting phishing attacks in Chinese e-business websites. They evaluated the model with four different machine learning algorithms for the classification of phishing sites. The sequential minimal optimization (SMO) algorithm performed best in detecting phishing sites with an accuracy of 95.83%. The limitation of this approach is that it targets only one domain of phishing sites, i.e., it may not work efficiently with non-Chinese websites.

Gowtham and Krishnamurthi [33] proposed a heuristic-based technique combined with a machine learning classifier to detect phishing sites. They used a private whitelist and a login form identifier filter to reduce the false positive rate and the computation cost of the system. A support vector machine algorithm is used as the classifier for the classification of phishing websites. The authors extracted 15 features from URL characteristics, URL identity, a keyword identity set and domain name analysis to form a feature vector. This feature vector is fed to the SVM to calculate the performance of the model. The proposed model achieved a true positive rate of 99.65% and a false positive rate of 0.42%. As this technique depends on the textual content of a website, it might not work effectively when the entire website content is replaced with a single image.

Mohammad et al. [34] presented an intelligent model for detecting phishing websites based on self-structuring neural networks. The authors extracted seventeen features from the URL and website content, which are used as input to artificial neural networks for the classification of websites. This technique has the advantage of adapting its neural network according to changes in phishing characteristics. For better performance, this model has to be retrained frequently with a fresh and up-to-date training dataset.

Chiew et al. [35] presented a visual similarity-based technique where the logo of a suspected website is extracted using a machine learning technique and fed to Google image search to retrieve the target identity. The suspected site is termed phishing when the actual domain is not equal to the domain returned by the Google image search result. This technique depends on the success ratio of logo extraction through machine learning. It fails in detecting phishing attacks on websites without a logo.

Moghimi and Varjani [36] proposed a rule-based method and developed it as a browser extension to detect phishing sites. This method is based on features extracted from website content and uses a support vector machine (SVM) algorithm to classify phishing websites. As this approach is solely dependent on website content, it fails when the attacker redesigns the website with modified visual content such that the brand name is misspelled or trimmed, changing all links to its local domain. This technique only targets phishing sites imitating legitimate bank websites.

Whittaker et al. [20] proposed a blacklist technique which extracts common features of known phishing websites and uses these features for the detection of phishing sites. They used a Random Forest classifier for the detection of phishing sites. The authors claim that RF performs best with noisy live phishing data and that they are able to achieve a false positive rate lower than 0.1%. As this technique depends on a blacklist of phishing data, it fails to detect phishing sites which are not blacklisted.

Aggarwal et al. [37] proposed a technique called PhishAri which detects phishing URLs in tweets. They used URL, WHOIS, tweet and network-based features with a Random Forest classifier for the detection of malicious URLs. As this technique solely depends on the URL text and not on the content of the website, it might fail to give effective results when the phishing URL is hosted on a compromised domain.

Table 1 provides a summary of these works in comparison to our work with respect to five features. These include detection of phishing sites that replace textual content with an image (image-based phishing), detection of phishing sites in which most of the hyperlinks are directed toward a common page (common page detection), detection of phishing sites that are hosted in any language (language independence), detection of phishing sites that contain a large number of broken links (broken links) and the classifiers used for the classification of phishing sites.

Table 1 Summary of related work in comparison with our work

Technique | Year | Image-based phishing | Common page based phishing | Language independence | Broken links | Classifier | Accuracy (%)
Pan and Ding [24] | 2006 | No | No | Yes | No | SVM | 90
CANTINA [23] | 2007 | No | No | No | No | Linear model | 94
Miyamoto et al. [28] | 2008 | No | No | No | No | Adaboost | 85.64
Whittaker et al. [20] | 2010 | No | No | Yes | No | Random Forest | 95.92
Xiang et al. [27] | 2011 | No | No | No | No | Bayesian network | 99.61
He et al. [26] | 2011 | No | No | No | No | SVM | 96.50
PhishAri [37] | 2012 | No | No | Yes | No | Random Forest | 92.52
Zhang et al. [29] | 2014 | No | No | No | No | SMO | 95.83
Gowtham and Krishnamurthi [33] | 2014 | No | No | Yes | No | SVM | 99.61
Mohammad et al. [34] | 2014 | No | No | Yes | No | Neural networks | 92.48
Chiew et al. [35] | 2016 | Yes | Yes | Yes | Yes | SVM | 93.4
Moghimi and Varjani [36] | 2016 | No | No | Yes | No | SVM | 98.65
Our Work | 2017 | Yes | Yes | Yes | Yes | Random Forest | 99.55
Attackers use common anchor links throughout the entire website so that anti-phishing techniques [14, 24, 26, 29] that depend on the NULL anchor (#) links feature are bypassed. Similarly, they insert broken anchor links of the local domain throughout the website to bypass the anti-phishing techniques [24, 26, 29, 34, 38] that depend on the foreign anchors feature. Using our proposed features, we are able to detect phishing sites which are not detected by existing techniques with very high accuracy, and hence they can be used for a better classification of phishing websites.

The major classifiers of the phishing detection models used in the literature are the support vector machine, neural networks, sequential minimal optimization and Adaboost. From Fig. 1, it is evident that phishing attacks are still on the rise despite the best efforts to detect them. Hence, the goal of our work is to find the best machine learning classifier that can improve the accuracy of the phishing detection system.

From [39], it is noted that ensembles perform better than any single classifier. An ensemble of classifiers is a set of classifiers whose individual decisions are combined, either by weighted or unweighted voting, to classify the testing data. The accuracy of the ensemble method depends on the diversity of the set of classifiers (i.e., they make different errors at new data points). The authors also reviewed various methods to generate ensembles that can be applied to different machine learning algorithms. The methods for constructing ensembles include manipulating the training dataset, the input features and the output targets, injecting randomness etc. The authors concluded that random trees perform best when encountered with noisy data due to the diversity of their training data.

A recent survey [40] explores recent developments in ensemble classification and regression methods and their applications. The survey includes traditional and state-of-the-art ensemble methods and categorizes them into (a) conventional methods such as bagging, boosting and Random Forest and (b) deep learning-based ensemble methods. It presents different variations of ensemble methods and also explores extended versions of Random Forests such as oblique Random Forests with ridge regression, support vector machines, principal component analysis (PCA) and linear discriminant analysis (LDA).

From [41], it is observed that Random Forest performs the best among all 179 classifiers on real-world classification problems. RF is also being used for detecting phishing emails [42, 43], phishing tweets [37] and malicious posts on Facebook [44]. Whittaker et al. [20] used a Random Forest classifier for maintaining Google's phishing blacklist pages automatically. Based on [20, 37, 42, 43], we chose the Random Forest classifier for the classification of data using our novel proposed features. In our proposed technique, we have also used different oblique Random Forests (oRF) for phishing detection. In the next subsection, we provide a brief overview of the different Random Forest algorithms.
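To make the voting idea concrete, the short sketch below shows unweighted majority voting over a set of base classifiers; the interface, the class name and the 0/1 labels (0 = legitimate, 1 = phishing) are assumptions of this sketch, not part of the authors' implementation.

import java.util.List;

// Sketch of unweighted majority voting over an ensemble of base
// classifiers; labels are assumed to be 0 (legitimate) and 1 (phishing).
public class MajorityVote {

    interface BaseClassifier {
        int predict(double[] features);
    }

    // The ensemble prediction is the class receiving the most votes.
    static int predict(List<BaseClassifier> ensemble, double[] features) {
        int phishingVotes = 0;
        for (BaseClassifier c : ensemble) {
            if (c.predict(features) == 1) {
                phishingVotes++;
            }
        }
        return (2 * phishingVotes > ensemble.size()) ? 1 : 0;
    }
}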
2.1 Random Forests

A Random Forest [45] works as a collection of decorrelated decision trees where the predicted class is the mode of the classes of the individual decision trees. Random Forest constructs a multitude of decision trees based on a random selection of data and a random selection of variables. The author classified Random Forests into orthogonal and oblique Random Forests. In the orthogonal Random Forest, each node of a tree creates an axis-aligned decision boundary, i.e., it splits the data with thresholds on individual features. In the latter case, the oRF splits the data with randomly oriented hyperplanes, i.e., hyperplanes that are not necessarily parallel to any axis of the feature space.

Zhang and Ponnuthurai [31] proposed a new method that improves the accuracy of Random Forest by increasing the diversity of each tree in the forest. Instead of transforming the whole dataset, the authors transformed the data feature space at each node into another space such that the transformation output at each node is totally different. The goal of the authors is to increase the diversity of each tree by transforming the feature space into a high-dimensional feature space and then constructing random trees with the transformed feature space. They introduced two types of Random Forests: first, PCA-based RF, which uses principal component analysis (PCA) to transform the data at each node; second, LDA-based RF, which uses linear discriminant analysis (LDA) to transform the data at each node. They also proposed a new RF with an ensemble of feature spaces including the PCA space, the LDA space and the original space. The experimental results showed that the RF with an ensemble of feature spaces outperformed the conventional RF.

Zhang and Ponnuthurai [32] have proposed an oblique random decision tree ensemble where a multisurface proximal SVM (MPSVM) [46] is employed on the decision trees for splitting the nodes of each tree, i.e., the decision hyperplane at each internal node is calculated by the MPSVM. Tikhonov and axis-parallel regularization strategies are proposed, and NULL space regularization is adopted from [47] to handle the small sample size problem. From the experimental results, it is shown that the MPSVM-based decision trees outperformed RF.

Menze et al. [30] proposed a method to improve the accuracy of RF by splitting the node of each tree at an oblique angle. The decision boundary at each internal node is identified using linear discriminative models. These node models are based on partial least squares regression (pls), ridge regression, ridge_slow regression, linear support vector machines (svm), logistic regression (log) and random hyperplanes (rnd). The node models are used to enhance the strength of the individual trees by performing oblique splits at each internal node. This method has been shown to deliver significant performance in the classification of data. It outperforms RF when trained with spectral and numerical data, but with factorial data RF outperforms the proposed oRF.

The analysis of these works in the literature made us drive toward the use of the Random Forest algorithm for training our model. Note that none of these oblique Random Forests had been used for the detection of phishing websites; this is the first attempt to use these techniques for the detection of phishing websites.

3 Proposed work

The architecture of the proposed method is given in Fig. 5. A set of webpage URLs is fed as input to the feature extractor, which extracts the most significant features for phishing detection. The extracted features are used to train the model with different machine learning algorithms for the classification of websites.

3.1 Feature extractor

We have implemented a Java program which uses the Jsoup library (http://jsoup.org/) to download and parse the source code from a given URL. By crawling the source code of a website, we extract the features which are used in the detection of phishing websites. We have classified the extracted features into three categories:

• URL Obfuscation features
• Third-Party-based features
• Hyperlink-based features

3.1.1 URL Obfuscation features

These are features found in the URL of a phishing website, which try to mislead the user in various ways. Before going on to the features, we describe the anatomy of a typical URL.

URL is an acronym for uniform resource locator, which specifies the address of a resource such as an HTML document, image, video clip, cascading style sheet etc. on the Web. It consists of three pieces: the protocol used to access the resource (HTTP), the name of the machine hosting the resource (hostname) and the path of the resource (pathname). The typical structure of a URL is as follows:
Fig. 5 Architecture of the proposed system

<Protocol>://<Subdomain>.<Primary domain>.<gTLD>.<ccTLD>/<Path domain>

where gTLD is the generic top-level domain and ccTLD is the country-code top-level domain. For example, in the URL http://www.reg.signin.nitk.com.pk/secure/login/web/index.php, we have the protocol as http, the subdomain as reg.signin, the primary domain as nitk, the gTLD as com, the ccTLD as pk, the domain as nitk.com.pk, the hostname as reg.signin.nitk.com.pk and the pathname as secure/login/web/index.php.

The existing features which identify visually deceptive text techniques in the URL of a website are described below. In particular, features UF2, UF4 and UF5 are adopted from [23, 48, 49]; feature UF3 is adopted from [48–50]; and feature UF1 is a variant of one feature in CANTINA [23].

1. UF1-Dots in Hostname: This feature counts the number of dots in the hostname of a URL to detect the status of a page as a phishing or legitimate site. Usually, a URL contains a primary domain, a subdomain, and a gTLD or ccTLD or both. The attackers add the domain of a legitimate website as the subdomain of the phishing website. Thus, users fall for phishing by seeing the legitimate domain in the phishing URL. For example, let us consider a phishing URL targeting the ebay website: http://www.reg.signin.ebay.secure.web.login.user.phish.com/index.html. Here, ebay is added to the subdomain of the phishing website (phish.com). The attacker includes a larger number of dots to hide the actual domain of the phishing site. The larger the number of dots in the hostname, the higher the probability of the website being a phishing site.

UF1 = number of dots in the hostname    (1)

2. UF2-URL with @ symbol: This feature checks for the presence of the special character @ in the URL. It is a binary feature which is set to 1 when @ is present in the URL and 0 otherwise. Phishers add the special symbol @ after the target domain, followed by the actual phishing domain, giving the URL the impression of a legitimate URL to the user. For example, consider the phishing URL http://www.signin.ebay.com/@www.phishing.com/abs/def/index.html; the user is fooled by clicking on this link and is directed to the phishing.com website. When the browser encounters the @ symbol in a URL, it ignores everything prior to the @ symbol and downloads the real address which follows the @ symbol.

UF2 = 1 if @ is present in the URL, 0 otherwise    (2)

3. UF3-Lengthy URL to hide the suspicious domain: This feature counts the number of characters in the hostname to detect the status of a website. It is similar to feature UF1 except that the phisher uses a long URL, adding more characters to hide the actual phishing domain of the website. The larger the number of characters, the higher the possibility of the website being a phishing site.

UF3 = total length of the hostname in the URL    (3)

4. UF4-Using IP Address: This is a binary feature which checks for the presence of an IP address in a given URL. If present, the feature is set to 1, else it is set to 0. Most legitimate sites do not use an IP address as the URL to download a webpage. Hence, when the user encounters an IP address, he can almost be sure that someone is trying to steal his/her sensitive information.

UF4 = 1 if an IP address is present, 0 otherwise    (4)

5. UF5-Presence of HTTPS (HTTP plus SSL): This feature checks for the presence of HTTPS in a given URL. If the URL contains HTTPS then the feature is set to 0, else it is set to 1. Most legitimate websites use an HTTPS connection when sensitive information needs to be forwarded. This feature is not sufficient on its own for detecting phishing sites because attackers are also able to obtain a fake HTTPS connection for their phishing websites, as shown in Fig. 2.

UF5 = 0 if HTTPS is present, 1 otherwise    (5)
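A minimal sketch of how features UF1–UF5 could be computed from a URL string in Java is given below; the class name and the simplified regular expression used to spot a dotted-decimal IP host are assumptions of this sketch rather than the authors' exact implementation.

import java.net.URL;

// Sketch of the five URL obfuscation features UF1-UF5 (Eqs. (1)-(5)).
public class UrlObfuscationFeatures {

    public static double[] extract(String urlString) throws Exception {
        URL url = new URL(urlString);
        String host = url.getHost();

        double uf1 = host.length() - host.replace(".", "").length();      // UF1: dots in the hostname
        double uf2 = urlString.contains("@") ? 1 : 0;                     // UF2: '@' present in the URL
        double uf3 = host.length();                                       // UF3: hostname length
        double uf4 = host.matches("\\d{1,3}(\\.\\d{1,3}){3}") ? 1 : 0;    // UF4: host is an IP address (simplified check)
        double uf5 = "https".equalsIgnoreCase(url.getProtocol()) ? 0 : 1; // UF5: 0 if HTTPS is used

        return new double[] { uf1, uf2, uf3, uf4, uf5 };
    }
}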
3.1.2 Third-party service-based features

In this section, we explain the existing third-party service-based features used in our detection model. These features have been significant in detecting phishing websites. In particular, feature TF1 is adopted from [23, 48, 49]; feature TF2 is adopted from [48–51]; and feature TF3 is adopted from CANTINA+ [27].

1. TF1-Age of Domain: This feature calculates the age of a website's domain, which is extracted from the WHOIS database (https://www.internic.net/whois.html). The rationale behind this feature is that most phishing websites are hosted on newly registered domains, resulting in either an absence of entries in the WHOIS database or a domain age of less than a year. But this is not always guaranteed for phishing sites because phishers might also host their phishing page on a compromised domain, as shown in Fig. 3. The importance of this feature depends on the number of compromised domains used by the phishers.

TF1 = 0 if the DNS record is absent, the age of the domain otherwise    (6)

2. TF2-Page Rank: As phishing sites have a short lifespan and less traffic compared to legitimate sites, these sites are either not recognized or rated as least popular by page ranking services like Alexa (http://www.alexa.com/siteinfo/page-rank-calculator.com). This feature calculates the page rank of a suspected website in the Alexa database. To obtain the PageRank score from the Alexa PageRank system, we simply send a normal HTTP request to http://data.alexa.com/data?cli=10&url=<domain> and use an XML parser to get the Alexa ranking.

TF2 = 0 if the page rank is absent, the page rank otherwise    (7)

3. TF3-Website in Search Engine Results: Keywords which uniquely define a website are fed to a search engine to display the related URLs as results. The webpage is classified as legitimate if its base domain matches any of the domains of the top N search results; otherwise it is classified as a suspicious site. CANTINA used this feature for the detection of a phishing attack by feeding high-ranked term frequency-inverse document frequency (TF-IDF) terms to the Google search engine. The TF-IDF algorithm is applied to the entire text of the website to obtain unique identity terms of a suspicious website. However, CANTINA fails when TF-IDF extracts wrong keywords as the website identity keywords. Hence, we skipped the TF-IDF technique for calculating website identity terms and considered the Title, Description and Copyright as website identifiers in our proposed model. This feature queries the search engine with the Title, Copyright or Description of a suspicious website and classifies the status of the website based on the presence of its domain in the top N search results. We set N to 10 based on the experimental results of CANTINA+. Due to the deprecation of the Google search API (https://developers.google.com/web-search/), we used the Bing search engine for the evaluation of this feature.

TF3_i = 0 if the local domain matches one of the top N search results, 1 otherwise    (8)

where i = 1, 2, 3 for the title, copyright and description, respectively.

3.1.3 Hyperlink-based features

In this section, we explain the features which are extracted from the hyperlinks located in the source code of a website. A hyperlink is an element in an electronic document that connects one web source to another. The web source can be an image, a program, an HTML document or an element within an HTML document.

To imitate the behavior of the trusted website, phishing websites impersonate the links or hyperlinks which point to the trusted website's domain. This implies that phishing pages regularly contain hyperlinks that point to a foreign domain.

In this paper, we consider links pointing to the local domain as local links and links pointing to a foreign domain as foreign links. For any legitimate site, the number of local links will be greater than the number of foreign links. Hence, we consider this as one of the features to classify a suspected site as a legitimate or phishing site. Similarly, we extract various URL features such as script resource URLs, CSS URLs and image source URLs to distinguish phishing sites from legitimate sites. In particular, HF1, HF2 and HF5 are adopted from [24, 48], and features HF3, HF4, HF6, HF7 and HF8 are the novel ones we propose.
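Before the individual features are defined, the sketch below illustrates the kind of Jsoup-based link collection that underlies the hyperlink features: it gathers the anchor URLs of a page and tallies their domains, which is the basis of the most-frequent-domain comparison used by HF1 and HF2 below. The class and method names are assumptions of this sketch.

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: tally the domains of all anchor links and compare the most
// frequent one with the local domain (the core of HF1/HF2).
public class AnchorDomainTally {

    public static int mostFrequentDomainDiffers(Document doc, String localDomain) {
        Map<String, Integer> counts = new HashMap<>();
        for (Element a : doc.select("a[href]")) {
            String href = a.attr("abs:href");          // absolute form of the link
            if (href.isEmpty()) {
                continue;                              // empty or fragment-only links are ignored here
            }
            try {
                counts.merge(new URL(href).getHost(), 1, Integer::sum);
            } catch (Exception ignored) {
                // mailto:, javascript: and other non-URL hrefs are skipped
            }
        }
        String mostFrequent = localDomain;
        int best = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                mostFrequent = e.getKey();
            }
        }
        return mostFrequent.equalsIgnoreCase(localDomain) ? 0 : 1;    // 0 = looks legitimate
    }

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/login").get();
        System.out.println(mostFrequentDomainDiffers(doc, "example.com"));
    }
}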
1. HF1-Frequency of domain in anchor links: This feature examines all the anchor links in the source code of a website and compares the most frequent domain with the local domain of the website. If both domains are the same, the feature is set to 0, i.e., the website is classified as legitimate; otherwise the feature is set to 1. The rationale behind this feature is that attackers design phishing sites sticking to links similar to those of the legitimate domain to reduce the design effort. For any legitimate site, the number of local links is greater than the number of foreign links, so the most frequent domain coincides with the local domain. This difference in behavior between legitimate and phishing sites is considered as one of the features to detect phishing sites.

HF1 = 0 if the most frequent domain equals the local domain, 1 otherwise    (9)

2. HF2-Frequency of domain in CSS links, image links, and Script links: This feature is similar to HF1, except that the most frequent domain is calculated for the request URLs of CSS files, images and script files from the <src> and <link> tags of the source code. We calculate the most frequent domain for the links of CSS, image and script files individually and also for the combined files. The most frequent domain of the request URLs is compared with the local domain to check the legitimacy of the website. If the comparison is successful, then the feature is set to 0 and the website is classified as legitimate; otherwise the feature is set to 1 and the website is classified as suspicious.

HF2 = 0 if the most frequent domain equals the local domain, 1 otherwise    (10)

3. HF3-Common page detection ratio in Website: This feature calculates the ratio of the frequency of the most common anchor link to the total number of links in a website. The rationale behind this feature is that phishers may redirect a few or all of the links in the login page to a common page. They might skip imitating the links of the legitimate site and rather stick to a single URL with the local domain. We believe that attackers spend less effort in designing a phishing site by mimicking only the login page. Hence, they do not very often use different anchor links and instead redirect most of the anchor links to a common page. For example, sample code of three anchor links of a phishing site pointing to a common URL is shown below.

<html lang="en">
<head>...</head>
<body>
...
<a href="http://tutorials.phishingsite.com/ion/index.html">forgot password</a>
<a href="http://tutorials.phishingsite.com/ion/index.html">Contact us</a>
<a href="http://tutorials.phishingsite.com/ion/index.html">Help</a>
...
</body>
</html>

This kind of scenario in phishing sites results in a high common page detection ratio, whereas in legitimate sites the anchor links mostly point to different filenames, which results in a low common page detection ratio. Note that if there are no anchor links present in a website then the feature is set to 0; otherwise it is set to the common page detection ratio value.

HF3 = al_cw / al_nw if al_nw > 0, 0 otherwise    (11)

where al_cw is the frequency of the most common anchor link and al_nw is the total number of anchor links in the website. From the experimental results, it is observed that this feature is significant in classifying phishing websites. The dataset, when trained with this feature individually, resulted in an accuracy of 83.73%, as shown in Fig. 11.

4. HF4-Common Page detection ratio in Footer: This feature calculates the ratio of the frequency of the most common anchor link to the total number of links in the footer section of the website. It is similar to HF3, except that it is restricted to the footer section of the website.

HF4 = al_cf / al_nf if al_nf > 0, 0 otherwise    (12)

where al_cf is the frequency of the most common anchor link in the footer and al_nf is the total number of anchor links in the footer section of the website.

5. HF5-Null Links ratio in website: We consider anchor links with a null value, or anchor links starting with a null value, as null links. This feature calculates the ratio of null links to the total number of anchor links in a website. The main intention of the attacker is to make the online user stay on the same page until he submits his sensitive information. Hence, when a user clicks on any link present in the login page, he is redirected to the same login page. This is done by adding the following code as an anchor link:

<a href="#"> or <a href="#content">

HF5 = al_nullw / al_nw if al_nw > 0, 0 otherwise    (13)

where al_nullw is the number of null links and al_nw is the total number of anchor links in the website.
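The two ratios just defined can be read off the parsed document directly; the sketch below computes HF3 (Eq. (11)) and HF5 (Eq. (13)) with Jsoup, using illustrative names for the class and methods.

import java.util.HashMap;
import java.util.Map;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Sketch of HF3 (common page detection ratio) and HF5 (null links ratio).
public class AnchorRatioFeatures {

    // HF3 = al_cw / al_nw: share of anchors pointing to the single most common link.
    public static double hf3(Document doc) {
        Elements anchors = doc.select("a[href]");
        int alnw = anchors.size();
        if (alnw == 0) {
            return 0.0;
        }
        Map<String, Integer> counts = new HashMap<>();
        int alcw = 0;
        for (Element a : anchors) {
            int c = counts.merge(a.attr("abs:href"), 1, Integer::sum);
            alcw = Math.max(alcw, c);
        }
        return (double) alcw / alnw;
    }

    // HF5 = al_nullw / al_nw: share of anchors whose href is empty or starts with "#".
    public static double hf5(Document doc) {
        Elements anchors = doc.select("a");
        int alnw = anchors.size();
        if (alnw == 0) {
            return 0.0;
        }
        int alnullw = 0;
        for (Element a : anchors) {
            String href = a.attr("href").trim();
            if (href.isEmpty() || href.startsWith("#")) {
                alnullw++;
            }
        }
        return (double) alnullw / alnw;
    }
}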
6. HF6-NULL Links ratio in Footer: This feature extracts the ratio of null links in the footer section of a website to the total number of footer links in the website. Here we identify links with a null value in the footer section rather than in the entire HTML, which reduces the false positive ratio and the time complexity compared to the HF5 feature, since we deal with fewer links rather than all the links in the website. Most legitimate websites do not often use null links in the footer section, but phishing sites usually insert null links in the footer section to make the user stay on the same page until the sensitive information is submitted. Figure 6 shows a screenshot of a phishing site along with a null footer link.

Fig. 6 Phishing site with null links, footer and copyright section

HF6 = al_nullf / al_nf if al_nf > 0, 0 otherwise    (14)

where al_nullf is the number of null links in the footer and al_nf is the total number of anchor links in the footer section of the website.

7. HF7-Presence of anchor links in HTML body: This feature checks for the presence of anchor links in the body section of the source code of a website. If they are absent, the feature is set to 1, else it is set to 0. The rationale behind this feature is that there exists at least one anchor link such as signin, signup, forgot password or others in a legitimate site's login page, whereas in phishing websites attackers replace the entire legitimate site content with a single background image. We term these kinds of phishing sites image-based phishing. The attacker achieves this by taking a screenshot of the targeted legitimate site and embedding it as the background of the phishing page. On top of the screenshot, they insert input fields such as username or email, password and a login button following the same styles as the targeted legitimate domain.
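The rule for HF7 described above translates directly into a single DOM query; the sketch below (class and method names assumed) returns 1 when the page body contains no anchor element at all.

import org.jsoup.nodes.Document;

// Sketch of HF7: 1 when the <body> contains no anchor links, which is
// typical of "image-based" phishing pages built on a single screenshot.
public class ImagePhishingFeature {

    public static int hf7(Document doc) {
        return doc.body().select("a").isEmpty() ? 1 : 0;
    }
}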
Figure 7 explains how the phishing is carried out by replacing the textual content with a single background image. Figure 7a shows the background image used by the attacker. The location of the background image can be seen in the address bar: https://gator3231.hostgator.com/~cheaks/bost/FTEROO09K/img/bg.PNG. Figure 7b shows the form imitating the same style as the legitimate site; this figure is obtained by removing the background image from the source code of the phishing site. Figure 7c shows the fully developed phishing site with the background image and the input fields embedded into the body of the website. The background image location can be seen in the inspect code of the website. The input fields are properly aligned by adjusting their position with z-index, absolute position, height, width, top and left dimensions.

Fig. 7 a Background image used for phishing, b Phishing site without background image, c fully developed phishing site along with its source code
For a better understanding, sample source code of such a phishing site is given below:

<html lang="en">
<head>...</head>
<body style="background-image: url(img/bg.PNG); background-repeat: no-repeat;"> <!-- background image inserted here -->
<form method="post" action="LOGIN.php" style="z-index: 1; width: 395px; height: 232px; position: absolute; top: 208px; left: 211px">
<div style="z-index: 1; width: 340px; height: 40px; position: absolute; top: 53px; left: 24px">
<input style="z-index: 1; position: absolute; top: -3px; left: 0px; width: 338px; height: 38px" type="email">
</div>
... <!-- similarly, the password field and submit button are aligned to look like the legitimate site -->
</form>
</body>
</html>

Hence, HF7 can be defined as given below:

HF7 = 0 if al_n is not 0, 1 otherwise    (15)

where al_n is the number of anchor links in the website.

8. HF8-Broken links ratio: This feature extracts the ratio of not-found links to the total number of links in a website. In legitimate sites, when the links are requested they either return the 200 OK HTTP status code, indicating that the server has accepted the request, or the 404 status code, indicating that the page was not found on that server. Legitimate sites rarely return a 404 status code, whereas in phishing sites attackers include links with random filenames leading to not-found pages. Hence, we use this feature to separate legitimate sites from phishing sites.

HF8 = n_fl / l_n if l_n > 0, 0 otherwise    (16)

where n_fl is the number of not-found links and l_n is the total number of links in the website.

3.2 Machine learning algorithms

To evaluate the performance of our feature set, we used eight machine learning algorithms to train our model, including the J48 tree, Random Forest (RF), sequential minimal optimization (SMO), logistic regression (LR), multilayer perceptron (MLP), Bayesian network (BN), support vector machine (SVM) and AdaBoostM1 (AM1). We performed a comparison of all the classifiers with the primary goal of choosing the best classification model among them. These algorithms have been used by previous researchers in [27, 29] for training their models. All the above machine learning algorithms are implemented in R using the RWeka and kernlab packages. From the extensive experiments, we found that Random Forest consistently performed the best among the other algorithms. We also performed a comparison of RF with various oblique Random Forests to identify the best RF classifier. The reason behind the high performance of Random Forest is the ensemble of weak learning models (tree predictors) with diversity of training data among them. Note that combining multiple weak classifiers into one aggregated classifier leads to better classification performance than that of any individual classifier [52], because the probability of making mistakes at the same place in all of the trees is very low. Hence, it achieves higher accuracy than an individual weak classifier.

4 Implementation

Given the URL of a suspected website, we have to decide whether it is a legitimate or phishing site. We used a Java library called Jsoup (http://jsoup.org/) to download the source code of static pages and parsed it to extract the significant features used in detecting the status of a website. The reason behind choosing Jsoup is that it can handle non-well-formed HTML very effectively and it has its own API providing functionality to select elements of the document object model (DOM). Other parsers like TagSoup, JTidy, NekoHTML and HTML Cleaner are lenient only to a certain degree with non-well-formed HTML.

4.1 Tools used

We have implemented a Java program to extract all the features from the websites on the Java Platform, Standard Edition 8. We collected the URLs of live phishing sites from PhishTank and of legitimate sites from the Alexa database. We wrote a snippet of code to automate the process of extracting the features and storing them in a file. This file of features is given as input to the classifier for calculating the effectiveness of our features in detecting phishing sites. We simulated all the machine learning algorithms in the R language using the RWeka, kernlab and obliqueRF packages. RWeka contains interface code which connects to Weka, a data mining tool which provides implementations of machine learning algorithms. kernlab is an extensible package which consists of kernel-based machine learning methods; we used kernlab for the implementation of SVM. We employed the methods presented by Menze et al. [30] and Zhang and Ponnuthurai [31, 32] for the comparison of RF with various oRFs in R and Matlab, respectively.
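The authors drive these classifiers from R through RWeka; an equivalent way to train and test a Random Forest directly through Weka's Java API is sketched below. The feature file name, the single 75/25 split and the default Random Forest settings are assumptions of this sketch, not the authors' exact configuration (they average ten runs and use the parameters listed in Table 3).

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: train a Random Forest on the extracted feature file and report
// its performance on a held-out 25% test set (file name and split assumed).
public class TrainRandomForest {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("features.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class label is the last attribute
        data.randomize(new Random(1));

        int trainSize = (int) Math.round(data.numInstances() * 0.75);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(rf, test);
        System.out.println(eval.toSummaryString());
    }
}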
Neural Computing and Applications (2019) 31:3851–3873 3865

4.2 Dataset used FP


FPR ¼ ð18Þ
FP þ TN
To calculate the performance of our model, we collected
• Specificity: Calculates the rate of correctly classified
2119 phishing sites from PhishTank and 1407 legitimate
legitimate sites out of total number of legitimate sites.
sites from Alexa Database, as shown in Table 2. We
For a better model, this value should be higher.
divided the dataset into two sets: training set and test set.
75% of original dataset is taken as training set which is TN
Specificity ¼ ð19Þ
used to train the model and remaining 25% of the dataset is TN þ FP
taken as test set which is used to evaluate the model. • False negative rate (FNR): Calculates the rate of
incorrectly classified phishing sites out of total number
of phishing sites. For a better model, this value should
5 Evaluation be lower.
FN
We conducted five experiments to assess the performance FNR ¼ ð20Þ
of our model trained with various machine learning clas- FN þ TP
sifiers. All the experiments were conducted with the same • Precision (Pre): Calculates the ratio of correctly
dataset of 3526 (1407?2119) websites. Each experiment classified phishing sites out of total number of classified
was repeated ten times with a randomly selected training phishing sites.
set. The average value of each metric was considered as TP
final value such that biasing is avoided. In Experiment 1, Pre ¼ ð21Þ
TP þ FP
we evaluated our model with all the features including
third-party-based features. In Experiment 2, we evaluated • Accuracy (ACC): Calculates the rate of phishing and
our model excluding third-party service-based features. In legitimate sites that are correctly classified out of total
Experiment 3, we evaluated only third-party service fea- number of websites.
tures used in predicting phishing websites. In Experiment TP þ TN
4, we compared our model containing all the features with ACC ¼ ð22Þ
PþN
CANTINA and CANTINA?. In Experiment 5, we com-
pared orthogonal Random Forest with various oblique • Error rate (ERR): Calculates the rate of phishing and
Random Forest algorithms. We used following metrics for legitimate sites that are incorrectly classified out of
evaluating our model. total number of websites.
We considered condition positive (P) as Phishing and FP þ FN
ERR ¼ ð23Þ
condition negative (N) as Legitimate. Correctly classified PþN
phishing is termed as hit or true positive (TP), correctly
classified legitimate as true negative (TN), incorrectly The details of parameters we set for the algorithms in our
classified legitimate to phishing is termed as false positive experiments are given in Table 3. We used same parame-
(FP) or false Hit, incorrectly classified phishing to legiti- ters for the Random Forest and different variants of oblique
mate is termed as false negative (FN) or false miss. Random Forests so that comparison is conducted without
bias toward any algorithm.
• Sensitivity: Calculates the rate of correctly classified
phishing sites out of total number of phishing sites. For 5.1 Experiment 1: Evaluation of heuristics
a better model, this value should be higher. features including third-party service
TP features
Sensitivity ¼ ð17Þ
TP þ FN
In this experiment, we assessed heuristic features including
• False positive rate (FPR): Calculates the rate of
third-party service features with eight different classifiers.
incorrectly classified legitimate sites out of total
We tested the proposed system with a dataset of 3526
number of legitimate sites. For a better model, this
websites comprising of both legitimate and phishing
value should be lower.

Table 2 Dataset used in our model

Type        Source                 Sites   URL
Legitimate  Alexa's top websites   1407    http://www.alexa.com/topsites
Phishing    PhishTank database     2119    http://www.phishtank.com/developer
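A minimal sketch of how such a labeled URL dataset can be assembled before feature extraction is given below; the file names (phishtank_urls.txt, alexa_urls.txt) are hypothetical placeholders rather than artifacts from the original study.

```python
import random

def load_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical dump files of the two sources listed in Table 2.
phishing = [(url, 1) for url in load_urls("phishtank_urls.txt")]   # label 1 = phishing
legitimate = [(url, 0) for url in load_urls("alexa_urls.txt")]     # label 0 = legitimate

dataset = phishing + legitimate
random.shuffle(dataset)  # mix classes before the 75/25 split described in Sect. 4.2
print(len(dataset), "labeled URLs ready for feature extraction")
```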


Table 3 Tuning parameters for the experiments


Learning algorithms Parameters

Random Forest & Oblique Random Forest (All types) Maximum depth: unlimited
Number of trees to be generated (ntree): 100
Number of attributes (mtry): 4
Number of Iterations: 10
J48 The confidence factor used for pruning: 0.25
The minimum number of instances per leaf: 2
Number of folds: 3
Logistic regression Maximum number of iterations: unlimited
Ridge value in the log-likelihood: 1.0E-8
Bayes network Estimator: Simple estimator (alpha: 0.5)
Search algorithm: hill climbing algorithm (max no of parents: 2)
Multilayer perceptron Learning rate: 0.3
Momentum: 0.2
Number of Epochs: 500
Hidden layers: (attribs + classes)/2
Validation set size: 0
Validation threshold: 0
Sequential minimal optimization Complexity: 1
Kernel: PolyKernel (Exponent: 1, Cache: 250007)
Epsilon: 1.0E-12
Tolerance parameter: 0.001
AdaBoostM1 Classifier: decision stump
Number of Iterations: 10
Weight threshold: 100
Support vector machines Kernel: Radial basis kernel "Gaussian"
Sigma: 0.05
SVM type: C-SVC
Epsilon: 0.1
Cost of constraints violation (C): 5
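The Weka-style settings in Table 3 do not map one-to-one onto every toolkit, but a rough scikit-learn equivalent for a few of the classifiers could look like the sketch below. The placeholder feature count, the gamma value standing in for sigma, and the AdaBoost base learner being the library's default stump are assumptions, not statements about the original setup.

```python
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

n_features = 16  # placeholder: number of extracted features per website

models = {
    # Random Forest: 100 trees, unlimited depth, mtry = 4 attributes tried per split
    "RF": RandomForestClassifier(n_estimators=100, max_features=4, max_depth=None),
    # Multilayer perceptron: one hidden layer of (attribs + classes)/2 units,
    # learning rate 0.3, momentum 0.2, 500 epochs
    "MLP": MLPClassifier(hidden_layer_sizes=((n_features + 2) // 2,), solver="sgd",
                         learning_rate_init=0.3, momentum=0.2, max_iter=500),
    # AdaBoostM1 with 10 iterations; the default base learner is a decision stump
    "AdaBoostM1": AdaBoostClassifier(n_estimators=10),
    # C-SVC with an RBF ("Gaussian") kernel and C = 5; gamma approximates sigma = 0.05
    "SVM": SVC(kernel="rbf", C=5, gamma=0.05),
}
```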

5.1 Experiment 1: Evaluation of heuristic features including third-party service features

In this experiment, we assessed the heuristic features, including the third-party service features, with eight different classifiers. We tested the proposed system with a dataset of 3526 websites comprising both legitimate and phishing websites. We compared all the classifiers with the primary goal of choosing the best classification model among them. The Random Forest algorithm performed the best with an average prediction accuracy of 99.31%, followed by the J48 decision tree algorithm with an accuracy of 98.98%, as shown in Table 4. All the machine learning algorithms classified the websites with an acceptable true positive rate, which indicates the goodness of our proposed features in classifying websites. Figure 8 shows the prediction errors of the eight classifiers; from the figure, it is observed that the RF classifier achieved the least average prediction error rate.
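A minimal sketch of the evaluation protocol described above (ten runs, a random 75/25 split per run, averaged test accuracy), assuming the feature matrix X and label vector y have already been extracted; the variable names are placeholders, not the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def average_accuracy(X, y, runs=10):
    """Repeat a random 75/25 holdout split and average the test accuracy."""
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                                  random_state=seed)
        clf = RandomForestClassifier(n_estimators=100, max_features=4)
        clf.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return float(np.mean(scores))

# X: (n_sites, n_features) feature matrix; y: 1 = phishing, 0 = legitimate
# print(average_accuracy(X, y))
```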

Table 4 Evaluation of heuristic features including third-party service features

Metrics      RF      J48     LR      BN      MLP     SMO     AdaBoostM1  SVM
Sensitivity  0.9944  0.9901  0.9597  0.9921  0.9581  0.9458  0.9807      0.9714
Specificity  0.9910  0.9895  0.9413  0.9825  0.9407  0.9220  0.9587      0.9430
Precision    0.9942  0.9930  0.9600  0.9883  0.9612  0.9476  0.9725      0.9609
ACC          0.9931  0.9898  0.9522  0.9882  0.9512  0.9363  0.9718      0.9594
ERR          0.0069  0.0102  0.0478  0.0118  0.0488  0.0637  0.0282      0.0406


5.1.1 Evaluation of heuristics with the inclusion and exclusion of proposed features

Figure 9 shows the performance of the proposed system with the inclusion and the exclusion of the proposed features (HF3, HF4, HF6, HF7, HF8) using the Random Forest algorithm. From the Experiment 1 results, it is evident that the Random Forest algorithm performs better than the other algorithms; therefore, we used the same Random Forest algorithm to study the effect of our proposed features in detecting phishing websites. From Fig. 9, it is evident that the accuracy of the phishing detection system improves when the proposed features are included compared to the system with the proposed features excluded. Hence, the proposed features complement the detection of phishing websites with significant accuracy.

Fig. 8 Comparison of error rates of all classifiers (average prediction error in %, corresponding to the ERR row of Table 4: RF 0.69, J48 1.02, LR 4.78, BN 1.18, MLP 4.88, SMO 6.37, AdaBoostM1 2.82, SVM 4.06)

Fig. 9 Performance of the proposed system with the inclusion and the exclusion of proposed features (ACC, sensitivity, specificity and precision, with and without the proposed features; all values lie between roughly 99.0% and 99.5%)

5.2 Experiment 2: Evaluation of heuristics excluding third-party service features

In this experiment, we assessed the heuristic features excluding the third-party service features with eight different classifiers. We tested the scheme with the same dataset used in Experiment 1 (Sect. 5.1) to check the effectiveness of the heuristics alone in detecting phishing websites. We compared all the classifiers; among them, the Random Forest algorithm performed the best with an average prediction accuracy of 93.19%, followed by the J48 decision tree algorithm with an accuracy of 91.76%, as shown in Table 5. This experiment shows a drop in phishing detection accuracy of 6.12% due to the lack of third-party service features. The results of this experiment demonstrate that third-party service features have a significant influence on the performance of our model.

5.3 Experiment 3: Evaluation of third-party service features

In this experiment, we considered only the third-party service features for the prediction of phishing websites. The results are tabulated in Table 6. From the results, it is evident that third-party service features are significant in detecting phishing websites. Among all classifiers, Random Forest performed the best with an average prediction accuracy of 98.55% and an error of 1.45%. Comparing the results of Experiment 1 with Experiment 3 using the Random Forest algorithm, there is a drop of 0.76% in accuracy. In the future, this drop may become larger once attackers start hosting phishing pages on compromised legitimate domains. However, the results illustrate that combining third-party service features with heuristic features makes the classification more accurate.

5.4 Experiment 4: Comparison of our work with other techniques

In this experiment, we implemented the existing works CANTINA and CANTINA+ and tested them with the same dataset used in Experiment 1. We compared the performance of our work with these techniques with the aim of evaluating our features against the baseline models. This experiment demonstrates that our model offers higher accuracy than these methods in detecting phishing websites. The Random Forest algorithm performed the best with an accuracy of 99.31%, followed by the J48 algorithm with an accuracy of 98.98%, as shown in Table 7.


Table 5 Evaluation of heuristics excluding third-party service features

Metrics      RF      J48     LR      BN      MLP     SMO     AdaBoostM1  SVM
Sensitivity  0.9427  0.9403  0.9351  0.9036  0.9265  0.9149  0.9079      0.9438
Specificity  0.9159  0.8825  0.8877  0.9017  0.8992  0.8868  0.8588      0.8893
Precision    0.9438  0.9250  0.9249  0.9339  0.9323  0.9235  0.9099      0.9286
ACC          0.9319  0.9176  0.9160  0.9027  0.9156  0.9036  0.8888      0.9222
ERR          0.0681  0.0824  0.084   0.0973  0.0844  0.0964  0.1112      0.0778

Table 6 Evaluation of our model with only third-party service features

Metrics      RF      J48     LR      BN      MLP     SMO     AdaBoostM1  SVM
Sensitivity  0.9885  0.9830  0.9125  0.9826  0.9179  0.8999  0.9929      0.9321
Specificity  0.9810  0.9752  0.8253  0.9723  0.8171  0.8250  0.9548      0.8262
Precision    0.9875  0.9834  0.8859  0.9816  0.8827  0.8839  0.9704      0.8905
ACC          0.9855  0.9799  0.8775  0.9785  0.8772  0.8698  0.9776      0.8898
ERR          0.0145  0.0201  0.1225  0.0215  0.1228  0.1302  0.0217      0.1102
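Experiments 1–3 amount to training the same classifier on three feature subsets (all features, heuristics only, third-party only). A rough sketch of that ablation is shown below, assuming the column indices of the two feature groups are known; the index lists are placeholders, as the real indices depend on the extracted feature order.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder column groupings for the extracted feature matrix.
HEURISTIC_COLS = list(range(0, 13))      # URL- and content-based heuristic features
THIRD_PARTY_COLS = list(range(13, 16))   # e.g. domain age, page rank, search-engine checks

def holdout_accuracy(X, y, cols, seed=0):
    """Train/test a Random Forest on a 75/25 split restricted to the given columns."""
    X_sub = X[:, cols]
    X_tr, X_te, y_tr, y_te = train_test_split(X_sub, y, test_size=0.25,
                                              random_state=seed)
    clf = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# for name, cols in [("all", HEURISTIC_COLS + THIRD_PARTY_COLS),
#                    ("heuristics only", HEURISTIC_COLS),
#                    ("third-party only", THIRD_PARTY_COLS)]:
#     print(name, holdout_accuracy(X, y, cols))
```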

Table 7 Comparison of our work with other techniques

Classifiers  Techniques  Sensitivity  Specificity  Precision  ACC     ERR
RF           CANTINA     0.9154       0.8559       0.9057     0.8918  0.1082
             CANTINA+    0.9889       0.9949       0.9967     0.9913  0.0087
             Our Work    0.9944       0.9910       0.9942     0.9931  0.0069
J48          CANTINA     0.9210       0.8379       0.8958     0.8880  0.112
             CANTINA+    0.9886       0.9896       0.9931     0.9890  0.011
             Our Work    0.9901       0.9895       0.9930     0.9898  0.0102
LR           CANTINA     0.8922       0.7973       0.87189    0.8550  0.145
             CANTINA+    0.9248       0.8784       0.9177     0.9060  0.094
             Our Work    0.9597       0.9413       0.9600     0.9522  0.0478
BN           CANTINA     0.9109       0.8064       0.8782     0.8696  0.1304
             CANTINA+    0.9914       0.98426      0.9881     0.9878  0.0122
             Our Work    0.9921       0.9825       0.9883     0.9882  0.0118
MLP          CANTINA     0.9143       0.8440       0.8981     0.8862  0.1138
             CANTINA+    0.9454       0.9009       0.9343     0.9273  0.0727
             Our Work    0.9581       0.9407       0.9612     0.9512  0.0488
SMO          CANTINA     0.8922       0.7936       0.8661     0.8528  0.1472
             CANTINA+    0.9273       0.8763       0.9197     0.9071  0.0929
             Our Work    0.9458       0.9220       0.9476     0.9363  0.0637
AdaBoostM1   CANTINA     0.8925       0.7987       0.8692     0.8549  0.1451
             CANTINA+    0.9894       0.9526       0.9685     0.9745  0.0255
             Our Work    0.9807       0.9587       0.9725     0.9718  0.0282
SVM          CANTINA     0.9356       0.8128       0.8844     0.8871  0.1129
             CANTINA+    0.9527       0.9011       0.9352     0.9320  0.068
             Our Work    0.9714       0.9430       0.9609     0.9594  0.0406

5.5 Experiment 5: Comparison of RF with oRF

In this experiment, we evaluated the performance of various oblique Random Forest classifiers [30–32] and compared them with the orthogonal RF to identify the best classifier for the detection of phishing sites. We employed the oRF variants MPRaF-P, MPRaF-T, PCA-RF and LDA-RF proposed by Zhang and Suganthan [31, 32]; MPRaF-P and MPRaF-T denote the MPSVM-based oblique Random Forest with an axis-parallel split and with Tikhonov regularization, respectively, while PCA-RF and LDA-RF are the PCA-based and LDA-based oblique RF.


We also employed the method proposed by Menze et al. [30] for simulating further oRF variants: oRF-svm, oRF-ridge, oRF-ridge_slow, oRF-pls, oRF-log and oRF-rnd, whose decision hyperplanes are based on a support vector machine, ridge regression (fast and slow versions), partial least squares regression, logistic regression and randomness, respectively. All the above-mentioned oblique Random Forests and the orthogonal Random Forest were constructed with 100 decision trees (ntree = 100) and tuned with mtry = 4, where mtry is the number of attributes considered for a split. mtry is calculated as the square root of the total number of attributes:

  mtry = max(sqrt(ncol(x)), 2)    (24)

Note that all the models were trained with 75% of the dataset and tested with the remaining 25%. The experiments were repeated ten times for each oblique Random Forest and the average value was taken for the comparison. The results are shown in Table 8. From the results, it is observed that PCA-RF outperformed all of the RF variants with an accuracy of 99.55%, followed by MPRaF-P with an accuracy of 99.43%. Figure 10 shows the prediction errors of the various oRF classifiers; from Fig. 10, it is evident that PCA-RF and MPRaF-P performed better than the orthogonal RF, with the lowest average prediction error rates. It may also be observed that the oRFs trained with pls, ridge, ridge_slow, svm and rnd splits achieved lower accuracy than RF. This is because such oRFs outperform RF on heavily correlated, high-dimensional data, but their performance falls below RF on factorial or discrete data [30]; our extracted phishing website features are not heavily correlated and are mostly discrete.

The usage and popularity of Random Forest is increasing due to its flexibility and simplicity on complex data. It overcomes the problem of overfitting and handles sparse/missing data well.
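As a quick illustration of Eq. (24) and the shared tree settings, the sketch below derives mtry from the number of feature columns and plugs it into a 100-tree forest. scikit-learn is assumed here, the feature count is a placeholder, and the oblique split variants themselves are not part of this snippet.

```python
import math
from sklearn.ensemble import RandomForestClassifier

def mtry_for(n_features):
    """Eq. (24): number of attributes tried at each split."""
    return max(int(math.sqrt(n_features)), 2)

n_features = 16                      # placeholder feature count; sqrt(16) = 4 gives mtry = 4
rf = RandomForestClassifier(
    n_estimators=100,                # ntree = 100, as in the comparison
    max_features=mtry_for(n_features),
    max_depth=None,                  # unlimited depth (Table 3)
)
print("mtry =", mtry_for(n_features))
```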

Table 8 Comparison of RF with oRF


Metrics RF MPRaF-T MPRaF-P PCA-RF LDA-RF oRF-pls oRF-svm oRF-ridge oRF-ridge_slow oRF-log oRF-rnd

Sensitivity 0.9945 0.9672 0.9945 0.9945 0.9750 0.9874 0.9804 0.9891 0.9885 0.9925 0.9863
Specificity 0.9923 0.9865 0.9942 0.9961 0.9809 0.9364 0.9622 0.9424 0.9369 0.9407 0.9741
Precision 0.9891 0.9806 0.9918 0.9945 0.9723 0.9594 0.9748 0.9626 0.9591 0.9623 0.9817
ACC 0.9932 0.9785 0.9943 0.9955 0.9785 0.9668 0.9731 0.9705 0.9678 0.9720 0.9831
ERR 0.0068 0.0215 0.0057 0.0045 0.0215 0.0332 0.0269 0.0295 0.0322 0.0280 0.0169

Fig. 10 Comparison of error rates of RF with different types of RF (average prediction error in %, per classifier: RF, MPRaF-T, MPRaF-P, PCA-RF, LDA-RF, oRF-pls, oRF-svm, oRF-ridge, oRF-ridge_slow, oRF-log and oRF-rnd; the values correspond to the ERR row of Table 8)


6 Discussion

The possible ways in which attackers could bypass our model are discussed in this section. The contribution of the individual features in detecting phishing sites is shown in Fig. 11.

As third-party service features, we used the Bing search engine for comparing the suspicious domain with the top N search results, the WHOIS database for calculating the age of the domain, and the Alexa page ranking for identifying the rank of the website. These features make our method suffer from network latency due to querying each service.

The Alexa page rank feature fails when the attacker uses a compromised domain for the phishing URL. It also produces false positives by classifying newborn legitimate sites as phishing sites. The failure is due to the short lifespan of a phishing site, whereas the lifespan of a reputed legitimate site is usually at least a year. The WHOIS-based feature suffers from the same problem as the Alexa feature: it fails when a compromised domain is used for hosting a phishing site, which results in a greater domain age for the phishing page.

To bypass the null links feature in a login page, phishers might insert the current URL of the phishing page as the link value instead of a null (#) value. However, this case is countered by the common page detection ratio filter, which is based on the most frequent URL. They can also bypass the null links or common page detection ratio filters by copying the login page into many folders and pointing the links to these files.

The main intention of attackers is to design a phishing website with the least effort. Hence, the attacker uses the same links pointing to the images of the legitimate site instead of pointing to the local domain; this makes detection of the phishing attack possible through the hyperlink-based feature. With extra effort, however, the attacker can bypass this feature by using links pointing to its own domain.

Sometimes legitimate sites also behave like phishing sites, containing a large number of foreign links pointing to their mapped domains. These legitimate sites request images, scripts, style sheets and other graphics files from their mapped domains. For example, consider the legitimate website https://www.paypal.com/in/webapps/mpp/home, which requests its scripts, images and CSS files from the paypalobjects.com domain. Such websites cause the HF1 and HF2 features to produce a high false positive rate. To address this limitation, we collected a list of legitimate sites and their associated mapped domains. Whenever a feature encounters a foreign domain that is listed as a mapped domain, its associated domain is compared with the local domain of the website; the mapped domain is considered a local domain if the comparison is successful, and otherwise it is treated as a foreign domain.

The results of our experiments clearly demonstrate that third-party service features have a significant influence on the performance of the model. By incorporating these features, our model outperformed the existing techniques. We also evaluated the contribution of the individual features in the detection of phishing websites with the aim of identifying the top-performing features. The accuracy of each individual feature was calculated using the Random Forest algorithm, as shown in Fig. 11.
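The mapped-domain check described above can be sketched as follows. The mapping table contains only the paypal.com / paypalobjects.com pair mentioned in the text, and the helper names are placeholders rather than the paper's actual implementation.

```python
from urllib.parse import urlparse

# Legitimate sites and the domains they are known to load resources from.
MAPPED_DOMAINS = {
    "paypal.com": {"paypalobjects.com"},
}

def registered_domain(url):
    """Crude registered-domain extraction (last two host labels)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def is_foreign_link(page_url, link_url):
    """Treat a link as local if it stays on the page's domain or on a mapped domain."""
    local = registered_domain(page_url)
    target = registered_domain(link_url)
    if target == local:
        return False
    return target not in MAPPED_DOMAINS.get(local, set())

print(is_foreign_link("https://www.paypal.com/in/webapps/mpp/home",
                      "https://www.paypalobjects.com/web/res/main.css"))  # False: mapped domain
```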

Fig. 11 Performance of individual features (accuracy in % per feature, for the UF, TF and HF feature groups; the values range from roughly 59% to 97%)


It is observed that some of the features outperformed the others with over 70% accuracy, including the length of the hostname (UF3), age of domain (TF1), page ranking (TF2), search engine title (TF31), search engine copyright (TF32), common page detection ratio in the website (HF3), common page detection ratio in the footer (HF4), null links in the website (HF5), presence of anchor links in the HTML body (HF7) and the broken links ratio (HF8). Out of the five proposed novel features, four are among the top-performing features in detecting phishing sites. It is also observed that each feature used for training our model has an accuracy of over 60%, which indicates the goodness of our features in detecting phishing websites. It should be noted, however, that these statistics depend on the quality and quantity of the training data used to train the model for the classification of websites.

7 Limitations

This section explains the limitations of our model. Firstly, our model purely depends on the quality and quantity of the training set. Hence, training the model with a small dataset, or one with many duplicate entries, may not give effective results in phishing detection.

Our model uses third-party services for the classification of websites, which makes it dependent on the speed of those services. This problem can be mitigated with a fast Internet connection and alternative sources. The broken links feature extraction has the limitation of longer execution time for websites with a large number of links, for example Wells Fargo, USAA, etc.

Our model does not work on websites which use captcha checking before loading, because Jsoup connects to the URL to fetch the source code directly and does not handle captcha verification. We observed that no captcha verification is involved in current phishing websites; however, we suspect that attackers might include captchas in phishing websites in the future. The model also fails to detect phishing sites containing embedded objects such as JavaScript, Flash and HTML files; attackers use embedded objects to replace the textual content of a website to bypass anti-phishing techniques.

8 Conclusion and future work

In this paper, we have proposed a novel model for the detection of phishing sites using heuristic features of websites. We compared our work with CANTINA and CANTINA+ to demonstrate the advantage of our model over these approaches in detecting phishing sites. The features used in our model to detect phishing do not depend on an initial image database or prior knowledge such as web history. Our method identifies phishing sites based on features extracted from the URL, the website content and third-party services using machine learning algorithms.

Our model also detects phishing sites which imitate legitimate sites by replacing the website content with an image, which most anti-phishing techniques fail to detect. It likewise detects zero-day phishing sites, which list-based techniques fail to detect. With the help of this rich set of heuristic features, our model is able to achieve a high detection rate of 99.55% and a low false positive rate of 0.45% using the oblique Random Forest algorithm.

In the future, we intend to include additional heuristics to detect phishing attacks containing embedded objects such as Flash, iframes and HTML files. Additional heuristics that exclude third-party services, for increasing the efficiency of our model, will also be investigated; this would reduce the dependency on third-party services and thereby reduce the latency of the detection process.

Acknowledgements The authors would like to thank the Ministry of Electronics & Information Technology (MeitY), Government of India, for their support of part of this research. We also thank the anonymous reviewers for their detailed and useful comments, and Zhang and Suganthan for their correspondence regarding the use of their oRF algorithms.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Ollmann G (2004) The phishing guide. Next Generation Security Software Limited. http://www-935.ibm.com/services/us/iss/pdf/phishing-guide-wp.pdf
2. APWG (2016) Phishing attack trends reports, fourth quarter 2016. http://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf. Accessed 03 Mar 2017
3. Dhamija R, Tygar JD, Hearst M (2006) Why phishing works. In: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, pp 581–590. https://doi.org/10.1145/1124772.1124861
4. APWG (2016) Phishing attack trends reports, first quarter 2016. http://docs.apwg.org/reports/apwg_trends_report_q1_2016.pdf. Accessed 01 June 2016
5. Kaspersky Lab (2014) Spam and phishing trends and statistics report Q1 2014. https://usa.kaspersky.com/internet-security-center/threats/spam-statistics-report-q1-2014. Accessed 15 July 2015
6. Hong J (2012) The state of phishing attacks. Commun ACM 55(1):74–81
7. Cao Y, Han W, Le Y (2008) Anti-phishing based on automated individual white-list. In: Proceedings of the 4th ACM workshop on digital identity management, ACM, pp 51–60
8. Zhang J, Porras PA, Ullrich J (2008) Highly predictive blacklisting. In: USENIX security symposium, pp 107–122
our model to detect phishing does not depend on initial


9. Prakash P, Kumar M, Kompella RR, Gupta M (2010) Phishnet: predictive blacklisting to detect phishing attacks. In: INFOCOM, 2010 proceedings IEEE, IEEE, pp 1–5. https://doi.org/10.1109/INFCOM.2010.5462216
10. Almomani A, Wan TC, Altaher A, Manasrah A (2012) Evolving fuzzy neural network for phishing emails detection. J Comput Sci 8(7):1099
11. Joshi Y, Saklikar S, Das D, Saha S (2008) Phishguard: a browser plug-in for protection from phishing. In: Internet multimedia services architecture and applications, 2008. IMSAA 2008. 2nd international conference on, IEEE, pp 1–6. https://doi.org/10.1109/IMSAA.2008.4753929
12. Chou N, Ledesma R, Teraguchi Y, Mitchell JC et al (2004) Client-side defense against web-based identity theft. In: NDSS. http://www.isoc.org/isoc/conferences/ndss/04/proceedings/Papers/Chou.pdf
13. Shahriar H, Zulkernine M (2012) Trustworthiness testing of phishing websites: a behavior model-based approach. Future Gener Comput Syst 28(8):1258–1271. https://doi.org/10.1016/j.future.2011.02.001
14. Rao RS, Ali ST (2015) Phishshield: a desktop application to detect phishing webpages through heuristic approach. Proc Comput Sci 54:147–156. https://doi.org/10.1016/j.procs.2015.06.017
15. Srinivasa Rao R, Pais AR (2017) Detecting phishing websites using automation of human behavior. In: Proceedings of the 3rd ACM workshop on cyber-physical system security (CPSS '17), ACM, New York, NY, USA, pp 33–42. https://doi.org/10.1145/3055186.3055188
16. Fu AY, Wenyin L, Deng X (2006) Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD). IEEE Trans Dependable Secur Comput 3(4):301–311
17. Wenyin L, Huang G, Xiaoyue L, Min Z, Deng X (2005) Detection of phishing webpages based on visual similarity. In: Special interest tracks and posters of the 14th international conference on World Wide Web, ACM, pp 1060–1061
18. Hara M, Yamada A, Miyake Y (2009) Visual similarity-based phishing detection without victim site information. In: Computational intelligence in cyber security, 2009. CICS'09. IEEE symposium on, IEEE, pp 30–36. https://doi.org/10.1109/CICYBS.2009.4925087
19. Rao RS, Ali ST (2015) A computer vision technique to detect phishing attacks. In: Communication systems and network technologies (CSNT), 2015 fifth international conference on, IEEE, pp 596–601. https://doi.org/10.1109/CSNT.2015.68
20. Whittaker C, Ryner B, Nazif M (2010) Large-scale automatic classification of phishing pages. In: NDSS '10. http://www.isoc.org/isoc/conferences/ndss/10/pdf/08.pdf
21. Khonji M, Iraqi Y, Jones A (2013) Phishing detection: a literature survey. IEEE Commun Surv Tutor 15(4):2091–2121. https://doi.org/10.1109/SURV.2013.032213.00009
22. Huh JH, Kim H (2011) Phishing detection with popular search engines: simple and effective. In: International symposium on foundations and practice of security. Springer, pp 194–207. https://doi.org/10.1007/978-3-642-27901-0_15
23. Zhang Y, Hong JI, Cranor LF (2007) Cantina: a content-based approach to detecting phishing web sites. In: Proceedings of the 16th international conference on World Wide Web, ACM, pp 639–648. https://doi.org/10.1145/1242572.1242659
24. Pan Y, Ding X (2006) Anomaly based web phishing page detection. In: Proceedings-annual computer security applications conference, ACSAC, vol 6, pp 381–392. https://doi.org/10.1109/ACSAC.2006.13
25. APWG (2014) Global phishing reports 1st half 2014. http://docs.apwg.org/reports/APWG_Global_Phishing_Report_1H_2014.pdf. Accessed 01 June 2016
26. He M, Horng SJ, Fan P, Khan MK, Run RS, Lai JL, Chen RJ, Sutanto A (2011) An efficient phishing webpage detector. Expert Syst Appl 38(10):12018–12027. https://doi.org/10.1016/j.eswa.2011.01.046
27. Xiang G, Hong J, Rose CP, Cranor L (2011) Cantina+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur (TISSEC) 14(2):21. https://doi.org/10.1145/2019599.2019606
28. Miyamoto D, Hazeyama H, Kadobayashi Y (2008) An evaluation of machine learning-based methods for detection of phishing sites. In: International conference on neural information processing. Springer, pp 539–546. https://doi.org/10.1007/978-3-642-02490-0_66
29. Zhang D, Yan Z, Jiang H, Kim T (2014) A domain-feature enhanced classification model for the detection of Chinese phishing e-business websites. Inf Manag 51(7):845–853. https://doi.org/10.1016/j.im.2014.08.003
30. Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 453–469
31. Zhang L, Suganthan PN (2014) Random forests with ensemble of feature spaces. Pattern Recogn 47(10):3429–3437
32. Zhang L, Suganthan PN (2015) Oblique decision tree ensemble via multisurface proximal support vector machine. IEEE Trans Cybern 45(10):2165–2176
33. Gowtham R, Krishnamurthi I (2014) A comprehensive and efficacious architecture for detecting phishing webpages. Comput Secur 40:23–37. https://doi.org/10.1016/j.cose.2013.10.004
34. Mohammad RM, Thabtah F, McCluskey L (2014) Predicting phishing websites based on self-structuring neural network. Neural Comput Appl 25(2):443–458
35. Chiew KL, Chang EH, Tiong WK et al (2015) Utilisation of website logo for phishing detection. Comput Secur 54:16–26. https://doi.org/10.1016/j.cose.2015.07.006
36. Moghimi M, Varjani AY (2016) New rule-based phishing detection method. Expert Syst Appl 53:231–242. https://doi.org/10.1016/j.eswa.2016.01.028
37. Aggarwal A, Rajadesingan A, Kumaraguru P (2012) Phishari: automatic realtime phishing detection on twitter. In: eCrime researchers summit (eCrime), 2012, IEEE, pp 1–12
38. Abdelhamid N, Ayesh A, Thabtah F (2014) Phishing detection based associative classification data mining. Expert Syst Appl 41(13):5948–5959
39. Dietterich TG (2000) Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer, pp 1–15
40. Ren Y, Zhang L, Suganthan PN (2016) Ensemble classification and regression-recent developments, applications and future directions [review article]. IEEE Comput Intell Mag 11(1):41–53
41. Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems. J Mach Learn Res 15(1):3133–3181
42. Akinyelu AA, Adewumi AO (2014) Classification of phishing email using random forest machine learning technique. J Appl Math 2014:6. https://doi.org/10.1155/2014/425731
43. Fette I, Sadeh N, Tomasic A (2007) Learning to detect phishing emails. In: Proceedings of the 16th international conference on World Wide Web, ACM, pp 649–656
44. Dewan P, Kumaraguru P (2015) Towards automatic real time identification of malicious posts on facebook. In: Privacy, security and trust (PST), 2015 13th annual conference on, IEEE, pp 85–92
45. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
46. Mangasarian OL, Wild EW (2006) Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans Pattern Anal Mach Intell 28(1):69–74
47. Manwani N, Sastry P (2012) Geometric decision tree. IEEE Trans Syst Man Cybern Part B (Cybernetics) 42(1):181–192
48. Mohammad RM, Thabtah F, McCluskey L (2012) An assessment of features related to phishing websites using an automated technique. In: Internet technology and secured transactions, 2012 international conference for, IEEE, pp 492–497
49. Mohammad RM, Thabtah F, McCluskey L (2014) Intelligent rule-based phishing websites classification. IET Inf Secur 8(3):153–160
50. Basnet RB, Sung AH, Liu Q (2011) Rule-based phishing attack detection. In: International conference on security and management (SAM 2011), Las Vegas, NV
51. Garera S, Provos N, Chew M, Rubin AD (2007) A framework for detection and measurement of phishing attacks. In: Proceedings of the 2007 ACM workshop on recurring malcode, ACM, pp 1–8
52. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
