https://doi.org/10.1007/s00521-017-3305-0
ORIGINAL ARTICLE
Received: 19 July 2016 / Accepted: 27 December 2017 / Published online: 6 January 2018
The Natural Computing Applications Forum 2018
Abstract
Phishing is a cyber-attack that targets naive online users and tricks them into revealing sensitive information such as a username, password, social security number or credit card number. Attackers fool Internet users by disguising a webpage as a trustworthy or legitimate page in order to retrieve personal information. Many anti-phishing solutions have been proposed to date, such as blacklist- or whitelist-based, heuristic and visual similarity-based methods, but online users are still being trapped into revealing sensitive information on phishing websites. In this paper, we propose a novel classification model based on heuristic features extracted from the URL, source code and third-party services to overcome the disadvantages of existing anti-phishing techniques. Our model has been evaluated using eight different machine learning algorithms, of which the Random Forest (RF) algorithm performed the best with an accuracy of 99.31%. The experiments were repeated with different (orthogonal and oblique) random forest classifiers to find the best classifier for phishing website detection. Principal component analysis Random Forest (PCA-RF) performed the best of all oblique Random Forests (oRFs) with an accuracy of 99.55%. We also tested our model with and without third-party-based features to determine the effectiveness of third-party services in the classification of suspicious websites, and compared our results with the baseline models (CANTINA and CANTINA+). Our proposed technique outperformed these methods and also detected zero-day phishing attacks.
Keywords Cyber-attack · Phishing · Anti-phishing · Heuristic technique · Machine learning algorithms · Random Forest · Oblique Random Forest
3852 Neural Computing and Applications (2019) 31:3851–3873
attacks taking place in the world. According to the latest APWG (2016) Q4 report [2], 1,220,523 phishing attacks were recorded in 2016, the highest number in any year since monitoring began in 2004. The statistics from 2010 to 2016 can be seen in Fig. 1.
According to [3], online users fall for phishing due to various factors such as:
1. Inadequate knowledge of computer systems.
2. Inadequate knowledge of security and security indicators. (In the current scenario, even the indicators are being spoofed by phishers.)
3. Inadequate attention to warnings, and proceeding further by underestimating the strength of existing tools (abnormal behavior of toolbars).
4. Inadequate attention to visually deceptive text in the URL and website content. (For example, the text paypal is replaced with paypa1 in the URL in the address bar, or in the copyright or title of the phishing website.)
Phishing attacks take place through various forms such as email, websites and malware. To perform email phishing, attackers design fake emails which claim to come from a trusted company. They send these fake emails to millions of online users, assuming that at least thousands of legitimate users will fall for them.
In website phishing, the attacker builds a website which looks like a replica of a legitimate site and draws the online user to it either through advertisements on other websites or through social networks such as Facebook and Twitter. Some attackers are able to equip their phishing websites with security indicators such as the green padlock and an HTTPS connection. In Fig. 2, we can see that an attacker is able to get HTTPS and a green padlock for his phishing website. Hence, an HTTPS connection is no longer sufficient to decide the legitimacy of a website.
In malware phishing, the attacker inserts malicious software such as a Trojan horse into a compromised legitimate site without the knowledge of the victim. According to the APWG (2016) report [4], 20 million new malware samples were captured in the first quarter of 2016. The vast majority of recent malware is multifunctional, i.e., it steals information, makes the victim's system part of a botnet, or downloads and installs other malicious software without notifying the client [5].
Spear phishing [6] targets a specific group of people or a community belonging to an organization or a company. The attackers send emails which pretend to come from a colleague, manager or higher official of the company, requesting sensitive data related to the company. The main intention of general phishing is financial fraud, whereas that of spear phishing is the collection of sensitive information. Whaling [6] is a type of spear phishing where attackers target bigger fish such as executive officers or high-profile targets in private business, government agencies or other organizations.
Many anti-phishing techniques have been proposed in the literature to detect and prevent phishing attacks. We have categorized these techniques into four categories.
Fig. 1 Phishing attacks from 2010 to 2016 (x-axis: Year; y-axis: No. of attacks × 10^5)
Fig. 2 Phishing website of PayPal with HTTPS protocol
• List-based techniques: Most modern browsers such as Chrome, Firefox and Explorer follow list-based techniques to block phishing sites. There are two types of list-based techniques: whitelist and blacklist. The whitelist [7] contains a list of legitimate URLs which can be accessed by the browser. The browser downloads a website only if its URL is present in the whitelist. Due to this behavior, even legitimate websites which are not whitelisted are blocked, resulting in high false positives. The blacklist [8, 9] contains phishing or malicious URLs whose webpages the browser is blocked from downloading. Due to this behavior, phishing sites which are not blacklisted are still downloaded by the browser, resulting in high false negatives. These non-blacklisted phishing sites are also called zero-day phishing sites [10]. A small change in the URL is sufficient to bypass list-based techniques. Frequent updating of these lists is mandatory to counter new phishing sites.
• Heuristic-based techniques [11–15]: These techniques use features extracted from the phishing website to detect a phishing attack. Some phishing sites do not have the common features, resulting in a poor detection rate with this mechanism. As this approach does not use list-based comparison, it results in fewer false positives and fewer false negatives. But it has lower accuracy compared to list-based techniques, as there is no guarantee that these features exist in all phishing websites. An attacker can bypass the heuristic features once he knows the algorithm or the features used in detecting phishing sites, and thereby reach his goal of stealing sensitive information. This technique detects zero-day phishing attacks which the list-based techniques fail to detect.
• Visual similarity-based approach [16–19]: The main objective of the phisher is to deceive the user by designing an exact image of a legitimate site such that the user does not become suspicious of the phishing site. Hence, these anti-phishing techniques compare the image of the suspicious website with a database of legitimate images to get a similarity ratio, which is used for the classification of suspicious websites. The website is classified as phishing when the similarity score is greater than a certain threshold; otherwise it is treated as legitimate. The limitations of this technique are as follows: (1) Comparing the image of the suspicious website with the entire legitimate database has a high time complexity. (2) More space is needed to store the legitimate image database. (3) Comparing an animated web page with a phishing website yields a low similarity percentage, which leads to a high false negative rate. This technique fails when the background of the web page is slightly changed without deviating from the visual appearance of the legitimate site.
• Machine learning-based techniques: Nowadays, most researchers are concentrating on the use of machine learning (ML) algorithms applied to features extracted from websites to detect phishing attacks. These techniques are a combination of heuristic methods and machine learning algorithms, i.e., the dataset used by the machine learning algorithms is extracted through heuristic methods. Some of the machine learning algorithms are sequential minimal optimization (SMO), J48 tree, Random Forest (RF), logistic regression (LR), multilayer perceptron (MLP), Bayesian network (BN), support vector machine (SVM) and AdaBoostM1. As ML-based techniques are based on heuristic features, they are able to identify zero-day phishing attacks, which makes them more advantageous than list-based techniques. These techniques work efficiently on large sets of data. According to [20], it is possible to achieve more than 99% true positive rate and less than 1% false positive rate. According to [21], ML-based classifiers are the classifiers which achieved more than 99% accuracy. This made us concentrate on using machine learning algorithms as classifiers. The other side of ML-based techniques is that their performance depends on the size of the training data, the feature set and the type of classifier.
Some of the existing techniques [22–24] use third-party-based features such as search engine, WHOIS or page ranking features to detect phishing attacks. These techniques fail when attackers use compromised domains for hosting their phishing pages. Figure 3 shows the original website of a compromised domain and a phishing page, hosted on the compromised domain, which targets the HM Revenue and Customs website. As per the statistics of the APWG report 1H2014 [25], the life spans of phishing sites have a median of less than 10 hours. This report also specifies that half of the phishing sites went offline in less than a day, but most of the phishing pages using compromised domains stay alive on the Internet for more than a day, which can be seen in Fig. 4. From the figure, we can also note that the phishing site is still alive after 9 days. Another APWG report [2] recorded an increase of 65% in total phishing attacks compared to 2015. The statistics from Fig. 1 and the above APWG reports reveal that the existing techniques are not able to prevent phishing attacks. Hence, there is a need for designing better mechanisms for detecting phishing websites.
Heuristic techniques exploit the unique characteristics of phishing websites and consider them as a common parameter for detecting phishing websites. As heuristic techniques have the advantage of detecting zero-day
attacks and have been used by major browsers and antivirus software in real time [21], we have directed our approach toward the heuristic mechanism. This technique has a risk of misclassifying legitimate sites as phishing sites (false positives), since the characteristics are not guaranteed to always exist in phishing sites. In this paper, we attempt to add new heuristic features (in addition to the existing features) with machine learning algorithms to reduce the false positives in detecting new phishing sites. We have also made an attempt to identify the best machine learning algorithm to detect phishing sites with higher accuracy than the existing techniques. We have used eight machine learning algorithms (J48 decision tree, sequential minimal optimization (SMO), logistic regression (LR), multilayer perceptron (MLP), Bayesian network (BN), Random Forest (RF), support vector machine (SVM) and AdaBoostM1) to classify websites as legitimate or phishing. Based on the experimental observations, Random Forest outperformed the others. The choice of these machine learning algorithms is based on the classifiers used in the recent literature [24, 26–29].
Oblique Random Forests (oRF) perform better than orthogonal Random Forests [30]. Hence, we compared RF with the novel oRF models proposed in [30–32] to identify the best Random Forest classifier. From the experimental comparison, we observed that principal component analysis Random Forest (PCA-RF) performed the best out of all oRFs. It should be noted that orthogonal Random Forest is simply referred to as Random Forest throughout this paper. To the best of our knowledge, our work is the first to use the oblique Random Forests (oRF) algorithm for classification of phishing websites.
Our paper makes the following three research contributions. First, we propose five novel heuristic features (HF3, HF4, HF6, HF7, HF8) capturing key characteristics of phishing sites using the source code of a website. Second, we have experimentally shown that third-party services like the search engine, WHOIS database and Alexa page ranking complement the heuristic techniques in detecting phishing. The results of our proposed technique illustrate that combining third-party-based features with the proposed features gives a classification accuracy of 99.3% with RF and 99.55% with PCA-RF as classifiers. Third, we propose a feature (HF7) which detects phishing pages when the entire website is replaced with a single image. One major advantage of this method is that it does not require any legitimate image store for identifying single image-based phishing, whereas traditional visual similarity anti-phishing techniques need an image data store to compare the logos or the entire webpage screenshot.
The rest of the paper is organized as follows: Section 2 reviews a set of anti-phishing techniques based on the machine learning approach and also gives a brief overview of various oblique Random Forest algorithms (oRF). Section 3 explains the proposed work of our model. Implementation details are discussed in Sect. 4 and the evaluation of various experiments is given in Sect. 5. Discussion is presented in Sect. 6, and limitations of the proposed approach are given in Sect. 7. Finally, the conclusion and future work are given in Sect. 8.

2 Related work

As our proposed method fits into the machine learning-based techniques, we describe some of the existing machine learning-based techniques that are used in the detection of phishing websites.
Pan and Ding [24] proposed an anti-phishing approach based on the extraction of website identity from DOM objects such as title, keyword, request URL, anchor URL and main body. When a phishing site fakes a legitimate site, it claims a false identity and demonstrates abnormal behavior compared to the legitimate site. They experimented with ten features using a support vector machine classifier for the classification of data. This approach fails when the attacker hosts a phishing page on a compromised domain.
The CANTINA [23] technique operates on the textual content present in the website. The term frequency and inverse document frequency (TF-IDF) algorithm is applied to the textual website content to extract the words with high TF-IDF scores, which are fed to the search engine to identify a phishing site. It also includes additional heuristic features and a simple forward linear model for classification of websites. As the performance of this approach depends on the textual content of the suspicious site, it fails when the textual content is replaced with images. It also has a performance problem with respect to time since it depends on a third-party search engine service.
Miyamoto et al. [28] evaluated the performance of machine learning algorithms in classifying websites as legitimate or phishing. The authors incorporated all the features of the CANTINA model to detect phishing websites. They employed nine machine learning algorithms for the classification of websites. Among all of them, the AdaBoostM1 algorithm performed the best with the lowest error rate of 14.15%. This technique has the same limitation as the CANTINA approach, i.e., it fails when the textual content is replaced with an image.
Xiang et al. [27] proposed a comprehensive feature-based approach which uses machine learning algorithms such as Bayesian networks for classification of phishing websites. This technique is called CANTINA+, which is an extension to CANTINA including eight novel features such as HTML DOM features, third-party services and search
engine-based features. The datasets were divided into two parts: first, the unique testing phish dataset, which resulted in 92% detection accuracy when fed to the machine learning algorithm; second, the near-duplicate testing phish dataset, which resulted in 99% detection accuracy when fed to the machine learning algorithm. The authors used six learning algorithms to train on the dataset and calculate the performance of the model. The Bayesian network algorithm performed the best in comparison with the other algorithms. This technique fails to detect phishing sites when the textual content of the website is replaced by an image.
He et al. [26] proposed a heuristic-based technique which uses a combination of the CANTINA, Anomaly and PILFER methods to detect phishing sites. The authors extracted twelve features from the suspicious website and fed them to a support vector machine for classification of phishing and legitimate websites. As this technique adopted its features either from CANTINA or from [24], it faces the same limitation as the CANTINA technique. It also suffers from high time complexity in loading the webpage due to the use of third-party services like search engines and page ranking.
Zhang et al. [29] proposed a model based on URL and website content to detect Chinese phishing e-business websites. The authors incorporated fifteen domain-specific features for detecting phishing attacks on Chinese e-business websites. They evaluated the model with four different machine learning algorithms for classification of phishing sites. The sequential minimal optimization (SMO) algorithm performed best in detecting phishing sites with an accuracy of 95.83%. The limitation of this approach is that it targets only one domain of phishing sites, i.e., it may not work efficiently with non-Chinese websites.
Gowtham and Krishnamurthi [33] proposed a heuristic-based technique combined with a machine learning classifier to detect phishing sites. They used a private whitelist and login form identifier filters to reduce the false positive rate and the computation cost of the system. A support vector machine algorithm is used as the classifier for classification of phishing websites. The authors extracted 15 features from URL characteristics, URL identity, keyword identity set and domain name analysis to form a feature vector. This feature vector is fed to the SVM to calculate the performance of the model. The proposed model achieved a true positive rate of 99.65% and a false positive rate of 0.42%. As this technique depends on the textual content of a website, it might not work effectively when the entire website content is replaced with a single image.
Mohammad et al. [34] presented an intelligent model for detecting phishing websites based on self-structuring neural networks. The authors extracted seventeen features from the URL and website content, which are used as input to artificial neural networks for classification of websites. This technique has the advantage of adapting its neural network to changes in phishing characteristics. For better performance, this model has to be retrained frequently with a fresh and up-to-date training dataset.
Chiew et al. [35] presented a visual similarity-based technique where the logo of a suspected website is extracted using a machine learning technique and fed to Google image search to retrieve the target identity. The suspected site is termed phishing when its actual domain is not equal to the domain returned by the Google image search. This technique depends on the success ratio of logo extraction through machine learning. It fails to detect phishing attacks on websites without a logo.
Moghimi and Varjani [36] proposed a rule-based method and developed it as a browser extension to detect phishing sites. This method is based on features extracted from website content and uses the support vector machine (SVM) algorithm to classify phishing websites. As this approach is solely dependent on website content, it fails when the attacker redesigns the website with modified visual content such that the brand name is misspelled or trimmed and all links are changed to its local domain. This technique only targets phishing sites imitating legitimate bank websites.
Whittaker et al. [20] proposed a blacklist technique which extracts common features of known phishing websites and uses these features for the detection of phishing sites. They used a Random Forest classifier for the detection of phishing sites. The authors claim that RF performs the best with noisy live phishing data and that they are able to achieve a false positive rate lower than 0.1%. As this technique depends on a blacklist of phishing data, it fails to detect phishing sites which are not blacklisted.
Aggarwal et al. [37] proposed a technique called PhishAri which detects phishing URLs in tweets. They used URL, WHOIS, tweet and network-based features with a Random Forest classifier for the detection of malicious URLs. As this technique depends solely on the URL text and not on the content of a website, it might fail to give effective results when the phishing URL is hosted on a compromised domain.
Table 1 provides a summary of these works in comparison to our work with respect to five features. These include detection of phishing sites that replace textual content with an image (Image-based phishing), detection of phishing sites in which most of the hyperlinks are directed toward a common page (Common Page Detection), detection of phishing sites that are hosted in any language (language independence), detection of phishing sites that consist mostly of broken links (Broken links) and the classifiers used for classification of phishing sites.
Attackers are using common anchor links in the entire website such that anti-phishing techniques [14, 24, 26, 29]
depending on NULL anchor (#) links are bypassed. Similarly, they also insert broken anchor links of the local domain in the entire website to bypass the anti-phishing techniques [24, 26, 29, 34, 38] that depend on the foreign anchors feature. Using our proposed features, we are able to detect, with very high accuracy, phishing sites which are not detected by existing techniques; hence, these features can be used for better classification of phishing websites.
The major classifiers of phishing detection models used in the literature are support vector machine, neural networks, sequential minimal optimization and AdaBoost. From Fig. 1, it is evident that phishing attacks are still on the rise despite the best efforts to detect them. Hence, the goal of our work is to find the best machine learning classifier that can improve the accuracy of the phishing detection system.
From [39], it is noted that ensembles perform better than any single classifier. An ensemble of classifiers is a set of classifiers whose individual decisions are combined either by weighted or unweighted voting to classify the testing data. The accuracy of the ensemble method depends on the diversity of the set of classifiers (i.e., they make different errors at new data points). The authors also reviewed various methods to generate ensembles that can be applied to different machine learning algorithms. The methods for constructing ensembles include manipulating the training dataset, the input features or the output targets, and injecting randomness. The authors concluded that random trees perform best when encountering noisy data, due to the diversity of the training data.
A recent survey [40] explores recent developments in ensemble classification and regression methods and their applications. The survey covers traditional and state-of-the-art ensemble methods and categorizes them into (a) conventional methods such as bagging, boosting and Random Forest, and (b) deep learning-based ensemble methods. The authors provided different variations of ensemble methods and also explored extended versions of Random Forests, such as oblique Random Forest with ridge regression, support vector machine, principal component analysis (PCA) and linear discriminant analysis (LDA).
From [41], it is observed that Random Forest performs the best among all 179 classifiers on real-world classification problems. RF is also being used for detecting phishing emails [42, 43], phishing tweets [37] and malicious posts on Facebook [44]. Whittaker et al. [20] used a Random Forest classifier for automatically maintaining Google's phishing blacklist. Based on [20, 37, 42, 43], we chose the Random Forest classifier for classification of data using our novel proposed features. In our proposed technique, we have also used different oblique Random Forests (oRF) for phishing detection. In the next subsection, we provide a brief overview of different Random Forest algorithms.
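The weighted or unweighted voting by which an ensemble combines its members' decisions, as summarized above from [39], can be illustrated with a short sketch. The function name majority_vote and the example labels and weights are illustrative assumptions and are not taken from the reviewed works.

```python
from collections import Counter


def majority_vote(predictions, weights=None):
    """Combine class labels predicted by several classifiers into one label.

    predictions: one predicted label per classifier in the ensemble.
    weights: optional per-classifier weights; None means unweighted voting.
    """
    if weights is None:
        weights = [1.0] * len(predictions)  # every classifier counts equally
    totals = Counter()
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # The label that accumulated the largest total weight wins.
    return max(totals, key=totals.get)


# Hypothetical usage: three classifiers vote on one website.
label = majority_vote(["phishing", "legitimate", "phishing"])
```

With unweighted voting each classifier contributes one vote; passing weights lets a more trusted classifier outvote several weaker ones.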
<Protocol>://<Subdomain>.<Primary domain>.<gTLD>.<ccTLD>/<Path domain>
where gTLD is the generic top level domain and ccTLD is the country code top level domain. For example, in the URL http://www.reg.signin.nitk.com.pk/secure/login/web/index.php, we have the protocol as http, subdomain as reg.signin, primary domain as nitk, gTLD as com, ccTLD as pk, domain as nitk.com.pk, hostname as reg.signin.nitk.com.pk and pathname as secure/login/web/index.php.
The existing features which identify visually deceptive text techniques in the URL of a website are described below. In particular, features UF2, UF4 and UF5 are adopted from [23, 48, 49]; feature UF3 is adopted from [48–50]; and feature UF1 is a variant of one feature in CANTINA [23].
1. UF1-Dots in Hostname: This feature counts the number of dots in the hostname of a URL to detect the status of a page as a phishing or legitimate site. Usually, a URL contains a primary domain, a subdomain, and a gTLD or ccTLD or both. Attackers add the domain of a legitimate website as a subdomain of a phishing website. Thus, users fall for phishing by seeing the legitimate domain in the phishing URL. For example, let us consider a phishing URL targeting the ebay website: http://www.reg.signin.ebay.secure.web.login.user.phish.com/index.html. Here, ebay is added in the subdomain of the phishing website (phish.com). The attacker includes a larger number of dots to hide the actual domain of the phishing site. The larger the number of dots in the hostname, the higher the probability that the website is a phishing site.
$$\mathrm{UF1} = \text{Number of dots in hostname} \qquad (1)$$
2. UF2-URL with @ symbol: This feature checks for the presence of the special character @ in the URL. It is a binary feature which is set to 1 if @ is present in the URL and to 0 otherwise. Phishers add the special symbol @ after the target domain, followed by the actual phishing domain, giving the user the impression of a legitimate URL. For example, consider the phishing URL http://www.signin.ebay.com/@www.phishing.com/abs/def/index.html. The user gets fooled into clicking on this link and is directed to the phishing.com website. When the browser encounters the @ symbol in a URL, it ignores everything prior to the @ symbol and downloads the real address which follows the @ symbol.
$$\mathrm{UF2} = \begin{cases} 1 & \text{if @ present in URL} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
3. UF3-Lengthy URL to hide suspicious domain: This feature counts the number of characters in the hostname to detect the status of a website. This feature is similar to feature UF1 except that the phisher uses a long URL, adding more characters to the URL to hide the actual phishing domain of the website. The larger the number of characters, the higher the possibility that the website is a phishing site.
$$\mathrm{UF3} = \text{Total length of hostname in URL} \qquad (3)$$
4. UF4-Using IP Address: This is a binary feature which checks for the presence of an IP address in a given URL. If present, the feature is set to 1; otherwise it is set to 0. Most legitimate sites do not use an IP address as the URL to download a webpage. Hence, when the user encounters an IP address, he can be almost sure that someone is trying to steal his/her sensitive information.
$$\mathrm{UF4} = \begin{cases} 1 & \text{if IP present} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
5. UF5-Presence of HTTPS (HTTP plus SSL): This feature checks for the presence of HTTPS in a given URL. If the URL contains HTTPS, the feature is set to 0; otherwise it is set to 1. Most legitimate websites use an HTTPS connection when sensitive information needs to be transmitted. This feature alone is not enough for detecting phishing sites because attackers are also able to obtain an HTTPS connection for their phishing websites, as shown in Fig. 2.
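As a rough illustration of the URL-based features UF1 to UF5 described in this list, the following sketch computes them with Python's standard urllib.parse and ipaddress modules. The function name url_features and the returned dictionary layout are assumptions made for this example and are not the authors' implementation; UF5 follows the rule stated in item 5 (0 when the URL uses HTTPS, 1 otherwise).

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute the URL-based features UF1-UF5 for a single URL (sketch)."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""  # lower-cased host, without port or user info

    # UF4: 1 if the hostname is a literal IPv4 or IPv6 address, else 0.
    try:
        ipaddress.ip_address(hostname)
        has_ip = 1
    except ValueError:
        has_ip = 0

    return {
        "UF1": hostname.count("."),                # number of dots in hostname
        "UF2": 1 if "@" in url else 0,             # '@' anywhere in the URL
        "UF3": len(hostname),                      # total length of hostname
        "UF4": has_ip,                             # IP address used as host
        "UF5": 0 if parts.scheme == "https" else 1,  # 0 when HTTPS is present
    }


# Hypothetical usage with the phishing URL given in the UF1 example.
features = url_features(
    "http://www.reg.signin.ebay.secure.web.login.user.phish.com/index.html"
)
```

The counts UF1 and UF3 are left unthresholded here, since the text treats them as numeric features for the classifier rather than as binary flags.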
0 if HTTPS present of website to get unique identity terms of a suspicious
UF5 ¼ ð5Þ website. However, CANTINA fails when TF-IDF
1 otherwise
extracts wrong keywords as website identity keywords.
Hence, we skipped TF-IDF technique to calculate
3.1.2 Third-party service-based features website identity terms and considered Title, Descrip-
tion and Copyright as website identifiers in our
In this section, we explain the existing third-party service- proposed model. This feature queries the search engine
based features used in our detection model. These features with Title or Copyright or Description of a suspicious
have been significant in detecting phishing websites. In website and classifies the status of website based on the
particular, feature TF1 is adopted from [23, 48, 49]; feature TF2 is adopted from [48–51]; and feature TF3 is adopted from CANTINA+ [27].

1. TF1-Age of Domain: This feature calculates the age of a website's domain, which is extracted from the WHOIS database (footnote 2). The rationale behind this feature is that most phishing websites are hosted on newly registered domains, resulting in the absence of entries in the WHOIS database or a domain age of less than a year. But this is not always guaranteed in phishing sites, because phishers might also host their phishing page on a compromised domain as shown in Fig. 3. The importance of this feature depends on the number of compromised domains used by the phishers.

$$TF1 = \begin{cases} 0 & \text{if DNS record absent} \\ \text{Age} & \text{otherwise} \end{cases} \tag{6}$$

2. TF2-Page Rank: As phishing sites have a short lifespan and less traffic compared to legitimate sites, these sites are either not recognized or rated least popular by page ranking services like Alexa (footnote 3). This feature calculates the page rank of a suspected website in the Alexa database. To obtain the PageRank score from the Alexa PageRank system, we send a normal HTTP request to http://data.alexa.com/data?cli=10&url= followed by the domain, and use an XML parser to get the Alexa ranking.

$$TF2 = \begin{cases} 0 & \text{if Page Rank absent} \\ \text{Page Rank} & \text{otherwise} \end{cases} \tag{7}$$

3. TF3-Website in Search Engine Results: Keywords which uniquely define a website are fed to a search engine to display the related URLs as results. The webpage is classified as legitimate if its base domain matches any of the domains of the top N search results; otherwise it is classified as a suspicious site. CANTINA used this feature for detection of a phishing attack by feeding highly ranked term frequency inverse document frequency (TF-IDF) terms to the Google Search engine. TF-IDF algorithm is applied on the entire text [...] presence of its domain in top N search results. We considered N as 10 based on the experimental results of CANTINA+. Due to the deprecation of the Google search API (footnote 4), we considered the Bing search engine for the evaluation of this feature.

$$TF3_i = \begin{cases} 0 & \text{if local domain matched with top } N \text{ search results} \\ 1 & \text{otherwise} \end{cases} \tag{8}$$

where i = 1, 2, 3 for title, copyright and description, respectively.

3.1.3 Hyperlink-based features

In this section, we explain the features which are extracted from hyperlinks located in the source code of a website. A link or hyperlink is an element in an electronic document that connects one web source to another. The web source can be an image, a program, an HTML document or an element within an HTML document.

To imitate the behavior of a trusted website, phishing websites would impersonate the links or hyperlinks which point to the trusted website domain. This implies that phishing pages regularly contain hyperlinks that point to a foreign domain.

In this paper, we consider links pointing to the local domain as local links and links pointing to a foreign domain as foreign links. For any legitimate site, the number of local links would be greater than the number of foreign links. Hence, we consider this as one of the features to classify a suspected site as a legitimate or phishing site. Similarly, we extract various URL features such as script resource URLs, CSS URLs and image source URLs to distinguish the phishing site from the legitimate site. In particular, HF1, HF2 and HF5 are adopted from [24, 48]; and features HF3, HF4, HF6, HF7 and HF8 are the novel ones we proposed.

1. HF1-Frequency of domain in anchor links: This feature examines all the anchor links in the source code of a website and compares the most frequent domain with the local domain of a website. If both the domains
2 https://www.internic.net/whois.html.
3 http://www.alexa.com/siteinfo/page-rank-calculator.com.
4 https://developers.google.com/web-search/.
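The third-party features TF1 to TF3 defined above (Eqs. 6 to 8) reduce to simple piecewise rules once the external lookups have been made. The following is a minimal sketch of those rules only, not the implementation used in this work: the WHOIS creation date, the Alexa rank and the search result URLs are passed in as plain arguments, the function names are hypothetical, and measuring the domain age in days is an assumption since the unit is not stated.

```python
from datetime import date
from typing import Iterable, Optional
from urllib.parse import urlparse


def tf1_age_of_domain(created: Optional[date], today: date) -> int:
    """Eq. (6): 0 if no record exists, otherwise the domain age (in days, an assumption)."""
    if created is None:
        return 0
    return max((today - created).days, 0)


def tf2_page_rank(rank: Optional[int]) -> int:
    """Eq. (7): 0 if the ranking service has no entry, otherwise the reported rank."""
    return 0 if rank is None else rank


def tf3_in_search_results(local_domain: str, result_urls: Iterable[str], n: int = 10) -> int:
    """Eq. (8): 0 if the local domain appears among the top-N result domains, else 1."""
    top = list(result_urls)[:n]
    hosts = {(urlparse(u).hostname or "").lower() for u in top}
    local = local_domain.lower()
    # A result matches if its host is the local domain or one of its subdomains.
    matched = any(h == local or h.endswith("." + local) for h in hosts)
    return 0 if matched else 1
```

TF3 would be evaluated three times per page, once for each query built from the title, copyright and description, giving TF3_1, TF3_2 and TF3_3.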
Neural Computing and Applications (2019) 31:3851–3873 3861
<a href="#"> or <a href="#content">

$$HF5 = \begin{cases} \dfrac{al_{nullw}}{al_{nw}} & \text{if } al_{nw} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

where $al_{nullw}$ is the number of null links and $al_{nw}$ is the total number of anchor links in a website.

6. HF6-NULL Links ratio in Footer: This feature extracts the ratio of null links in the footer section of a website to the total number of footer links in the website. In this feature, we identify links with a null value in the footer section rather than in the entire HTML, which reduces the false positive ratio and time complexity compared to the HF5 feature, since we deal with fewer links rather than all links in the website. Most legitimate websites do not often use null links in the footer section, but phishing sites usually insert null links in the footer section to make the user stay on the same page until the sensitive information is submitted. Figure 6 shows a screenshot of a phishing site along with a NULL footer link.

$$HF6 = \begin{cases} \dfrac{al_{nullf}}{al_{nf}} & \text{if } al_{nf} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

where $al_{nullf}$ is the number of null links in the footer and $al_{nf}$ is the total number of anchor links in the footer section of a website.

7. HF7-Presence of anchor links in HTML body: This feature checks for the presence of anchor links in the body section of the source code of a website. If they are absent, the feature is set to 1; otherwise it is set to 0. The rationale behind this feature is that there exists at least one anchor link such as signin, signup, forgot password or others in a legitimate site's login page, whereas in phishing websites attackers replace the entire legitimate site content with a single background image. We term these kinds of phishing sites Image-based Phishing. An attacker achieves this by taking a screenshot of the targeted legitimate site and embedding it as the background of the phishing page. On top of the screenshot, they insert input fields such as username or email, password and a Login button following the same styles as the targeted legitimate domain.

Figure 7 explains how the phishing is carried out by replacing textual content with a single background image. Figure 7a shows the background image used by
Fig. 6 Phishing site with null links, footer and copyright section
Fig. 7 a Background image used for phishing, b Phishing site without background image, c fully developed phishing site along with its source
code
the attacker. The location of the background image can be seen in the address bar: https://gator3231.hostgator.com/*cheaks/bost/FTEROO09K/img/bg.PNG. Figure 7b shows the form imitating the style of the legitimate site. This figure is taken by removing the background image from the source code of the phishing site. Figure 7c shows the fully developed phishing site with the background image and input fields embedded into the body of the website. The background image location can be seen in the inspect code of the website. The input fields are
properly aligned by adjusting the position of the fields with z-index, absolute position, height, width, top and left dimensions. For better understanding, sample source code of a phishing site is given below:

    <html lang="en">
    <head>...</head>
    <body style="background-image: url(img/bg.PNG); background-repeat: no-repeat;"> <!-- Background image inserted here -->
      <form method="post" action="LOGIN.php" style="z-index: 1; width: 395px; height: 232px; position: absolute; top: 208px; left: 211px">
        <div style="z-index: 1; width: 340px; height: 40px; position: absolute; top: 53px; left: 24px">
          <input style="z-index: 1; position: absolute; top: -3px; left: 0px; width: 338px; height: 38px" type="email">
        </div>
        ... <!-- similarly, the password field and submit button are also aligned to look like the legitimate site -->
      </form>
    </body>
    </html>

[...] comparison of RF with various oblique Random Forests for identifying the best RF classifier. The high performance of Random Forest is due to its ensemble of weak learning models (tree predictors) with diversity of training data among them. Note that combining multiple weak classifiers into one aggregated classifier leads to better classification performance than that of any individual classifier [52]. This is because the probability of making mistakes at the same place in all of the trees is very low. Hence, it achieves higher accuracy than an individual weak classifier.

4 Implementation
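As an illustration of how the hyperlink features HF5, HF6 (Eqs. 13 and 14) and HF7 could be computed from page source, the following minimal sketch uses only the Python standard library. It is not the extraction code used in this work. It assumes that a null link is an anchor whose href is empty or starts with "#" (matching the two examples given for HF5), and that the footer section is marked by a <footer> element; both are simplifying assumptions, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts anchor links overall, inside <footer> and inside <body>, plus null links."""

    def __init__(self):
        super().__init__()
        self.in_body = 0
        self.in_footer = 0
        self.total = self.null_total = 0
        self.footer = self.null_footer = 0
        self.body_anchors = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body += 1
        elif tag == "footer":
            self.in_footer += 1
        elif tag == "a":
            href = (dict(attrs).get("href") or "").strip()
            # Assumption: empty href or a pure fragment such as "#" or "#content" is a null link.
            is_null = href == "" or href.startswith("#")
            self.total += 1
            self.null_total += is_null
            if self.in_footer:
                self.footer += 1
                self.null_footer += is_null
            if self.in_body:
                self.body_anchors += 1

    def handle_endtag(self, tag):
        if tag == "body" and self.in_body:
            self.in_body -= 1
        elif tag == "footer" and self.in_footer:
            self.in_footer -= 1


def ratio(part, whole):
    """Piecewise ratio of Eqs. (13) and (14): part / whole if whole > 0, else 0."""
    return part / whole if whole > 0 else 0.0


def hyperlink_features(html):
    counter = AnchorCounter()
    counter.feed(html)
    return {
        "HF5": ratio(counter.null_total, counter.total),
        "HF6": ratio(counter.null_footer, counter.footer),
        "HF7": 0 if counter.body_anchors > 0 else 1,  # 1 means no anchor link in the body
    }
```

For an image-based phishing page like the sample source above, which contains a form over a background image and no anchor links, this sketch yields HF7 = 1.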
Random Forest & Oblique Random Forest (All types)
    Maximum depth: unlimited
    Number of trees to be generated (ntree): 100
    Number of attributes (mtry): 4
    Number of Iterations: 10
J48
    The confidence factor used for pruning: 0.25
    The minimum number of instances per leaf: 2
    Number of folds: 3
Logistic regression
    Maximum number of iterations: unlimited
    Ridge value in the log-likelihood: 1.0E-8
Bayes network
    Estimator: Simple estimator (alpha: 0.5)
    Search algorithm: hill climbing algorithm (max no of parents: 2)
Multilayer perceptron
    Learning rate: 0.3
    Momentum: 0.2
    Number of Epochs: 500
    Hidden layers: (attribs + classes)/2
    Validation set size: 0
    Validation threshold: 0
Sequential minimal optimization
    Complexity: 1
    Kernel: PolyKernel (Exponent: 1, Cache: 250007)
    Epsilon: 1.0E-12
    Tolerance parameter: 0.001
AdaBoostM1
    Classifier: decision stump
    Number of Iterations: 10
    Weight threshold: 100
Support vector machines
    Kernel: Radial basis kernel "Gaussian"
    Sigma: 0.05
    SVM type: C-SVC
    Epsilon: 0.1
    Cost of constraints violation (C): 5
websites. We compared all the classifiers with the primary goal of choosing the best classification model among them. The Random Forest algorithm performed the best with an average prediction accuracy of 99.31%, followed by the J48 decision tree algorithm with an accuracy of 98.98%, as shown in Table 4. All the machine learning algorithms classified the websites with an acceptable true positive rate, which indicates the goodness of our proposed features in classifying the websites. Figure 8 shows the prediction errors of the eight classifiers. From the figure, it is observed that the RF classifier achieved the least average prediction error rate.

5.1.1 Evaluation of heuristics with the inclusion and exclusion of proposed features

Figure 9 shows the performance of the proposed system with the inclusion and the exclusion of proposed features (HF3, HF4, HF6, HF7, HF8) using Random Forest
algorithm. From Experiment 1 results, it is evident that the Random Forest algorithm performs better than the other algorithms. Therefore, we used the same Random Forest algorithm to study the effect of our proposed features in detecting phishing websites. From Fig. 9, it is evident that the accuracy of the phishing detection system improved with the inclusion of the proposed features compared to the system without them. Hence, the proposed features complement the detection of phishing websites with a significant accuracy.

Fig. 9 Performance of the proposed system with the inclusion and the exclusion of proposed features

5.4 Experiment 4: Comparison of our work with other techniques

In this experiment, we implemented existing works such as CANTINA and CANTINA+ and tested them with the same dataset used in Experiment 1. We compared the performance of our work with these techniques with the aim of evaluating the performance of our features against the baseline models. This experiment demonstrates that our model offers more accuracy than these methods in detecting phishing websites. The Random Forest algorithm performed the best with an accuracy of 99.31%, followed by the J48 algorithm with an accuracy of 98.98%, as shown in Table 7.
5.5 Experiment 5: Comparison of RF with oRF

In this experiment, we evaluated the performance of various oblique Random Forest classifiers [30–32] and compared them with the orthogonal RF to identify the best classifier for detection of phishing sites.

We employed various oRF methods, namely MPRaF-P, MPRaF-T, PCA-RF and LDA-RF, proposed by Zhang and Suganthan [31, 32]. MPRaF-P and MPRaF-T denote the MPSVM-based oblique Random Forest with axis-parallel split and Tikhonov regularization, respectively; PCA-RF and LDA-RF are the PCA-based and LDA-based RF.
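The difference between an orthogonal and an oblique Random Forest lies in the test applied at each tree node. The following minimal sketch contrasts the two tests. It is only an illustration: in the oRF variants the weight vector would be derived by the respective method (for example PCA, LDA or MPSVM), whereas here it is simply passed in, and the function names are hypothetical.

```python
from typing import Sequence


def orthogonal_split(x: Sequence[float], feature: int, threshold: float) -> bool:
    """Axis-parallel node test of an orthogonal tree: a single attribute against a threshold."""
    return x[feature] <= threshold


def oblique_split(x: Sequence[float], weights: Sequence[float], threshold: float) -> bool:
    """Oblique node test: a linear combination of attributes against a threshold."""
    return sum(w * v for w, v in zip(weights, x)) <= threshold
```

An orthogonal split is the special case of an oblique split whose weight vector has a single non-zero entry equal to 1.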
We also employed the method proposed by Menze et al. [30] for simulating various oRFs such as (oRF-svm), (oRF-ridge), (oRF-ridge_slow), (oRF-pls), (oRF-log) and (oRF-rnd). The decision hyperplanes in these oRFs are based on support vector machine, ridge regression (fast and slower versions), partial least squares regression, logistic regression and randomness, respectively. All the variations of the above-mentioned oblique Random Forests and the orthogonal Random Forest were constructed with 100 decision trees (ntree = 100) and tuned with mtry = 4, where mtry is the number of attributes considered for each split. mtry is calculated as the square root of the total number of attributes:

$$\mathrm{mtry} = \max\left(\sqrt{\mathrm{ncol}(x)},\, 2\right) \tag{24}$$

Note that all the models were trained with 75% of the dataset and tested with the remaining dataset. The experiments were repeated ten times for each oblique Random Forest, and the average value is considered for the comparison. The results are shown in Table 8. From the results, it is observed that PCA-RF outperformed all of the other RF variants with an accuracy of 99.55%, followed by MPRaF-P with an accuracy of 99.43%. Figure 10 shows the prediction errors of the various oRF classifiers. From Fig. 10, it is evident that PCA-RF and MPRaF-P performed better than the orthogonal RF, with the least average prediction error rates. It may be observed that oRFs trained with pls, ridge, ridge_slow, svm and rnd achieved less accuracy than RF. This is because oRFs with the above training models outperform RF on heavily correlated and high dimensional data, but their performance falls below RF on factorial or discrete data [30]. Note that our extracted phishing website features are not heavily correlated and are mostly discrete.

The usage and popularity of Random Forest is increasing due to its flexibility and simplicity on complex data. It overcomes the problem of overfitting and handles sparse/missing data well.
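The mtry rule of Eq. (24) and the evaluation protocol described above (a 75%/25% split repeated ten times and averaged) can be sketched as follows. This is a minimal illustration under stated assumptions, not the experimental code: rounding the square root down to an integer is an assumption, the classifier is abstracted as a caller-supplied `train_and_score` function, and all names are hypothetical.

```python
import math
import random
from statistics import mean


def mtry(n_attributes: int) -> int:
    """Eq. (24): max(sqrt(ncol(x)), 2), with the square root rounded down (assumption)."""
    return max(math.floor(math.sqrt(n_attributes)), 2)


def repeated_holdout(samples, labels, train_and_score,
                     repeats=10, train_fraction=0.75, seed=0):
    """Average score over `repeats` random train/test splits of the data."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    cut = int(len(indices) * train_fraction)
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        train, test = indices[:cut], indices[cut:]
        # train_and_score fits a model on the training part and returns its test score.
        scores.append(train_and_score(
            [samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test]))
    return mean(scores)
```

With this rounding, any attribute count from 16 to 24 gives mtry = 4, which is consistent with the value mtry = 4 reported above.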
Sensitivity 0.9945 0.9672 0.9945 0.9945 0.9750 0.9874 0.9804 0.9891 0.9885 0.9925 0.9863
Specificity 0.9923 0.9865 0.9942 0.9961 0.9809 0.9364 0.9622 0.9424 0.9369 0.9407 0.9741
Precision 0.9891 0.9806 0.9918 0.9945 0.9723 0.9594 0.9748 0.9626 0.9591 0.9623 0.9817
ACC 0.9932 0.9785 0.9943 0.9955 0.9785 0.9668 0.9731 0.9705 0.9678 0.9720 0.9831
ERR 0.0068 0.0215 0.0057 0.0045 0.0215 0.0332 0.0269 0.0295 0.0322 0.0280 0.0169
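The five rows of the table above follow from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions and treats the phishing class as positive, which is an assumption about the convention used here; the function name is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    return {
        "Sensitivity": tp / (tp + fn),   # true positive rate
        "Specificity": tn / (tn + fp),   # true negative rate
        "Precision": tp / (tp + fp),
        "ACC": acc,
        "ERR": 1 - acc,                  # error rate is the complement of accuracy
    }
```

The relation ERR = 1 - ACC can be checked against every column of the table, for example 0.9955 + 0.0045 = 1.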
[Fig. 10: bar chart of the prediction error, in Percentage (%), of RF and the oRF classifiers (MPRaF-T, MPRaF-P, PCA-RF, LDA-RF, oRF-svm, oRF-ridge, oRF-ridge_slow, oRF-pls, oRF-log, oRF-rnd)]
[Figure: bar chart of the accuracy, in Percentage (%), obtained with each individual UF, TF and HF feature]
It is observed that some of the features outperformed other features with over 70% accuracy, including length of hostname (UF3), age of domain (TF1), Page Ranking (TF2), search engine title (TF3_1), search engine copyright (TF3_2), common page detection ratio in website (HF3), common page detection ratio in footer (HF4), null links in website (HF5), presence of anchor links in HTML body (HF7) and broken links ratio feature (HF8). Out of the five proposed novel features, four are among the top performing features in detecting phishing sites. It is also observed that each feature used for training our model has an accuracy of over 60%, which reveals the goodness of our features in detecting phishing websites. But it should be noted that the statistics are dependent on the quality and quantity of the training data which is further used in training the model for the classification of websites.

7 Limitations

This section explains the limitations of our model. Firstly, our model purely depends on the quality and quantity of the training set. Hence, training the model with a small dataset that has many duplicate entries may not give effective results in phishing detection.

Our model uses third-party services for classification of websites, resulting in dependency on the speed of third-party services. This problem can be mitigated with high Internet speed and alternative sources. Broken links feature extraction has the limitation of a longer execution time for websites with a large number of links, for example, Wells Fargo, USAA etc.

Our model does not work on websites which use captcha checking before loading, because Jsoup connects to the URL to get the source code directly and does not handle captcha verification. We observed that there is no captcha verification involved in the phishing websites; however, we suspect that attackers might include captchas in phishing websites in the future. It also fails to detect phishing sites containing embedded objects such as javascript, flash and HTML files. Attackers use embedded objects to replace the textual content of a website to bypass the anti-phishing techniques.

8 Conclusion and future work

In this paper, we have proposed a novel model for detection of phishing sites using heuristic features of websites. We compared our work with CANTINA and CANTINA+ to demonstrate the advantage of our model over these approaches in detecting phishing sites. The features used in our model to detect phishing do not depend on an initial image database or prior knowledge such as web history. Our method identifies the phishing sites based on the features extracted from URL, website content and third-party services using machine learning algorithms.

Our model also detects phishing sites which imitate legitimate sites by replacing the website content with an image, which most of the anti-phishing techniques fail to detect. Our model detects zero-day phishing sites which the list-based techniques fail to detect. With the help of this rich set of heuristic features, our model is able to achieve a high accuracy of 99.55% and a low error rate of 0.45% using the oblique Random Forest algorithm.

In the future, we intend to include additional heuristics to detect phishing attacks containing embedded objects such as flash, iframes and HTML files etc. Additional heuristics that exclude third-party services will be investigated to increase the efficiency of our model. This would reduce the dependency on third-party services, thereby reducing latency in the detection process.

Acknowledgements The authors would like to thank Ministry of Electronics & Information Technology (Meity), Government of India for their support in part of the research. We also would like to thank anonymous reviewers for their detailed and useful comments. We also thank Zhang & Suganthan for their correspondences regarding the use of their oRF algorithms.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Ollmann G (2004) The phishing guide. Next Generation Security Software Limited. http://www-935.ibm.com/services/us/iss/pdf/phishing-guide-wp.pdf
2. APWG (2016) Phishing attack trends reports, fourth quarter 2016. http://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf. Accessed 03 Mar 2017
3. Dhamija R, Tygar JD, Hearst M (2006) Why phishing works. In: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, pp 581–590. https://doi.org/10.1145/1124772.1124861
4. APWG (2016) Phishing attack trends reports, first quarter 2016. http://docs.apwg.org/reports/apwg_trends_report_q1_2016.pdf. Accessed 01 June 2016
5. Kaspersky Lab (2014) Spam and phishing trends and statistics report Q1 2014. https://usa.kaspersky.com/internet-security-center/threats/spam-statistics-report-q1-2014. Accessed 15 July 2015
6. Hong J (2012) The state of phishing attacks. Commun ACM 55(1):74–81
7. Cao Y, Han W, Le Y (2008) Anti-phishing based on automated individual white-list. In: Proceedings of the 4th ACM workshop on digital identity management, ACM, pp 51–60
8. Zhang J, Porras PA, Ullrich J (2008) Highly predictive blacklisting. In: USENIX security symposium, pp 107–122
9. Prakash P, Kumar M, Kompella RR, Gupta M (2010) Phishnet: predictive blacklisting to detect phishing attacks. In: INFOCOM, 2010 Proceedings IEEE, IEEE, pp 1–5. https://doi.org/10.1109/INFCOM.2010.5462216
10. Almomani A, Wan TC, Altaher A, Manasrah A (2012) Evolving fuzzy neural network for phishing emails detection. J Comput Sci 8(7):1099
11. Joshi Y, Saklikar S, Das D, Saha S (2008) Phishguard: a browser plug-in for protection from phishing. In: Internet multimedia services architecture and applications, 2008. IMSAA 2008. 2nd International Conference on IEEE, pp 1–6. https://doi.org/10.1109/IMSAA.2008.4753929
12. Chou N, Ledesma R, Teraguchi Y, Mitchell JC, et al (2004) Client-side defense against web-based identity theft. In: NDSS. doi: 10.1.1.65.679, http://www.isoc.org/isoc/conferences/ndss/04/proceedings/Papers/Chou.pdf
13. Shahriar H, Zulkernine M (2012) Trustworthiness testing of phishing websites: a behavior model-based approach. Future Generation Computer Systems 28(8):1258–1271. https://doi.org/10.1016/j.future.2011.02.001, http://www.sciencedirect.com/science/article/pii/S0167739X11000045
14. Rao RS, Ali ST (2015) Phishshield: a desktop application to detect phishing webpages through heuristic approach. Proc Comput Sci 54:147–156. https://doi.org/10.1016/j.procs.2015.06.017
15. Srinivasa Rao R, Pais AR (2017) Detecting phishing websites using automation of human behavior. In: Proceedings of the 3rd ACM workshop on cyber-physical system security, ACM, New York, NY, USA, CPSS '17, pp 33–42. https://doi.org/10.1145/3055186.3055188
16. Fu AY, Wenyin L, Deng X (2006) Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD). IEEE Trans Dependable Secur Comput 3(4):301–311
17. Wenyin L, Huang G, Xiaoyue L, Min Z, Deng X (2005) Detection of phishing webpages based on visual similarity. In: Special interest tracks and posters of the 14th international conference on World Wide Web, ACM, pp 1060–1061
18. Hara M, Yamada A, Miyake Y (2009) Visual similarity-based phishing detection without victim site information. In: Computational intelligence in cyber security, 2009. CICS'09. IEEE symposium on, IEEE, pp 30–36. https://doi.org/10.1109/CICYBS.2009.4925087
19. Rao RS, Ali ST (2015) A computer vision technique to detect phishing attacks. In: Communication systems and network technologies (CSNT), 2015 Fifth international conference on IEEE, pp 596–601. https://doi.org/10.1109/CSNT.2015.68
20. Whittaker C, Ryner B, Nazif M (2010) Large-scale automatic classification of phishing pages. In: NDSS '10. http://www.isoc.org/isoc/conferences/ndss/10/pdf/08.pdf
21. Khonji M, Iraqi Y, Jones A (2013) Phishing detection: a literature survey. IEEE Commun Surv Tutor 15(4):2091–2121. https://doi.org/10.1109/SURV.2013.032213.00009
22. Huh JH, Kim H (2011) Phishing detection with popular search engines: simple and effective. In: International symposium on foundations and practice of security. Springer, pp 194–207. https://doi.org/10.1007/978-3-642-27901-0_15
23. Zhang Y, Hong JI, Cranor LF (2007) Cantina: a content-based approach to detecting phishing web sites. In: Proceedings of the 16th international conference on World Wide Web, ACM, pp 639–648. https://doi.org/10.1145/1242572.1242659, http://dl.acm.org/citation.cfm?id=1242659
24. Pan Y, Ding X (2006) Anomaly based web phishing page detection. In: Proceedings-annual computer security applications conference, ACSAC, vol 6, pp 381–392. https://doi.org/10.1109/ACSAC.2006.13
25. APWG (2014) Global phishing reports 1st half 2014. http://docs.apwg.org/reports/APWG_Global_Phishing_Report_1H_2014.pdf. Accessed 01 June 2016
26. He M, Horng SJ, Fan P, Khan MK, Run RS, Lai JL, Chen RJ, Sutanto A (2011) An efficient phishing webpage detector. Expert systems with applications 38(10):12,018–12,027. https://doi.org/10.1016/j.eswa.2011.01.046, http://www.sciencedirect.com/science/article/pii/S0957417411000662
27. Xiang G, Hong J, Rose CP, Cranor L (2011) Cantina+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur (TISSEC) 14(2):21. https://doi.org/10.1145/2019599.2019606, http://dl.acm.org/citation.cfm?doid=2019599.2019606
28. Miyamoto D, Hazeyama H, Kadobayashi Y (2008) An evaluation of machine learning-based methods for detection of phishing sites. In: International conference on neural information processing. Springer, pp 539–546. https://doi.org/10.1007/978-3-642-02490-0_66
29. Zhang D, Yan Z, Jiang H, Kim T (2014) A domain-feature enhanced classification model for the detection of Chinese phishing e-business websites. Inf Manag 51(7):845–853. https://doi.org/10.1016/j.im.2014.08.003, http://www.sciencedirect.com/science/article/pii/S0378720614001001
30. Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 453–469
31. Zhang L, Suganthan PN (2014) Random forests with ensemble of feature spaces. Pattern Recogn 47(10):3429–3437
32. Zhang L, Suganthan PN (2015) Oblique decision tree ensemble via multisurface proximal support vector machine. IEEE Trans Cybern 45(10):2165–2176
33. Gowtham R, Krishnamurthi I (2014) A comprehensive and efficacious architecture for detecting phishing webpages. Comput Secur 40:23–37. https://doi.org/10.1016/j.cose.2013.10.004, http://www.sciencedirect.com/science/article/pii/S0167404813001442
34. Mohammad RM, Thabtah F, McCluskey L (2014) Predicting phishing websites based on self-structuring neural network. Neural Comput Appl 25(2):443–458
35. Chiew KL, Chang EH, Tiong WK et al (2015) Utilisation of website logo for phishing detection. Comput Secur 54:16–26. https://doi.org/10.1016/j.cose.2015.07.006
36. Moghimi M, Varjani AY (2016) New rule-based phishing detection method. Expert Syst Appl 53:231–242. https://doi.org/10.1016/j.eswa.2016.01.028
37. Aggarwal A, Rajadesingan A, Kumaraguru P (2012) Phishari: automatic realtime phishing detection on twitter. In: eCrime Researchers Summit (eCrime), 2012, IEEE, pp 1–12
38. Abdelhamid N, Ayesh A, Thabtah F (2014) Phishing detection based associative classification data mining. Expert Syst Appl 41(13):5948–5959
39. Dietterich TG (2000) Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer, pp 1–15
40. Ren Y, Zhang L, Suganthan PN (2016) Ensemble classification and regression-recent developments, applications and future directions [review article]. IEEE Comput Intell Mag 11(1):41–53
41. Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems. J Mach Learn Res 15(1):3133–3181
42. Akinyelu AA, Adewumi AO (2014) Classification of phishing email using random forest machine learning technique. J Appl Math 2014:6. https://doi.org/10.1155/2014/425731
43. Fette I, Sadeh N, Tomasic A (2007) Learning to detect phishing emails. In: Proceedings of the 16th international conference on World Wide Web, ACM, pp 649–656
44. Dewan P, Kumaraguru P (2015) Towards automatic real time identification of malicious posts on facebook. In: Privacy, security and trust (PST), 2015 13th Annual Conference on IEEE, pp 85–92
45. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
46. Mangasarian OL, Wild EW (2006) Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans Pattern Anal Mach Intell 28(1):69–74
47. Manwani N, Sastry P (2012) Geometric decision tree. IEEE Trans Syst Man Cybern Part B (Cybernetics) 42(1):181–192
48. Mohammad RM, Thabtah F, McCluskey L (2012) An assessment of features related to phishing websites using an automated technique. In: Internet technology and secured transactions, 2012 international conference for IEEE, pp 492–497
49. Mohammad RM, Thabtah F, McCluskey L (2014) Intelligent rule-based phishing websites classification. IET Inf Secur 8(3):153–160
50. Basnet RB, Sung AH, Liu Q (2011) Rule-based phishing attack detection. In: International conference on security and management (SAM 2011), Las Vegas, NV
51. Garera S, Provos N, Chew M, Rubin AD (2007) A framework for detection and measurement of phishing attacks. In: Proceedings of the 2007 ACM workshop on recurring malcode, ACM, pp 1–8
52. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844