
computers & security 83 (2019) 246–267


Jail-Phish: An improved search engine based phishing detection system

Routhu Srinivasa Rao∗, Alwyn Roshan Pais

Information Security Research Lab, Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, Mangalore 575025, India

Article history: Received 27 April 2018; Revised 18 January 2019; Accepted 24 February 2019; Available online 27 February 2019

Keywords: Phishing; Anti-phishing; Compromised webserver; Similarity measures; Heuristics; Search engine services

Abstract: Stealing of sensitive information (username, password, credit card information, social security number, etc.) using a fake webpage that imitates a trusted website is termed phishing. Recent techniques use a search engine based approach to counter phishing attacks, as it achieves promising detection accuracy. But the limitation of this approach is that it fails when the phishing page is hosted on a compromised server. Moreover, it also results in a low true negative rate when newly registered or non-popular domains are encountered. Hence, in this paper, we propose an application named Jail-Phish, which improves the accuracy of the search engine based techniques with an ability to detect Phishing Sites Hosted on Compromised Servers (PSHCS) and to detect newly registered legitimate sites. Jail-Phish compares the suspicious site and the matched domain in the search results to calculate a similarity score between them. There exists some degree of similarity (logos, favicons, images, scripts, styles, and anchor links) within the pages of the same website, whereas the dissimilarity within the pages is very high in PSHCS. Hence, we use the similarity score between the suspicious site and the matched domain as a parameter to detect PSHCS. From the experimental results, it is observed that Jail-Phish achieved an accuracy of 98.61%, a true positive rate of 97.77% and a false positive rate below 0.64%.

© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

Phishing is carried out by tricking the online user into visiting a fake website that mimics a target legitimate site. The attacker designs a similar webpage, either by copying the legitimate page or by making a little manipulation to it, so that the online user cannot differentiate between the fake and legitimate pages. Phishing attacks can be carried out in many ways, such as email, website, malware, SMS and voice. In this work, we concentrate on detecting website phishing (URL), which is achieved by making the user visit suspicious URLs. This can be achieved by many means, such as sending emails with phishing URLs, advertisements in social networks or blogs, or by embedding phishing URLs in compromised websites.

If we observe the phishing statistics provided by the APWG (2016), the number of phishing attacks reached 1,220,523 in 2016. APWG also confirmed that this is the highest number recorded since 2004. For simplicity, we considered the phishing attacks from 2010 to 2016, as can be seen in Fig. 1. From the figure, it is observed that there is a rise in phishing attacks in recent years. When the impact of phishing attacks is considered, the RSA (2013) online fraud report estimates a loss of over USD $5.9 billion from 450,000 phishing attacks. Another report released by Kaspersky Lab stated that their anti-phishing system was triggered 154,957,897 times on the


∗ Corresponding author.
E-mail addresses: routhsrinivas.cs15fv13@nitk.edu.in (R.S. Rao), alwyn@nitk.ac.in (A.R. Pais).
https://doi.org/10.1016/j.cose.2019.02.011
0167-4048/© 2019 Elsevier Ltd. All rights reserved.
Fig. 2 – Comparison of phishing attacks during 2009 vs 2014 (Mal Reg: maliciously registered domains; Compro Reg: compromised domains).

Fig. 1 – Phishing attacks from 2010 to 2016.

systems of Kaspersky Lab users in 2016. Interestingly, Kaspersky reported that in 2015 the anti-phishing system was triggered in 148,395,446 instances. Note that over this period of one year there was an increase of 6,562,451 instances. These figures reveal that phishing attacks increased year after year from 2010 to 2016, accounting for billions of dollars in loss.

Attackers design phishing sites in various ways, and we classified them into different categories:

• C1: Phishing sites with manipulation of content, including removal of brand names in title, copyright, metadata, etc. and replacement of legitimate hyperlinks with NULL or the attacker's own local link.
• C2: Phishing sites replacing the entire legitimate text with a single image.
• C3: Phishing sites using URL obfuscation techniques (misspellings in brand names, or placing the brand names in either the subdomain or the path using an excess number of dots or a lengthy URL).
• C4: Phishing sites with HTTPS connection to imitate highly legitimate behavior.
• C5: Phishing sites hosted on free website hosting servers.
• C6: Phishing sites hosted on compromised web servers.
• C7: Phishing sites with legitimate content embedded in iframes or flash.

Note that current phishing sites may belong to any one of the above categories, or to more than one category. For example, there can be a phishing site of C2 category with HTTPS protocol (C4) hosted on a compromised web server (C6).

There exist many heuristic based techniques which rely on source code, URL, machine learning and third party services for the detection of phishing sites. Out of these, third party based features have played a vital role in achieving a high true negative rate, as phishing sites are either ranked low or not indexed by the third party services.

The third party services include the use of search engines, page ranking, WHOIS, etc. Many of the recent works (Chiew et al., 2015; Dunlop et al., 2010; Huh and Kim, 2011; Jain and Gupta, 2017; Tan et al., 2016; Varshney et al., 2016a,b; Xiang and Hong, 2009) focused on using search engine results for the detection of phishing sites. The limitation of search engine based techniques is that they fail to detect phishing sites which are hosted on compromised servers. In this case, the phishing pages are termed legitimate sites, leading to a high false negative rate. On the other side, newly registered legitimate sites are classified as phishing sites due to their absence in search results, leading to a high false positive rate.

Interestingly, the number of phishing sites being hosted on compromised servers (PSHCS) is also on the rise. In the recent APWG (2017) first half report, they claimed that there is an increase in the number of phishing sites hosted on compromised domains. According to the Moore and Clayton (2007) study, 76% of designed phishing sites were hosted on compromised servers. If we consider the statistics of various APWG reports collected during the period from 2009 to 2014,¹ it is observed that at least 70% of the phishing sites were hosted on compromised servers every year. We confined the statistics to a graph (Fig. 2) which shows the number of phishing attacks conducted through malicious registrations and compromised domains. From the graph, it is seen that, along with the rise of phishing attacks over the years, the number of phishing sites using compromised servers is also increasing. The statistics of compromised server attacks from 2015 to 2017 are not mentioned in the graph, as they were not available in the APWG reports of 2015 to 2017 (APWG, 2017). Hence, there is a need for an efficient technique to detect the phishing sites that are hosted on compromised servers.

The advantage of hosting a phishing site on a compromised server over a malicious registration is that the designed phishing site carries the same reputation as the compromised domain. The reputation may include page ranking, visibility of the domain in the search engine results page (SERP), and the age of the domain. In addition, there is no requirement of owning the domain or resources for maintaining the phishing pages. Due to the reputation score drawn from the legitimate site, the phishing site is more likely to stay alive for a longer span of time than a maliciously registered phishing domain. While storing the designed phishing kit on the compromised server, attackers leave the compromised domain's pages intact such that no suspicious behavior is observed by the owner of the legitimate site. The drawn reputation of the legitimate site enables the phishing page to bypass the list based techniques (Drew and Moore, 2014; Prakash et al., 2010; Rao and Pais, 2017) and third-party service based detection techniques (Chiew et al., 2015; Moghimi and Varjani, 2016; Rao and Pais, 2018; Xiang et al., 2011; Zhang et al., 2007).

¹ https://www.antiphishing.org/resources/apwg-reports/.
The working of the search engine based technique is as follows. Firstly, a query string is generated from either the unique key descriptors of the suspicious website (Xiang et al., 2011; Xiang and Hong, 2009; Zhang et al., 2007), the domain concatenated with the title (Huh and Kim, 2011; Jain and Gupta, 2017; Varshney et al., 2016b), or an image (Chang et al., 2013; Chiew et al., 2015; 2018). The query string is fed to the search engine, which outputs the relevant results containing anchor links with respect to the query. If the domain of the input URL is present in the search results, it is classified as legitimate, else it is classified as phishing.
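The generic scheme above — query the search engine and test whether the suspicious domain appears in the SERP — can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the `search` callable is a stand-in for a real search API, and all names here are our own.

```python
from urllib.parse import urlparse

def domain_of(url):
    # Normalize: lowercase host with any leading "www." stripped.
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def classify_by_serp(suspicious_url, page_title, search):
    """Generic search engine check: the page is treated as legitimate
    only if its own domain appears among the result links."""
    query = f"{domain_of(suspicious_url)} {page_title}".strip()
    serp_domains = {domain_of(link) for link in search(query)}
    return "legitimate" if domain_of(suspicious_url) in serp_domains else "phishing"

# Toy SERP stub standing in for a real search API.
def fake_serp(query):
    return ["https://www.example.com/login", "https://example.org/"]

print(classify_by_serp("http://example.com/signin", "Example Login", fake_serp))
# -> legitimate (example.com appears in the stubbed results)
```

As the surrounding text notes, this check inverts on compromised hosts: the compromised domain is well indexed, so a phishing page planted there is returned in the SERP and passes as legitimate.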
The search results when queried with title and domain are shown in Fig. 3. In Fig. 3, (a) shows the designed phishing site with the title highlighted in its source code. Note that the phishing site is hosted on a compromised server, as shown in (b). Feeding the query string to the search engine, the domain of the phishing page is returned in the search results, which classifies the input phishing URL as legitimate, as shown in (c). On the other side, a newly registered or high Alexa ranked legitimate site is absent from the search results when queried with title and domain, as shown in Fig. 4. In Fig. 4, (a) shows the legitimate site and its source code with the title highlighted, and (b) shows the absence of the domain in the search results, which leads to the legitimate site being classified as phishing.
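This PSHCS failure mode is what a similarity score can target: pages of the same website share favicons, logos, scripts and anchor links, while a phishing kit dropped on a compromised server shares almost nothing with the host site's own pages. Below is a minimal sketch of the idea, not the paper's exact scoring — the Jaccard measure, the resource sets and the 0.3 threshold are all illustrative assumptions:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pshcs_check(suspicious_resources, host_page_resources, threshold=0.3):
    """Compare resources (favicon/logo/script URLs, anchor-link domains)
    scraped from the suspicious page against those of the matched
    domain's own pages; low overlap suggests a phishing page planted
    on a compromised server (PSHCS)."""
    score = jaccard(set(suspicious_resources), set(host_page_resources))
    return "phishing (PSHCS)" if score < threshold else "legitimate"
```

For example, `pshcs_check({"logo.png", "site.css"}, {"logo.png", "site.css", "app.js"})` passes as legitimate, whereas a dropped-in phishing kit with entirely foreign resources falls below the threshold.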
In this paper, we address the gaps found in the search
engine based techniques and some of the gaps are as follows.

• The search engine based techniques (Ramesh et al., 2014; Tan et al., 2016; Xiang et al., 2011; Zhang et al., 2007) which use identity keywords extracted from copyright, plaintext and header texts fail to extract relevant terms when the text in the phishing site is replaced with an image.
• Many search engine based techniques (Jain and Gupta, 2017; Nguyen et al., 2014; Rao and Pais, 2018; Varshney et al., 2016b) fail when phishing sites hosted on legitimate web servers are encountered. This is due to the fact that these legitimate domains have been running for a long time, and hence there is a high chance that they are indexed by the search engine.
• Some benign sites are not displayed in the search results
when queried with domain + title or website identity
keywords. Based on this behavior, the existing techniques
(Jain and Gupta, 2017; Varshney et al., 2016b) classify
legitimate as phishing.
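For reference, the identity-keyword techniques in the first gap rank a page's terms by TF-IDF before querying the search engine. The toy version below (the smoothed-IDF formula and tiny corpus are illustrative assumptions, not any cited paper's exact formulation) also shows why an image-only page defeats them: with no text, there are no terms to rank.

```python
import math
from collections import Counter

def tfidf_keywords(page_terms, corpus, k=5):
    """Rank a page's terms by tf * idf against a background corpus
    (one term list per document) and return the top-k terms."""
    tf = Counter(page_terms)
    n_docs = len(corpus)

    def idf(term):
        df = sum(term in doc for doc in corpus)          # document frequency
        return math.log((1 + n_docs) / (1 + df)) + 1     # smoothed IDF

    ranked = sorted(tf, key=lambda t: -tf[t] * idf(t))
    return ranked[:k]

corpus = [["login", "account", "home"], ["news", "login"], ["shop", "cart", "login"]]
page = ["paypal", "login", "secure", "paypal", "account"]
print(tfidf_keywords(page, corpus, k=2))   # brand-specific terms rank first
print(tfidf_keywords([], corpus))          # image-only page: nothing to rank
```

Generic terms like "login" appear in every corpus document and get a low IDF, so the brand-bearing terms dominate the lexical signature that is fed to the search engine.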

We address these gaps with a dynamic search query, such that legitimate sites with no title, a vague title, or a repetitive brand in the search query are handled so as to return the query domain in the search results. This dynamic search query also gives the advantage of returning high Alexa ranked legitimate domains in the search results. We also include a similarity based mechanism to counter the phishing sites which are hosted on compromised servers.

Fig. 3 – Phishing site hosted on compromised server.

The contributions of our proposed work are summarized as follows.

• Proposed a dynamic search query string generation for returning the relevant search results.
• Proposed a similarity based mechanism to detect the phishing sites hosted on compromised servers (PSHCS).
• Proposed technique is robust to different URL masquerade techniques and can also detect zero-day phishing attacks.

Fig. 4 – High Alexa ranked legitimate site and its absence in search results when queried with domain + title.

• Proposed technique detects non-English phishing websites, and it does not depend on prior access to a database of target website resources or web history.
• Comprehensive evaluation of the proposed work with fresh and old phishing data shows that Jail-Phish is adaptable for detection of long-lived phishing sites.

The remainder of the paper is organized as follows. The review of existing works that use search engine services for the detection of phishing sites is discussed in Section 2. The proposed work is explained in Section 3. Various experiments are conducted on popular and non-popular legitimate sites and on old and fresh phishing sites, and the results are given in Section 4. The effectiveness and limitations of the proposed work are discussed in Section 5. Finally, we conclude the paper in Section 6.

2. Related work

2.1. Categories of anti-phishing techniques

There exist many anti-phishing techniques to counter phishing attacks. The techniques can be placed into two categories: list based and heuristic based techniques.

2.1.1. List based techniques
These techniques use a list of approved or unapproved resources (URL, DOM, images, styles, digital certificates, etc.) for detecting the phishing status. The techniques with approved resources are called whitelist techniques (Cao et al., 2008; Hara et al., 2009; Liu et al., 2006; Rosiello et al., 2007; Tout and Hafner, 2009) and those with unapproved resources blacklist techniques (Ma et al., 2009; Prakash et al., 2010; Rao and Pais, 2017; Zhang et al., 2008).

The whitelist techniques store a database of approved URLs (Cao et al., 2008), DOMs (Rosiello et al., 2007), images (Fu et al., 2006; Hara et al., 2009; Rao and Ali, 2015; Zhou et al., 2014), styles (Liu et al., 2006; Mao et al., 2017), and digital certificates (Atighetchi and Pal, 2009; Tout and Hafner, 2009) for comparison with the suspicious website to detect phishing sites. If the similarity between the suspicious site and a target legitimate site is above a threshold with mismatched domains, then it is classified as phishing, else legitimate. In the case of a resource like URLs, if the suspicious domain is present in the stored URLs then access is given to the suspicious site, else it is blocked. This behavior classifies non-whitelisted legitimate sites as phishing sites, resulting in high false positives.

On the other side, the blacklist techniques store phishing URLs (Ma et al., 2009; Prakash et al., 2010; Zhang et al., 2008), file contents (CSS, JavaScript, images) (Britt et al., 2012; Drew and Moore, 2014; Rao and Pais, 2017) and DOM tags (Cui et al., 2017) for comparison with the suspicious website. If the similarity between the suspicious site and a target phishing site is above the threshold, then it is classified as phishing, else legitimate. In the case of techniques with a blacklist of URLs, a small change in a blacklisted URL is sufficient to bypass the filtering process. The techniques with a blacklist of file contents (CSS, JS, images) fail when a phishing site is generated with a new (non-blacklisted) phishing toolkit. These kinds of non-blacklisted phishing sites are also called zero-day phishing sites.

2.1.2. Heuristic based techniques
These techniques extract the most common properties from existing phishing sites to detect new phishing sites. But these common properties are not guaranteed to exist in all new phishing sites, which leads to a poorer detection rate compared to list based techniques. An attacker can bypass the heuristic features once he finds out the algorithm or features used in the detection process, thereby reaching his goal of stealing sensitive information. But the advantage of heuristic techniques is that they can detect zero-day phishing attacks, which the list based techniques fail to detect.

The heuristic features are extracted from three locations, namely URL, source code, and third party services.

1. URL-based techniques: These techniques extract features from the URL to detect phishing URLs. Such techniques use count-based features, binary features, and blacklisted words to check if a suspicious URL is legitimate or phishing. Several techniques (Adebowale et al., 2018; Choi et al., 2011; Gastellier-Prevost et al., 2011; Gowtham and Krishnamurthi, 2014; He et al., 2011; Lee and Kim, 2013; Marchal et al., 2016; Moghimi and Varjani, 2016; Mohammad et al., 2012; Sahingoz et al., 2018; Su et al., 2013; Thomas et al., 2011; Wang and Shirley, 2015; Zhang et al., 2017) use count-based features such as the number of characters, hyphens, redirections, numbers and digits. Similarly, some of the techniques (Gastellier-Prevost et al., 2011; Li et al., 2018; Moghimi and Varjani, 2016; Mohammad et al., 2012; Shirazi et al., 2018; Verma and Dyer, 2015; Xu et al., 2013; Zhao and Hoi, 2013) use binary features like the presence of an IP address, special characters (∗, -, /, ?), blacklisted words, brand names, and the HTTPS protocol. Also, there exist techniques (Pao et al., 2012; Verma and Dyer, 2015) which use the frequency distribution of characters in the URL as a parameter to detect the legitimacy of the URL.

2. Source code based techniques: These techniques extract features from the source code of the suspicious URLs. Features such as hyperlink-based features, text-based features, tag-based features, and image-based features are used by various approaches. The hyperlink-based features (Marchal et al., 2017; Rao and Pais, 2018; Shirazi et al., 2018) include the ratio of foreign links, broken links, common URLs and null links. Text-based features (Marchal et al., 2016; Ramesh et al., 2014; Zhang et al., 2007) comprise extraction of prominent keywords from the web page; TF-IDF is one such method to extract prominent keywords. Tag-based features (Rosiello et al., 2007) include extraction of DOM trees from the suspicious website and comparison with an existing DOM tree database. Image-based features include extraction of logos (Chiew et al., 2015), favicons (Chiew et al., 2018), and screenshots (Hara et al., 2009) to identify the legitimacy of the website.

3. Third-party service based techniques: These techniques use third party service features such as WHOIS (Mohammad et al., 2012, 2014; Rao and Pais, 2018; Zhang et al., 2007) (age of the domain, registered date and expiry date of the domain), page rank (Basnet et al., 2011; Garera et al., 2007; Mohammad et al., 2012, 2014) and search engine indexing (Chiew et al., 2015; Dunlop et al., 2010; Jain and Gupta, 2017; Tan et al., 2016; Varshney et al., 2016b; Xiang et al., 2011; Xiang and Hong, 2009) of the website for phishing detection.

Our work falls under the category of heuristics with search engine services. Therefore, before discussing the proposed work, we give a brief description of the working of search engine based techniques. These techniques use search engine results as a base to detect the legitimacy of websites. They extract the keywords, title, copyright and domain from the given website's source code. These elements are used to generate a lexical signature, which is fed to a search engine. The presence of the domain URL in the SERP determines the legitimacy of the website.

Search engine based techniques – Some of the latest and popular search engine based techniques are given below.
Cantina (Zhang et al., 2007) uses the famous Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to extract the unique keywords describing the website. These words are assembled to generate a lexical signature, which is fed to the search engine to output search results. If the domain of the given URL is present in the output search results, then it is classified as legitimate, else phishing.

Xiang and Hong (2009) used title, copyright and TF-IDF terms as a search query to detect the legitimacy of the given URL. The authors used a Named Entity Recognition module to identify the brand name embedded in the copyright, title and TF-IDF keywords. The brand names are given as a search query to the search engine to return the search results. The presence of the suspicious domain in the search results is considered legitimate, else phishing.

Chiew et al. (2015) used the logo as the search image, which is fed to the Google image search interface to identify the legitimacy. The authors used a machine learning algorithm to extract the logo from the given URL. The suspicious domain is checked for presence in the returned search results; if found, it is classified as legitimate, else phishing.

Varshney et al. (2016b) proposed a lightweight search engine technique which uses the domain and title extracted from the given URL for the detection of phishing sites. The authors designed a Chrome extension and claimed that it is lightweight as it requires a single search engine request.

Jain and Gupta (2017) proposed a two-level search engine based technique to detect phishing sites. The authors used a search engine mechanism similar to Varshney et al. (2016b). They applied this mechanism at level one to detect phishing sites, and hyperlink based features at the next level to detect newly registered and unpopular legitimate domains. The hyperlink features include the ratios of local (L) and null ("#") hyperlinks (N) to the total number of hyperlinks (T) in the website. If the ratio of L to T is high and the ratio of N to T is low, then the technique classifies the suspicious website as a legitimate site. But the technique fails when the designed phishing site contains all hyperlinks assigned to a single common local page, leading to L/T being 1. These types of hyperlinks are also termed common hyperlinks (Rao and Pais, 2018).

Ramesh et al. (2014) proposed a method based on the direct and indirect associations with the domain of the suspicious URL. This method uses TF-IDF for the indirect associations and anchor links for the direct associations. The intersection of both these sets determines the status of the website.

Dunlop et al. (2010) proposed a technique which feeds the snapshot of the given website to OCR software for converting the image to text. This technique uses the text as a search query to the search engine and then checks whether the returned results include the domain of the suspicious URL. If the domain is present then the URL is classified as legitimate, else phishing.

Tan et al. (2016) used a weighted URL tokens system and an N-gram model for the extraction of identity keywords from plaintext and URL. These keywords are fed to a search engine and the returned results are checked for the presence of the domain of the suspicious URL. If the domain is present then the suspicious URL is classified as legitimate, else phishing.

Chang et al. (2013) proposed a method which segments a logo from the given suspicious website and feeds it to the Google image search to determine the identity of the website. The keyword returned from the image search is fed to Google text search to get the search results. If the domain of the queried website matches any domain of the search results then it is classified as legitimate, else phishing. The authors extracted the logo by segmenting 1 × 2, 2 × 2 or 3 × 3 size segments from the top left part of the webpage. But this assumption might not be true for all cases, and moreover it does not result in best-fit logo extraction, due to which the search engine might return unwanted results. The results of this work shown in Table 2 include the manually extracted best-fit logo as dataset.

2.2. Review

The existing search engine based techniques use a fixed search query string consisting of either the title, domain, hostname, copyright, prominent keywords of the website extracted with TF-IDF, or their combinations. Unfortunately, these techniques fail to detect phishing sites in different cases. Firstly, the techniques which extract prominent keywords from title, copyright and plaintext using the TF-IDF algorithm fail when non-English websites are encountered. Also, the techniques may not give effective results when brand names are absent or the entire textual content is replaced with an image.

Secondly, techniques using the title appended to the domain as the search query fail when phishing sites are hosted on compromised servers. This is due to the fact that some of the compromised domains are well indexed by search engines because of their greater age and good page rank.

Our proposed methods fill the gaps of identifying phishing sites hosted on compromised servers with the similarity based mechanism, and non-English phishing websites with the dynamic search query, to improve both the true positive and true negative rates.

Unlike list based techniques, the similarity based mechanism in Jail-Phish does not require any database of target website resources (CSS, images, DOM). Our application compares the input suspicious site with the matched domain in the returned search results. This makes our work adaptable in real time due to its lower computation time compared to visual similarity based techniques. The summary of existing techniques with respect to various attributes such as language independence (LI), phishing sites hosted on compromised servers detection (PSHCSD), image based phishing detection (IPD), access time in seconds (AT), limitations, and a brief description of the works is shown in Table 1. We also provide the size of the datasets, their sources and performance attributes such as true positive rate (TPR), true negative rate (TNR) and accuracy, and the types of websites taken for conducting the experimentation, in Table 2.

Most of the existing techniques (Table 2) used only popular legitimate sites, except Jain and Gupta (2017) and Varshney et al. (2016b). Hence, these techniques may fail with new or non-popular websites. Results demonstrate that Jail-Phish provides the highest accuracy among all the previous techniques except Ramesh et al. (2014). But it should be noted that Ramesh et al. (2014) considered only the top 1000 legitimate sites for their experimentation and omitted testing on non-popular or newly registered sites. However, if only new
Table 1 – Summary and comparison of existing works with our work.

Technique | Brief Description | LI | PSHCSD | IPD | AT | Limitations
Zhang et al. (2007) | Top TF-IDF keywords are extracted from plaintext to form the search query; the returned search results are used to detect phishing sites. | No | No | No | – | –
Xiang and Hong (2009) | Used title, copyright and TF-IDF terms as a search query to detect the legitimacy of the given URL. | No | Yes | Yes | – | –
Ramesh et al. (2014) | Proposed a method based on the direct and indirect associations with the domain of the suspicious URL. | No | Yes | No | 27.2 s | –
Tan et al. (2016) | Used weighted URL tokens and an N-gram model for the extraction of identity keywords from plaintext and URL; the keywords are fed to a search engine for phishing detection. | No | Yes | No | – | –
Dunlop et al. (2010) | Feeds text converted from the website screenshot as a search query to the search engine to check the legitimacy. | Yes | Yes | Yes | 4.31 s | Performance depends greatly on the accuracy and significance of the OCR-extracted text from the screenshot. Moreover, the entire text of the webpage used to determine the website might also include unwanted text.
Chang et al. (2013) | Used a manually extracted best-fit logo as a search image for extracting related keywords of the suspicious site; the keywords are fed to a search engine to detect the status of the website. | Yes | Yes | Yes | – | –
Chiew et al. (2015) | Used the logo as a query to Google image search for identifying the phishing status. | Yes | Yes | Yes | – | –
Chiew et al. (2018) | Used the favicon as a search image query to the search engine to detect the legitimate status of the website. | Yes | Yes | Yes | – | –
Varshney et al. (2016b) | Proposed a lightweight phishing detection which uses search engine results with title and domain as the search query. | No | No | Yes | 1.53 s | May fail when non-English sites, newly registered sites, or sites hosted on compromised legitimate servers are encountered.
Jain and Gupta (2017) | Proposed a two-level search engine based technique which uses the search engine at level one with the query domain + title, and hyperlink based features at the next level to increase accuracy and TNR. | Yes | No | Yes | 2.36 s | The hyperlink matching filter reduces the effect of the search engine filter; it might improve TNR but reduce TPR when it encounters a phishing site with common hyperlinks.
Jail-Phish | The target website is identified with the search engine using a dynamic search query, and the similarity mechanism is applied between the suspicious site and the matched domain in the search results. | Yes | Yes | Yes | 4.23 s | Fails when the PSHCS itself is indexed by the search engine, as shown in Figure 7.
Table 2 – Dataset source and performance metrics of existing works with our work.

Technique | Dataset (L / P) | Dataset source (L / P) | TPR | TNR | ACC | Websites taken
Zhang et al. (2007) | 100 / 100 | 3Sharp / PhishTank | 97 | 94 | 95 | only English
Xiang and Hong (2009) | 3543 / 7906 | Alexa, Yahoo's Bank Directory, 3Sharp, Google inurl search, PhishTank / PhishTank | 90.06 | 98.05 | 94.05 | only English
Dunlop et al. (2010) | 100 / 100 | random websites, PhishTank / PhishTank | 100 | 57.6 | 78.8 | all kinds
Chang et al. (2013) | 50 / 400 | Alexa / PhishTank | 92.5 | 100 | 96.25 | all kinds
Ramesh et al. (2014) | 1200 / 3374 | Alexa, Google's top 1000, Millersmile, Netcraft / PhishTank, Reasonables.com | 99.67 | 99.5 | 99.62 | only English
Chiew et al. (2015) | 500 / 500 | Alexa / PhishTank | 99.8 | 87 | 93.4 | only logo sites
Tan et al. (2016) | 5000 / 5000 | Alexa / PhishTank, OpenPhish | 99.68 | 92.52 | 96.1 | all kinds
Varshney et al. (2016b) | 500 / 500 | Alexa / PhishTank | 99.5 | 92.5 | 95.95 | all kinds
Jain and Gupta (2017) | 2000 / 2000 | Alexa / PhishTank, OpenPhish | 96.1 | 99.95 | 98.05 | all kinds
Chiew et al. (2018) | 5000 / 5000 | Alexa / PhishTank | 96.93 | 95.87 | 96.4 | only favicon sites
Jail-Phish | 6067 / 5384 | Alexa / PhishTank | 97.77 | 99.36 | 98.61 | all kinds

phishing sites (PD1) and top most legitimate sites (LD1) are considered, as in Ramesh et al. (2014), then Jail-Phish achieves an accuracy of 99.849%. Moreover, that technique also relies on the textual content of the website for the extraction of website identity keywords using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.

In addition to Ramesh et al. (2014), there exist other techniques (Tan et al., 2016; Xiang et al., 2011; Xiang and Hong, 2009; Zhang et al., 2007) which use the TF-IDF algorithm to extract unique, prominent keywords from the textual content of a given website. Though the TF-IDF algorithm is powerful, it has certain limitations. Firstly, it does not always guarantee the extraction of the correct set of keywords, and the techniques fail when wrong or unwanted keywords are fed as the search query. Secondly, it needs a prior corpus of documents (websites) for the calculation of the inverse document frequency (IDF). There exist techniques (Verma et al., 2012a,b) which skip maintaining the corpus and instead use the number of search engine results to calculate the IDF score, but invoking search engine services to compute the IDF of each word is highly time consuming. Thirdly, the techniques fail when a phishing site is designed with its text replaced by an image, as shown in Fig. 5: image A is inserted as the background image in the body of the HTML (as shown in C), and on top of image A, input fields and login buttons (i.e., B) are embedded to produce the final phishing page C. These kinds of phishing sites are termed image based phishing sites.

The logo (Chiew et al., 2015), favicon (Chiew et al., 2018) and screenshot (Dunlop et al., 2010) search based techniques counter image based phishing sites, but they need an effective segmentation algorithm for logo extraction and a prior database of the target websites' logos, favicons or screenshots. Moreover, these techniques fail when non-logo or non-favicon websites are encountered. Calculating the similarity of the logos or images of a suspicious site against a legitimate image database incurs high computation and storage costs, which makes the techniques unsuitable for real time use. Jail-Phish is similar to Varshney et al. (2016b) and Jain and Gupta (2017) with respect to the initial search query (Domain + Title), but these techniques are limited in detecting phishing sites hosted on compromised legitimate servers. The technique of Varshney et al. (2016b) fails when non-English websites are encountered, whereas the work of Jain and Gupta (2017) fails when a phishing site is designed with common hyperlinks. Due to the use of a dynamic search query and an additional similarity computation module, Jail-Phish counters PSHCS. It also does not rely on textual content for search query generation; hence it is robust to image based phishing sites too.

3. Proposed work

3.1. Motivation

Consider a phishing site, hosted on a compromised domain, with a generic title (log in, sign in, untitled or home) or no title, that is visited by an online user. The existing search engine based techniques classify the site as legitimate due to the presence of the suspicious domain in the SERP, as shown in Fig. 4. In the figure, the domain and title of the phishing page are fed as a search query to the Google search engine to get the search results. In the current techniques, if the queried domain matches one of the domains in the search results, the visited page is classified as a legitimate page. It is noted that more than 75% of the phishing pages are hosted on compromised domains, as shown in Fig. 2. Hence, there is a need for a technique which counters phishing pages hosted on compromised servers. Recently, many search engine based techniques (Chiew et al., 2015; 2018; Huh and Kim, 2011; Jain and Gupta, 2017; Varshney et al., 2016b; Xiang et al., 2011) have been proposed to counter phishing attacks. Unfortunately, some of them (Varshney et al., 2016b) fail to detect non-English phishing sites, and most of them (Huh and Kim, 2011; Jain and Gupta, 2017; Varshney et al., 2016b; Xiang et al., 2011) fail to detect phishing sites hosted on a compromised server. In Jain and Gupta (2017), the authors additionally included hyperlink based features to improve

Fig. 5 – Phishing site with design of text replacement with screenshot. (A: background image https://gator3231.hostgator.com/~cheaks/bost/FTEROO09K/img/bg.PNG inserted in the body section of the HTML; B: overlaid input fields and login buttons; A + B = C: the resulting page at https://gator3231.hostgator.com/~cheaks/bost/FTEROO09K/index.php.)
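For reference, the TF-IDF scoring on which the keyword-based techniques rely can be sketched as follows. This is an illustrative implementation, not the paper's code; it also shows that a page such as the one in Fig. 5, whose visible text lives entirely inside an image, yields no usable keywords.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // Term frequency of each word in one document.
    static Map<String, Integer> tf(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : doc.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // TF-IDF score of a term: tf * log(N / df), with df taken from a
    // prior corpus -- the corpus dependency the text points out as a drawback.
    static double score(String term, String doc, List<String> corpus) {
        int df = 0;
        for (String d : corpus) if (tf(d).containsKey(term.toLowerCase())) df++;
        if (df == 0) return 0.0;
        int termFreq = tf(doc).getOrDefault(term.toLowerCase(), 0);
        return termFreq * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("paypal login secure", "bank login", "news sports");
        // An image-only phishing page exposes no extractable text at all:
        System.out.println(score("paypal", "", corpus)); // 0.0
    }
}
```

An empty (image-only) document produces a zero score for every brand term, so a keyword-driven search query cannot be formed for such pages.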

the true negative rate (legitimate to legitimate), but this additional feature increased the false negative rate (phishing to legitimate). The hyperlink features are not guaranteed to exist in all phishing sites: there exist phishing sites which have either all links redirected to a common page on their local domain, or zero links.
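The two degenerate link structures just mentioned (every anchor pointing to one common target, or no anchors at all) are easy to test for. The following is an illustrative Java sketch under our own naming, using a simple regex in place of a full HTML parser:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyperlinkCheck {
    private static final Pattern HREF = Pattern.compile("<a[^>]+href=[\"']([^\"']*)[\"']");

    // Collect the href values of all anchor tags in the HTML source.
    static List<String> anchors(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    // True when the page has zero links or all links share one target --
    // exactly the structure that defeats hyperlink-based features.
    static boolean suspiciousLinkStructure(String html) {
        Set<String> targets = new HashSet<>(anchors(html));
        return targets.size() <= 1;
    }

    public static void main(String[] args) {
        String phish = "<a href='#'>Help</a><a href='#'>Terms</a><a href='#'>Privacy</a>";
        String legit = "<a href='/about'>About</a><a href='/contact'>Contact</a>";
        System.out.println(suspiciousLinkStructure(phish)); // true
        System.out.println(suspiciousLinkStructure(legit)); // false
    }
}
```

A detector relying on hyperlink diversity would find nothing to score on such pages, which is why hyperlink features alone raise the false negative rate.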
Note that all the above-mentioned techniques collected phishing sites dated on the day of experimentation. If long-lived phishing sites hosted on compromised servers are considered for detection, then the existing techniques fail in most cases. We term such long-lived phishing sites old phish. Note that old phish contains long-lived phish hosted on both compromised and non-compromised servers. The existing techniques fail with old phish because, when the compromised domain is queried as a search string, there is a high chance of it being returned in the search results. As illustrated in Fig. 3, (a) shows a phishing page hosted on a compromised domain, and (c) shows that when the compromised domain is fed as a search string to Google, the compromised domain is returned in the search results. These kinds of phishing pages are classified as legitimate pages by the existing works.
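The failure mode can be made concrete: a SERP check that declares a page legitimate whenever its domain appears among the result domains will accept any long-lived, indexed compromised domain. Below is an illustrative Java sketch of such a naive check (class, method names and URLs are hypothetical, not from any cited implementation):

```java
import java.net.URI;
import java.util.List;

public class SerpCheck {
    // Extract the host part of a URL, e.g. "blog.example.com".
    static String host(String url) {
        return URI.create(url).getHost();
    }

    // Naive decision rule of existing search engine based detectors:
    // the page is treated as legitimate if its domain occurs in the results.
    static boolean domainInResults(String queryUrl, List<String> resultUrls) {
        String queryHost = host(queryUrl);
        return resultUrls.stream().anyMatch(r -> host(r).equalsIgnoreCase(queryHost));
    }

    public static void main(String[] args) {
        // A long-lived compromised domain is indexed, so the check passes
        // even though the visited path actually hosts a phishing kit.
        List<String> serp = List.of("https://victim-blog.example/", "https://other.example/");
        System.out.println(domainInResults("https://victim-blog.example/~kit/login.php", serp)); // true
    }
}
```

Jail-Phish keeps this check as a first filter but adds the similarity computation of Stage 5 precisely because the rule above cannot separate a compromised host from a genuinely legitimate one.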

Algorithm 1: Algorithm of Jail-Phish.

Input: URL of the webpage
Output: Status of the webpage
1: Legit_Status ← False, Domain_Status ← False
2: Query_Domain ← Extract_Domain(URL)
3: Parse the source code extracted from URL
4: Query_URL_DOM ← getDOM(URL)
5: Query_baseURL ← getbaseURL(URL)
6: Query_Title ← Extract_title(Source_Code)
7: if Query_Title != null then
8:    Search_query ← Query_Domain + Query_Title
9:    Top_10_Search_results[] ← Perform_search(Search_query)
10:   ∀ n ∈ Top_10_Search_results
11:     Domain_i ← Extract_Domain(n)
12:     baseURL_i ← getbaseURL(n)
13:   for i ← 1 to 10 do
14:     if Query_Domain == Domain_i then
15:       Domain_Status ← True
16:       if Query_baseURL == baseURL_i then
17:         Announce Legitimate
18:       end if
19:       Matched_Domain_DOM ← getDOM(n) /* where n is the URL of the matched domain in the search results */
20:     end if
21:   end for
22:   if Domain_Status == False then
23:     Search_Query ← Query_Domain
24:   else
25:     goto step 40
26:   end if
27: else
28:   Search_Query ← Query_Domain
29: end if
30: Top_10_Search_results[] ← Perform_search(Search_query)
31: ∀ n ∈ Top_10_Search_results
32:   Domain_i ← Extract_Domain(n)
33: for i ← 1 to 10 do
34:   if Query_Domain == Domain_i then
35:     Domain_Status ← True
36:     Matched_Domain_DOM ← getDOM(n)
37:   end if
38: end for
39: if Domain_Status == True then
40:   Similarity_Score ← Compute_Similarity(Matched_Domain_DOM, Query_URL_DOM)
41:   if Similarity_Score > Threshold then
42:     Legit_Status ← True
43:   else
44:     Legit_Status ← False
45:   end if
46:   if Legit_Status == True then
47:     Announce Legitimate
48:   else
49:     Announce Phishing
50:   end if
51: else
52:   Announce Phishing
53: end if

3.2. Observations on domain + title as a search query

In legitimate sites,

• When queried with domain and title, some non-English sites are not shown in the search results, but they do appear in the search results when queried with only the domain.
• TLD problem: when a domain with a ccTLD (country code top level domain) is queried, the same domain with a gTLD (global top level domain) is shown in the search results.
• If the brand name is absent from the title, or the keywords in the title are highly competitive, then the query domain does not appear in the search results.

In phishing sites,

• When the title is absent in phishing sites hosted on compromised servers (PSHCS), the query domain is shown in the search results.
• When the titles of PSHCS contain regular keywords (sign in, log in, login, home, untitled), the query domain is shown in the search results.

3.3. Jail-Phish

The basic idea of our work is simple but effective in detecting phishing sites, and is explained as follows.

Our tool is implemented in Java. It takes a URL as input and extracts the title, domain and source code of the website. The extracted domain and title are appended to generate a search query. On feeding the search query, the top n search results are considered for checking the query domain in the search results. If the query domain is present in the search results, then the similarity between the query URL content and the matched domain in the SERP is calculated. If the similarity is above a threshold, the site is classified as legitimate; otherwise it is a PSHCS.

The other possibility is that the query domain is absent from the search results, in which case the site is eventually classified as phishing. This assumption can also lead to incorrect classification of newly registered or non-popular legitimate sites as phishing, but due to the usage of a dynamic search query the false positive rate (legitimate to phishing) is reduced to a major extent. The method is valid due to the fact that legitimate sites are more likely to be indexed by most search engines, whereas phishing sites are not, due to their short lifespan and zero incoming links.

3.3.1. The objectives of our proposed work are given as follows

• Language independent solution – The existing search-engine based techniques (Marchal et al., 2016; Ramesh et al., 2014; Xiang and Hong, 2009; Zhang et al., 2007) use website identity keywords as a search query for the identification of the target website or the detection of the legitimacy of the website. Unfortunately, popular search engines like Google, Bing and Yahoo do not provide the desired results in the case of non-English keywords in the query string. Therefore there is a need for a technique which can query non-English keywords.
• Efficient query formation – As mentioned in the gaps in the previous section, if the search query contains a long URL or non-related keywords, the search engine might return undesired or empty results, which might lead to false positives or

false negatives. Hence there is a need for an efficient and dynamic query corresponding to different types of websites.
• Detection of compromised domains – The existing techniques (Jain and Gupta, 2017; Ramesh et al., 2014; Varshney et al., 2016b) fail to detect phishing sites hosted on compromised servers. As compromised domains are legitimate and live for a long time, they appear in the search results when queried with domain and title, or with only the domain.
• Client-side solution – The techniques should include lightweight features such that they are adaptable at the client side. Some techniques (Chang et al., 2013; Chiew et al., 2015; 2018; Dunlop et al., 2010) include logos, segmentation of images, or image-to-text conversion as a search query for the detection. But these incur more computation cost for efficient logo extraction and segmentation, and may even lead to poor results when non-logo or non-English websites are encountered.

Fig. 6 – Comparison of TNR at different Threshold (x-axis: Threshold (T), 1–15; y-axis: Percentage, 91–99; series: TNR).

3.3.2. Stages in Jail-Phish

We divide our proposed work into six stages, given as follows.

Stage 1: Extraction of domain and title – In this module, the title and domain are extracted from the given suspicious URL. We have used Selenium WebDriver2 to visit the given URL, and the source code of the suspicious URL is extracted by the same WebDriver. Selenium is a web browser automation tool which opens a Firefox or Chrome browser externally to perform browser-like actions. The Jsoup library is used to parse the source code to extract the title and other relevant DOM elements. Jsoup3 is a Java library used to parse and manipulate the HTML code of a given website. The reason for choosing the Jsoup library is the availability of an API which provides functionality to select elements of the Document Object Model (DOM); it also handles non-well-formed HTML very effectively. We have also used the Google Guava library for the extraction of the domain from the URL. The baseURL is calculated by appending the hostname and pathname of the given URL, and is further used for the classification of websites.

Stage 2: Search query string preparation – This module provides a dynamic search query string for different kinds of websites. As discussed in the previous section, we choose domain + title as the search query for performing the searching mechanism. Generally, the domain gives the brand name of the respective website, and the title gives a brief description of the website to online users and search engines. Based on the experimentation, we observed some issues pertaining to the search query for the Google search engine, which are given in Section 3.2.

Overall, we conclude that the title is a must in the search query for some legitimate sites and should be skipped for others. Hence, we consider the search query as either domain + title or domain to obtain the results. But the limitation of choosing only the domain is that phishing sites hosted on compromised servers will also be returned in the search results, which leads to a high false negative rate (phishing to legitimate). Also, domains with regular keywords (sign in, log in, login, untitled, home) as titles in PSHCS result in false negatives (FNR). This limitation is countered by the similarity based mechanism discussed in Stage 5.

Initially, domain + title is considered as the search query. This search query is modified in two instances. Firstly, if the domains of the returned search results do not match the query domain, the search query is changed to only the query domain. Secondly, if there is no title in the source code of the given URL, the search query is likewise changed to only the query domain.

Stage 3: Search processing – This stage feeds the assembled search query to a search engine to return the relevant search results. Google is used as the search engine for our experimentation. Our technique is adaptable to any search engine, such as Bing, Yahoo or Baidu, but based on the prior studies by Huh and Kim (2011), Varshney et al. (2016b) and Jain and Gupta (2017) in the anti-phishing domain, Google is chosen for the experimentation. The number of returned results varies with the query string. From the experimental results, we observed that the optimal value of the search results threshold is T = 10, as shown in Fig. 6. We considered 1500 legitimate sites randomly chosen from LD1–LD3 and 1500 phishing sites from PhishTank as the dataset for this experimentation. To find the optimal T, we considered T values ranging from 1 to 15 with search query = domain + title to calculate the TNR on the dataset. It is observed that Jail-Phish achieved a maximum TNR of 97.87% at threshold T = 10. We also conducted an experiment on phishing sites to find the optimal T with the greatest TPR; from the results, a TPR of 99.67% was achieved at all values of T. Hence, we chose the search results threshold as 10. It should be noted that a small threshold may miss the matching domain, which might lead to a high false positive rate, while a large threshold may result in unnecessary matching of search results, thus increasing the computation cost. The majority of the existing techniques (Jain and Gupta, 2017; Ramesh et al., 2014; Varshney et al., 2016b) used 10 or fewer search results. Due to the shutdown of Google global site search,4 we have automated the Google searching process using Selenium WebDriver, and the search results are parsed using Jsoup.

2 http://docs.seleniumhq.org/download/.
3 https://jsoup.org/.
4 https://enterprise.google.com/search/products/gss.html.
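The query-preparation rules of Stage 2 (domain + title first, falling back to the bare domain) can be sketched in a few lines of Java. This is an illustrative reconstruction; the method names are ours, not from the paper's code:

```java
public class QueryBuilder {
    // Stage 2: the first search query is domain + title when the page
    // has a usable title, otherwise just the domain.
    static String initialQuery(String domain, String title) {
        if (title == null || title.isBlank()) return domain;
        return domain + " " + title;
    }

    // If the first round of results does not contain the query domain,
    // the query is reset to the domain alone and the search is repeated.
    static String fallbackQuery(String domain) {
        return domain;
    }

    public static void main(String[] args) {
        System.out.println(initialQuery("example.com", "Example Bank - Login")); // "example.com Example Bank - Login"
        System.out.println(initialQuery("example.com", ""));                     // "example.com"
    }
}
```

Keeping the fallback as a separate step mirrors the two-round search of Algorithm 1 (steps 7–29), where the second round runs only after the domain + title round fails to match.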

Stage 4: Initial decision making – On feeding the query string to Google, if the query domain is returned in the top 10 search results, the domain status is set to true, which indicates that the query domain is either a pure legitimate page or a PSHCS. To identify pure legitimacy, the baseURL of the matched domain is compared with the baseURL of the query domain. If they match, the queried URL is classified as legitimate and the execution of further steps is skipped; otherwise the query domain undergoes the next filtering. As most popular legitimate sites are well indexed in Google, they fall into the above category, i.e., they are detected early as legitimate sites, as shown in Fig. 9. To differentiate between a legitimate page and a PSHCS, we perform a similarity computation between the matched domain and the given suspicious page, as explained in Stage 5. There exist some query domains which match the primary domains in the search results but differ in the country code top level domain (ccTLD) or global top level domain (gTLD). These kinds of websites also undergo the similarity computation stage. Note that websites with the same primary domain but a different ccTLD or gTLD are not guaranteed to belong to the same organization or company. The combination of the primary domain and the gTLD or ccTLD (or both) constitutes the domain of the URL. For example, in the URL https://www.regist.login.nextpm.com.ru/web/log/secure/index.php, the protocol is https, the subdomain is regist.login, the primary domain is nextpm, the gTLD is com, the ccTLD is ru, the domain is nextpm.com.ru, the hostname is regist.login.nextpm.com.ru and the pathname is web/log/secure/index.php.

If the query domain does not match any domain of the returned results, the query string is reset to only the domain to perform search processing again (Stage 3). Even then, if the query domain is absent from the returned search results, the site is classified as phishing; otherwise the matched domain URL undergoes further stages of filtering.

Stage 5: Similarity computation – This module checks whether the given URL is a phishing page hosted on a compromised server. Usually, there exists a level of similarity between the pages within a website; this hypothesis is used to detect PSHCS. There exist some resources, such as JavaScript, CSS, images and logos, common to all the pages within a website. Hence, we extract these resource files from the suspicious website S and from the matched domain M in the search results to calculate the similarity between them. The similarity computation mechanism is given in Algorithm 2, and the similarity based features are given below.

F1: URL – We extract all local URLs corresponding to S and M from their respective source codes. The local anchorlinks are extracted by inspecting a[href] and are saved in a separate set for each page.

F2: Styles – We extract all local styles corresponding to M and S by inspecting link[href]. Note that these style files have CSS extensions in the link URL. The style files describe the visual look of the HTML elements displayed in the browser.

F3: Javascript – We extract all the JavaScript files embedded in both pages and store them in separate sets. These files have the js extension and are extracted by inspecting the src attribute of the script tag, i.e., script[src]. These files provide dynamic actions to the website.

Algorithm 2: Similarity computation algorithm of Jail-Phish.

Input: Matched_Domain_DOM, Query_URL_DOM
Output: Similarity_Score
1: Feed the Matched_Domain_DOM to the Jsoup API for parsing the source code.
2: Extract the similarity based features (URL, Styles, Javascript, Images) required for the detection of a phishing website hosted on a compromised server.
   a) Save all the features in an ArrayList where each feature consists of a set of words.
3: Feed the Query_URL_DOM to the Jsoup API for parsing the source code.
4: Repeat step 2.
5: Combine all the resources with respect to each page, i.e., matched page M and suspicious page S, into two different sets.
6: Calculate the Jaccard similarity coefficient for M and S:

   J(M, S) = |M ∩ S| / (|M| + |S| − |M ∩ S|)   (1)

   where J(M, S) ∈ [0, 1] and J(M, S) = 1 when both M and S are empty.
7: Return Similarity_Score = J(M, S).

F4: Images – We extract all the images corresponding to both pages by inspecting the src attribute of the img tag, i.e., img[src]. These files can have various extensions such as png, ico, jpeg, jpg, etc. The pages within a website mostly share some common images, like logos or favicons.

Once all these resource URLs from JS, CSS, images and anchorlinks are extracted, they are combined to form two large sets corresponding to M and S. These two sets are fed as input to the Jaccard similarity module to calculate the similarity between the pages.

Stage 6: Final decision making – In this module, the final decision on the status of the given URL is taken based on the similarity score. If the similarity score is above the threshold, the URL is classified as legitimate. For simplicity, we have taken the threshold as 0, i.e., if the similarity score is above 0, indicating that there exists some level of similarity, the URL is classified as legitimate; otherwise it is classified as a phishing site hosted on a compromised server. The reason for choosing the threshold as 0 is that legitimate pages within a website share some level of similarity, including images, scripts, anchorlinks or styles. We are aware that this threshold also lets phishing sites bypass the detection technique when they reuse the resources of the compromised legitimate server in the designed phishing site; however, we observed zero instances of phishing sites which reused the files of the compromised legitimate server. The other reason for choosing the threshold as 0 is to reduce the number of false positives, because there exist some legitimate sites with a low level of similarity between the pages within the website. For example, the similarity between the home page and the about page of the same website may not include a large number of common files (scripts, anchorlinks,

Table 3 – Collected dataset information.

                                       | LD1     | LD2           | LD3                | PD1              | PD2
Raw instances                          | 1000    | 3000          | 3000               | 3770             | 7170
Invalid websites including duplicates  | 150     | 369           | 414                | 1115             | 4441
Final dataset                          | 850     | 2631          | 2586               | 2655             | 2729
Alexa ranking (for legitimate)         | 1–1000  | 1000–100,000  | 500,000–1,000,000  | –                | –
Phishing submission dates              | –       | –             | –                  | 2–10 Sept, 2018  | 01 Jan–30 Nov, 2017
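The similarity score of Algorithm 2 (Eq. (1)) is a Jaccard coefficient over the combined resource sets of the matched page M and the suspicious page S. A minimal illustrative Java sketch (not the authors' implementation; the resource paths in main are invented for the example):

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardSimilarity {
    // J(M, S) = |M ∩ S| / (|M| + |S| - |M ∩ S|); defined as 1 when both
    // resource sets are empty, as specified in Algorithm 2.
    static double jaccard(Set<String> m, Set<String> s) {
        if (m.isEmpty() && s.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(m);
        inter.retainAll(s);
        return (double) inter.size() / (m.size() + s.size() - inter.size());
    }

    public static void main(String[] args) {
        // Resources (CSS, JS, images, anchorlinks) of the matched domain M
        // and of a suspicious page S hosted on the same server.
        Set<String> m = Set.of("/css/site.css", "/js/app.js", "/img/logo.png");
        Set<String> s = Set.of("/css/site.css", "/img/logo.png", "/kit/fake.js");
        System.out.println(jaccard(m, s)); // 0.5 (2 shared resources of 4 distinct)
        // A phishing kit that shares nothing with its host site scores 0,
        // which falls at the Stage 6 threshold and is flagged as PSHCS.
        System.out.println(jaccard(m, Set.of("/kit/a.js", "/kit/b.css"))); // 0.0
    }
}
```

Any non-zero overlap in resources is enough to clear the Stage 6 threshold of 0, matching the design rationale discussed in Stage 6.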

CSS or images). Hence, Jail-Phish is designed to uncover some phishing sites but not to misclassify legitimate sites as phishing. Interestingly, the selected threshold score resulted in a promising true negative rate of 99.38% and true positive rate of 97.53%, as explained in the next section.

4. Experimental evaluation

4.1. Dataset

To evaluate the performance of Jail-Phish, we used real datasets from two sources. The phishing dataset is extracted from PhishTank and the legitimate dataset is collected from Alexa. The properties of the datasets are given in Table 3. The phishing dataset is collected at different time intervals to check whether the techniques are adaptable to both old and live phishing sites. Similarly, the legitimate datasets are collected at different ranges of Alexa page ranks to check the performance of our application with popular, medium and newly registered domains.

We divided our datasets into five sets, namely

• LD1 – list of low ranked legitimate sites with Alexa rank ranging from 1 to 1000.
• LD2 – list of medium ranked legitimate sites with Alexa rank ranging from 1000 to 100,000.
• LD3 – list of high ranked legitimate sites with Alexa rank ranging from 500,000 to 1,000,000.
• PD1 – list of new phishing sites collected during 2–10 Sept, 2018.
• PD2 – list of old phishing sites collected during 1 Jan–30 Nov, 2017.

Note that low Alexa ranked sites are popular websites, whereas high Alexa ranked sites are either unpopular or newly registered sites.

4.2. Evaluation metrics

We used seven traditional metrics to assess the performance of our system. We considered the condition positive as phishing (P) and negative as legitimate (L). A correctly predicted phishing site is termed a hit or true positive (TP), a correctly predicted legitimate site a true negative (TN), an incorrectly predicted legitimate site a false positive (FP), and an incorrectly predicted phishing site a false negative (FN). For a better system, false positives should be minimal and true positives should be maximal.

• True Positive Rate (TPR): % of correctly predicted phishing sites (TP) out of the total number of phishing sites. It is also termed Sensitivity or Recall.

  TPR = TP / (TP + FN) * 100   (1)

• True Negative Rate (TNR): % of correctly predicted legitimate sites (TN) out of the total number of legitimate sites. It is also termed Specificity.

  TNR = TN / (TN + FP) * 100   (2)

• False Positive Rate (FPR): % of incorrectly predicted legitimate sites (FP) out of the total number of legitimate sites.

  FPR = FP / (FP + TN) * 100 = 100 − Specificity   (3)

• False Negative Rate (FNR): % of incorrectly predicted phishing sites (FN) out of the total number of phishing sites.

  FNR = FN / (FN + TP) * 100 = 100 − Recall   (4)

• Accuracy (Acc): % of correctly predicted legitimate and phishing sites out of the total number of websites.

  Acc = (TP + TN) / (TP + FP + TN + FN) * 100 = (TP + TN) / (P + L) * 100   (5)

• Precision (Pre): % of correctly predicted phishing sites out of the total number of predicted phishing sites.

  Pre = TP / (TP + FP) * 100   (6)

• F-Measure: the harmonic mean of Precision and Recall. The F-Measure is always nearer to the smaller of Precision and Recall; it is a combined measure that assesses the precision–recall tradeoff.

  F = 2 * (Pre * Recall) / (Pre + Recall)   (7)

4.3. Experiment 1: Evaluation of Jail-Phish with legitimate datasets

In this experiment, we attempt to identify the true negative rate of our application. The experiment is carried out using the different legitimate sets LD1, LD2 and LD3. The motive for conducting the experiment on different variations of legitimate
computers & security 83 (2019) 246–267 259

Table 4 – Evaluation of Jail-Phish with LD1, LD2 and LD3 and comparison with M1 and M2.

Datasets                             |        Jail-Phish         |            M1             |            M2
                                     | LD1     LD2      LD3      | LD1      LD2      LD3     | LD1      LD2      LD3
Final number of instances            | 850     2631     2586     | 850      2631     2586    | 850      2631     2586
Non-English instances                | 284     1323     1383     | 284      1323     1383    | 284      1323     1383
Misclassified instances              | 0       16       23       | 20       226      414     | 7        98       163
Misclassified non-English instances  | 0       7        9        | 5        119      220     | 1        42       75
False Positive Rate (%)              | 0       0.6082   0.8894   | 2.3529   8.5899   16.01   | 0.8235   3.7248   6.3032
True Negative Rate (%)               | 100     99.3918  99.1106  | 97.6471  91.4101  83.99   | 99.1765  96.2752  93.6968
Average TNR (%)                      |         99.5008           |          91.0157          |          96.3828
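The metric definitions of Section 4.2 translate directly into code. A minimal, illustrative Java helper follows; the confusion counts in main are invented for the example and are not taken from the paper's tables:

```java
public class Metrics {
    // Evaluation metrics of Section 4.2, computed from the confusion
    // counts (condition positive = phishing, negative = legitimate).
    static double tpr(int tp, int fn) { return 100.0 * tp / (tp + fn); }
    static double tnr(int tn, int fp) { return 100.0 * tn / (tn + fp); }
    static double fpr(int fp, int tn) { return 100.0 * fp / (fp + tn); }
    static double fnr(int fn, int tp) { return 100.0 * fn / (fn + tp); }
    static double accuracy(int tp, int tn, int fp, int fn) {
        return 100.0 * (tp + tn) / (tp + tn + fp + fn);
    }
    static double precision(int tp, int fp) { return 100.0 * tp / (tp + fp); }
    static double fMeasure(double pre, double rec) { return 2 * pre * rec / (pre + rec); }

    public static void main(String[] args) {
        int tp = 80, fn = 20, tn = 90, fp = 10;   // illustrative counts only
        System.out.println(tpr(tp, fn));              // 80.0
        System.out.println(tnr(tn, fp));              // 90.0
        System.out.println(accuracy(tp, tn, fp, fn)); // 85.0
        System.out.println(fMeasure(precision(tp, fp), tpr(tp, fn)));
    }
}
```

Note that FPR and FNR are complements of TNR and TPR respectively, so the rate rows of Tables 4 and 5 always sum to 100% column-wise.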

sites is to observe the behavior of our application when encountered with popular and newly registered or non-popular sites. From the experimental results, we observed that Jail-Phish achieved a TNR of 100% with LD1, 99.39% with LD2 and 99.11% with LD3, respectively, as shown in Table 4. The average TNR of 99.50% shows the efficiency of the Google search engine in detecting legitimate domains. The decrease in TNR from LD1 through LD3 is due to the fact that low Alexa ranked sites are better indexed by search engines than high ranked sites. Note that, for an anti-phishing system usable in a real world scenario, the TNR should be very high, i.e., the misclassification of legitimate sites as phishing should be low.

4.4. Experiment 2: Comparison of Jail-Phish with Varshney et al. (2016b) with legitimate datasets

We also implemented the work of Varshney et al. (2016b) on the above mentioned legitimate datasets (LD1, LD2 and LD3) for comparison with our work. The main reason for choosing the Varshney et al. work over the other works is its similarity with our proposed method, where the main idea is to leverage the Google search engine with the search query domain + title rather than keywords extracted from the textual content. As mentioned earlier, keywords extracted from textual elements suffer from language dependency and fail with image based phishing sites (Fig. 5). Varshney et al. (2016b) used the Google custom search engine (GCSE) for their experimentation, but the weakness of GCSE is that it does not include universal search and real time results.5 Due to this behavior, the technique results in a high false positive rate when non-popular or newly registered domains are encountered. This technique also uses a static search query, domain + title, and therefore fails to detect legitimate sites with empty or vague titles.

To determine the importance of Google web search, we also implemented the Varshney work with Google web search (GWS). To ease the presentation, we term the Varshney work with GCSE and with GWS as M1 and M2, respectively. GWS is developed using Selenium WebDriver, and the search results are parsed using the Jsoup library.

The results are given in Table 4. It is observed that Jail-Phish outperformed M1 and M2 with an average TNR of 99.50%. M2 ranked better than M1 with an average TNR of 96.38%. Jail-Phish and M2 performed better than M1 due to the use of GWS, as it includes real time search results. Since M1 uses GCSE, it resulted in a very poor TNR and a high FPR; M1 with LD3 reached the maximum FPR of 16.01%. The difference in TNR between Jail-Phish and M1, M2 increases with the increase in high ranked websites and the inclusion of non-English websites. This difference is due to the lack of an efficient search query adapted to different kinds of websites. The concatenation of title and domain is not always useful for identifying legitimate sites, because sometimes high Alexa ranked sites are not returned in the search results when the title is appended to the domain, but are returned when only the domain is used as the search query. Some legitimate sites use vague titles like untitled, home, index, etc., leading to their absence from the SERP.

4.5. Experiment 3: Evaluation of Jail-Phish with phishing datasets

In this experiment, we test our application on two phishing datasets, PD1 and PD2. The dataset PD1 contains fresh phishing sites collected from 2 Sept 2018 (the day of experimentation) to 10 Sept 2018. The dataset PD2 contains old phishing sites which were still alive on the day of experimentation, i.e., 10 Sept 2018. The old phishing sites were collected with submission dates from 1 Jan 2017 to 30 Nov 2017. The reason for choosing phishing datasets from different time slots is that phishing sites which are alive for a longer time have a greater chance of being indexed by the search engines. Phishing sites hosted on compromised servers might live longer when not noticed by the owner of the website, which leads to these phishing sites being indexed by the search engines. We countered the phishing sites hosted on compromised sites with similarity based features.

We attempted to find the percentage of PSHCS and of phishing sites hosted on free hosting servers (PSHFHS) in both phishing datasets PD1 and PD2 by manually checking the suspicious URL and its home page. The proportions of PSHCS and PSHFHS are given in Table 5. The table also includes the total instances of non-English phishing sites, and the misclassified instances of PSHCS, PSHFHS and non-English phishing sites for Jail-Phish, M1 and M2. From the results, we found that 1889 out of 2655 sites are PSHCS in PD1, contributing 71.15% of the total websites, and 2237 out of 2729 are PSHCS in PD2, contributing 81.97% of the total websites. Note that phishing sites which are hosted

5 https://support.google.com/customsearch/answer/70392?hl=en.

Table 5 – Evaluation of Jail-Phish with PD1, PD2 and comparison with M1 and M2.

Datasets                             |      Our work       |         M1          |         M2
                                     | PD1      PD2        | PD1      PD2        | PD1      PD2
Final number of instances            | 2655     2729       | 2655     2729       | 2655     2729
Non-English instances                | 796      963        | 796      963        | 796      963
PSHCS instances                      | 1889     2237       | 1889     2237       | 1889     2237
PSHFHS instances                     | 350      113        | 350      113        | 350      113
Misclassified instances              | 9        111        | 94       141        | 173      328
Misclassified non-English instances  | 3        31         | 16       28         | 34       62
Misclassified PSHCS                  | 4        67         | 19       94         | 48       223
Misclassified PSHFHS                 | 2        29         | 33       39         | 61       56
False Negative Rate (%)              | 0.3390   4.0674     | 3.5404   5.1667     | 6.5160   12.0191
True Positive Rate (%)               | 99.6610  95.9326    | 96.4596  94.8333    | 93.484   87.9809
Average TPR (%)                      |      97.7968        |      95.6465        |      90.7325

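As an arithmetic cross-check on Table 5, the rate rows follow directly from the misclassification counts and dataset sizes above them; a minimal sketch (all counts copied from the table):

```python
# Recompute the FNR/TPR rows of Table 5 from its own counts.

def fnr(misclassified, total):
    """False negative rate (%) on a phishing dataset."""
    return 100.0 * misclassified / total

# method -> {dataset: (final number of instances, misclassified instances)}
counts = {
    "Jail-Phish": {"PD1": (2655, 9),   "PD2": (2729, 111)},
    "M1":         {"PD1": (2655, 94),  "PD2": (2729, 141)},
    "M2":         {"PD1": (2655, 173), "PD2": (2729, 328)},
}

for method, datasets in counts.items():
    tprs = []
    for name, (total, miss) in datasets.items():
        rate = fnr(miss, total)
        tprs.append(100.0 - rate)
        print(f"{method} {name}: FNR = {rate:.4f}%, TPR = {100.0 - rate:.4f}%")
    print(f"{method} average TPR = {sum(tprs) / len(tprs):.4f}%")
```

For example, Jail-Phish on PD1 gives FNR = 100 x 9 / 2655 = 0.3390% and TPR = 99.6610%, matching the table.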
on free hosting servers are not considered as PSHCS, since the domain is not compromised; rather, it provides a free hosting service. From the experimental results of Jail-Phish, we observed only 9 false negatives, leading to a TPR of 99.66%. The higher TPR on fresh phishing sites shows that our application is highly adaptable to the detection of fresh phishing sites in real time, including phishing sites developed with vague titles. We also tested our application on PD2, which resulted in a 95.93% TPR.

The difference in TPR between PD1 and PD2 is due to the indexing of phishing sites in the search engine: when the phishing site itself is indexed by the Google search engine, as shown in Fig. 7, Jail-Phish classifies the site as legitimate (false negative). There exist three types of phishing sites which can be indexed by the search engine. Firstly, PSHCS: Google indexing of phishing sites is mostly observed for PSHCS, as shown in Table 5. From the table, it is evident that the majority of the misclassified instances are PSHCS, i.e. 4 out of 9 in PD1 and 67 out of 111 in PD2. From these results, we conclude that PSHCS have a higher chance of being indexed by the search engine, which leads to misclassification. Jail-Phish misclassified only 4 PSHCS out of 1889 and 67 out of 2237, showing the importance of the similarity computation module in detecting PSHCS. Secondly, PSHFHS: similar to PSHCS, PSHFHS have a chance of being indexed by Google, but the number of such instances is not as high as for PSHCS. We observed two misclassified PSHFHS in PD1 and 29 in PD2. Finally, phishing sites hosted on paid servers might get indexed by the search engine, but their number is very low. With Jail-Phish, we encountered only 3 such instances (total misclassified - misclassified (PSHCS + PSHFHS)) in PD1 and 15 in PD2.

4.6. Experiment 4: Comparison of Jail-Phish with M1 and M2 with phishing datasets

We compared our work with M1 and M2 with respect to the PD1 and PD2 datasets. The results are shown in Table 5. From the results, it is observed that Jail-Phish outperformed the existing works with an average TPR of 97.7968%. M1 ranked better than M2 with an average TPR of 95.6465%. As M1 uses GCSE, it excludes real-time results; hence the probability of new phishing sites appearing in the search results is lower compared to M2 with GWS. Despite the use of GWS, our Jail-Phish outperformed the other works due to the additional component of similarity computation between the suspicious site and the matched site in the search results. Table 5 also shows that Jail-Phish could achieve a TPR of 99.6610% for PD1, which drops to 95.9326% for PD2. The reason for the drop is the limitation of the search engine, i.e. if the search engine itself indexed the phishing site, as shown in Fig. 7, then Jail-Phish would misclassify the suspicious site as legitimate. Although the TPR of Jail-Phish with PD2 is lower than with PD1, it can still be considered a significant detection rate. Note that the TPR of Jail-Phish is better than both M1 and M2 by a significant margin, as shown in Table 5. Even with respect to non-English and PSHCS sites, Jail-Phish outperformed M1 and M2 with a significant detection rate.

4.7. Experiment 5: Evaluation of similarity computation module

In this experiment, we attempt to identify the importance of the similarity computation module in detecting phishing websites. Jail-Phish checks the domain of the suspicious URL in the SERP with domain + title as the search query; if it is not found, it checks the domain of the suspicious URL in the SERP again with only the domain. If there is a match in the SERP, then the similarity is computed between the suspicious domain and the matched domain.

Jail-Phish without the similarity computation module is divided into two cases. Case 1: with SQ = domain + title, Jail-Phish checks the matching status of the suspicious domain against the domains in the SERP; if matched, the site is classified as legitimate, else phishing. Case 2: initially, SQ = domain + title is checked for matched domains in the SERP; if no match is found, SQ = domain is used for the matching of the suspicious domain in the SERP. If a match is found, the site is classified as legitimate, else phishing. Jail-Phish with the inclusion of the similarity computation module is considered as Case 3.

The results of each case are given in Table 6. Case 3 is already discussed in the previous experiments, whereas Case 1 is good for phishing site detection but fails in detecting non-English and non-popular sites. To overcome the limitation of Case 1, search processing is extended with SQ = domain for the domains absent in the SERP with SQ = domain + title. This might help in improving the detection of legitimate sites
Fig. 7 – Phishing site hosted on compromised server and indexed by search engine.
Table 6 – Evaluation of Jail-Phish with and without the similarity computation module.

                                   LD1    LD2    LD3    PD1    PD2    Total misclassified instances
Final dataset                      850    2631   2586   2655   2729   -
Misclassified instances (Case 1)   7      98     163    173    328    769
Misclassified instances (Case 2)   0      16     23     2249   2350   4638
Misclassified instances (Case 3)   0      16     23     9      111    159
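The three configurations compared in Table 6 can be sketched as follows; `serp_domains` and `similarity_score` are hypothetical helpers standing in for the SERP lookup and the similarity module, not the paper's actual implementation:

```python
# Sketch of the three configurations evaluated in Table 6.

def case1(domain, title, serp_domains):
    """SQ = domain + title; a domain match in the SERP means legitimate."""
    return "legitimate" if domain in serp_domains(f"{domain} {title}") else "phishing"

def case2(domain, title, serp_domains):
    """Case 1 with a fallback query SQ = domain before deciding."""
    if domain in serp_domains(f"{domain} {title}"):
        return "legitimate"
    return "legitimate" if domain in serp_domains(domain) else "phishing"

def case3(domain, title, serp_domains, similarity_score):
    """Case 2 plus the similarity check (Jail-Phish): a matched domain is
    accepted as legitimate only when the suspicious page shares some
    resources with the matched page (zero threshold)."""
    if (domain not in serp_domains(f"{domain} {title}")
            and domain not in serp_domains(domain)):
        return "phishing"
    return "legitimate" if similarity_score(domain) > 0 else "phishing"

# Toy example: a phishing page hosted on a compromised but indexed domain.
indexed = lambda query: {"compromised-blog.com"}   # domain always returned in SERP
no_overlap = lambda domain: 0.0                    # page shares nothing with the host site
print(case2("compromised-blog.com", "PayPal Login", indexed))              # legitimate (misclassified)
print(case3("compromised-blog.com", "PayPal Login", indexed, no_overlap))  # phishing (caught)
```

This illustrates why Case 2 misclassifies PSHCS while Case 3 catches them: the compromised domain is popular enough to be returned in the SERP, but the injected page shares nothing with the host website.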
but it also creates a major problem of classifying the PSHFHS and PSHCS sites as legitimate. Since most of the compromised and free domains are popular, they are returned in the SERP when queried with only the domain, leading to classification as legitimate. From Table 5, it is observed that 1889 PSHCS and 350 PSHFHS instances are present in PD1. When these sites are tested with Case 2, all of them are classified as legitimate due to their indexing in Google. Also, we observed 10 instances which are non-PSHCS and non-PSHFHS classified as legitimate. The overall misclassified instances in PD1 with Case 2 thus reached 2249 (1889 + 350 + 10). Similarly, PD2 with Case 2 reached 2350 due to the existence of PSHCS (2237) and PSHFHS (113). Overall, the number of misclassified instances with Case 3 is the least among the cases, which demonstrates the significance of the inclusion of the similarity computation module in Jail-Phish.

4.8. Experiment 6: Overall comparison of existing works with Jail-Phish

From the earlier experimental results, it is shown that our work outperformed the existing works with a significant gain in TNR and TPR with respect to the various legitimate (LD1-LD3) and phishing (PD1-PD2) datasets. In this section, we compare Jail-Phish with the existing works with respect to the overall legitimate and phishing sites, including LD1 to LD3 and PD1-PD2. The metrics mentioned earlier are used for the comparison. Before discussing the metrics on the overall data, we show the statistics of the percentage gain in TPR and TNR on both the phishing and legitimate datasets with respect to our work and the existing works. From Fig. 8 (a), it is observed that Jail-Phish has a very high percentage gain in TNR with LD3 when compared with M1. This shows that our proposed
Fig. 8 – Gain in TPR and TNR with existing works M1 and M2: (a) gain with M1; (b) gain with M2 (y-axis: percentage; x-axis: LD1, LD2, LD3, PD1, PD2).
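The bar heights in Fig. 8 for the phishing sets are plain differences of the TPR values reported in Table 5, gain = TPR(Jail-Phish) - TPR(existing work), and can be recomputed directly:

```python
# Recompute the Fig. 8 percentage gains on PD1/PD2 from the Table 5 TPRs.
tpr = {
    "Jail-Phish": {"PD1": 99.6610, "PD2": 95.9326},
    "M1":         {"PD1": 96.4596, "PD2": 94.8333},
    "M2":         {"PD1": 93.4840, "PD2": 87.9809},
}

for method in ("M1", "M2"):
    for ds in ("PD1", "PD2"):
        gain = tpr["Jail-Phish"][ds] - tpr[method][ds]
        print(f"TPR gain over {method} on {ds}: {gain:.4f}%")
# TPR gain over M1 on PD1: 3.2014%
# TPR gain over M1 on PD2: 1.0993%
# TPR gain over M2 on PD1: 6.1770%
# TPR gain over M2 on PD2: 7.9517%
```

The recomputed values match the gains quoted in the surrounding discussion (3.2014% and 1.0993% versus M1, 7.9517% versus M2 on PD2).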
Table 7 – All-metrics comparison of our work with existing works.

Metrics                                     Jail-Phish   M1      M2
# of legitimate instances (LD1+LD2+LD3)     6067         6067    6067
# of phishing instances (PD1+PD2)           5384         5384    5384
True Negative Rate (%)                      99.36        89.12   95.58
True Positive Rate (%)                      97.77        95.64   90.69
False Positive Rate (%)                     0.64         10.88   4.42
False Negative Rate (%)                     2.23         4.36    9.31
Precision (%)                               99.26        88.64   94.80
F-measure (%)                               98.51        92.00   92.70
Accuracy (%)                                98.61        92.18   93.28
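The aggregate metrics in Table 7 can be reproduced from the combined dataset sizes and the raw error counts implied by the reported rates; the counts of roughly 39 false positives (0.64% of 6067) and 120 false negatives (2.23% of 5384) for Jail-Phish are inferred here, not stated in the paper:

```python
# Reproduce the Jail-Phish column of Table 7 from inferred raw counts.
legit, phish = 6067, 5384   # LD1+LD2+LD3 and PD1+PD2 sizes (Table 7)
fp, fn = 39, 120            # counts inferred from the reported FPR/FNR
tp, tn = phish - fn, legit - fp

precision = tp / (tp + fp)
recall = tp / (tp + fn)                              # equals the TPR
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (legit + phish)

print(f"Precision {100 * precision:.2f}%  TPR {100 * recall:.2f}%  "
      f"F-measure {100 * f_measure:.2f}%  Accuracy {100 * accuracy:.2f}%")
# Precision 99.26%  TPR 97.77%  F-measure 98.51%  Accuracy 98.61%
```

Since F-measure is the harmonic mean of precision and recall, it rewards a classifier only when both are high, which is why it is the right summary statistic for these imbalanced classes.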
method performs effectively when non-popular or newly registered legitimate sites are encountered. Also, we observed a significant improvement in detecting old (PD2) and new (PD1) phishing sites, with percentage gains in TPR of 1.0993% and 3.2014%, respectively. Similarly, from Fig. 8 (b), it is observed that our application performed best with old phishing sites, with a percentage gain in TPR of 7.9517%. This significant improvement is achieved due to the additional similarity computation mechanism, which is used for detecting phishing sites hosted on compromised servers. From the previous Experiment 5, it is evident that the similarity computation module played a significant role in detecting PSHCS. The results in Table 6 show that the technique with Case 3 (Jail-Phish) achieved the least misclassification rate, followed by Case 1 (M2). Also, Fig. 8 shows that there is very little difference in the TNR of M2 and Jail-Phish when tested with LD1. This is due to the fact that popular legitimate sites are returned in the search results even with an inefficient search query, such as a domain with a vague title; for popular legitimate sites, the domain alone as the search query is sufficient for them to appear in the search results.

In order to get a clearer and better understanding of the detection performance, we considered accuracy, precision and F-measure as additional performance metrics. The TPR and FPR for the combined PD1, PD2 and LD1, LD2, LD3 for Jail-Phish, M1 and M2 are calculated and given in Table 7. From the table, it is observed that Jail-Phish outperformed M1 and M2 with an accuracy of 98.61%, followed by M2 with 93.28%. The precision of 99.26% shows that Jail-Phish correctly predicted 99.26% of phishing sites out of the total predicted phishing sites. Since the total instances in the legitimate and phishing classes are not balanced but are equally important, we considered the F-measure to evaluate our model. Jail-Phish achieved an F-measure of 98.51% and performed better than the existing M1 and M2 methods. The TPR of M1 is somewhat closer to Jail-Phish due to the use of GCSE, whereas the TNR of M2 is closer to Jail-Phish due to the use of GWS. The promising results of Jail-Phish, with a low FPR of 0.64% and a high TPR of 97.77%, justify the effectiveness and feasibility of detecting phishing sites using search engines. Note that, for phishing detection in a real-world scenario, a model is considered usable when it has a very low FPR (legitimate to phishing) and a high F-measure (Jain and Gupta, 2017; Sheng et al., 2009; Whittaker et al., 2010).

5. Discussion

5.1. Jail-Phish chrome extension

Our goal is to provide real-time protection from website phishing attacks. Hence, we built a Chrome extension which classifies the visited URL as phishing or legitimate. The extension is designed such that no extra clicks or key presses are required, i.e. on a single click, the extension displays the status of the website as legitimate or phishing. The Chrome extension is written in Javascript; it extracts the URL, title and DOM of the visited website from the browser and makes a connection to a REST API running at the remote server, where
Table 8 – Average time elapsed for the detection process.

Search query          baseURL matched in SERP   Domain matched in SERP   No domain matching in SERP
SQ = domain + title   3.41 s (Case C1)          7.52 s (Case C2)         -
SQ = domain           -                         9.89 s (Case C3)         5.63 s (Case C4)
the actual execution of the technique takes place. The REST API is hosted on an Intel Xeon 16-core Ubuntu server with a 2.67 GHz processor and 16 GB RAM.

The REST API feeds the above values to our Jail-Phish application running at the remote server, which then proceeds with the 6-stage execution (Section 3.3.2) for phishing detection. Note that the REST API is implemented with the Spring framework. We used the POST method for transferring the DOM, and the GET method for the URL and title, from the extension to Jail-Phish. Once Jail-Phish receives these values, it classifies the status of the website as phishing or legitimate based on the outputs of the modules discussed in Section 3.3.2.

Since our goal is to provide real-time phishing detection, the time for visiting the matched results and classifying must be very short. Hence, we parallelized the accessing of each matched result with the help of the multiple cores present in the system. Once the classification is done, Jail-Phish returns the status to the REST API, which forwards it to the extension as a response.

This whole process of execution took an average time of 4.23 s. However, this time depends on various factors such as the number of matched results, the speed of the Internet and the speed of the system. In our case, we used a bandwidth of 100 Mbps and, as mentioned earlier, a system with an Intel Xeon 16-core Ubuntu server with a 2.67 GHz processor and 16 GB RAM. The output of the extension is a popup window containing the status of the website, as shown in Fig. 9. In image (a) of Fig. 9, Jail-Phish detects the PayPal website as legitimate with a response time of 2.35 s, whereas image (b) shows the detection of a phishing site with a response time of 3.60 s in a popup window containing the status of the website and textual content recommending the user to close the webpage immediately.

We calculated the time delays for different cases. Cases C1 and C2 correspond to the time taken to detect the status of the website with the search query domain + title, and Cases C3 and C4 correspond to the time taken with the search query domain.

Case C1: the baseURL of the suspicious site matched the baseURL of one of the returned results (it is classified as legitimate). Here, baseURL is defined as protocol + hostname + pathname of the given URL.

Case C2: the domain of the suspicious URL matched a domain in the SERP (SQ = domain + title).

Case C3: the domain of the suspicious URL matched a domain in the SERP (SQ = domain).

Case C4: the domain of the suspicious URL did not match any of the domains in the SERP.

For the calculation of the time taken for the detection process, we tested our Chrome extension with 100 phishing sites and 100 legitimate sites. The average time delay for each case is given in Table 8. During the experimental study, we observed that the majority of popular legitimate sites fell under Case C1 and very few under Case C2. The non-English websites and non-popular legitimate sites fell into the category of Case C3. Finally, the majority of the phishing sites fell under the category of Case C4, which took less time than C2 and C3. The time delays for the legitimate sites can be further reduced with the help of a whitelist of legitimate domains. Jail-Phish incurs the computation cost of accessing the search engine and, at most, the cost of accessing the matched results in the SERP.

5.2. Effectiveness of Jail-Phish

In this section, we discuss the effectiveness of our proposed technique in detecting phishing websites hosted on compromised servers as well as generic phishing sites. Our primary objective in this work is to design a real-time application for phishing detection with a high TNR (legitimate to legitimate). It is also crucial that the time for the detection process be very short. We have used the Google search engine for the detection of phishing sites. The rationale behind choosing a search engine to detect phishing sites is that a search engine either rarely indexes or does not index phishing sites. The main reason for choosing Google is its wide usage in the existing literature. Also, according to Net Market Share,6 74.54% of the search engine market was held by Google in 2017. As it is the most popular and powerful player in the search engine market, we have chosen the Google search engine to implement our method. We did not test our method with other search engines, but it is adaptable to any search engine, such as Bing, Yahoo, DuckDuckGo or Baidu.

6 https://www.netmarketshare.com/search-engine-market-share.aspx

The textual language of the website is a bottleneck in most of the search engine based techniques, which leads to failure cases with non-English websites. Moreover, techniques relying on the textual content of the website for keyword extraction end up with empty keywords for image-based phishing. Hence, we have chosen a search query of domain + title, which is independent of the textual content of the website. But the considered query also resulted in more false positives when encountered with the majority of non-English and low-reputation websites. Hence, we extended the search processing with a new search query containing only the domain, which made the technique more powerful in detecting non-English and low-reputation legitimate sites.

But, due to the use of domain or domain + title as the search query, the technique also returned PSHCS in the SERP. This made the technique classify the PSHCS as legitimate, thus increasing the false negative rate. Hence, we used similarity computation between the visited website and the matched domain in the SERP. If the similarity is above a certain threshold
Fig. 9 – Output of Jail-Phish extension.
then it is classified as legitimate, else phishing. The rationale behind this decision is that the webpages within a website hold some level of similarity with respect to CSS, Javascript, images or anchor links. We have intentionally chosen the similarity threshold as zero for the classification of a website. As mentioned earlier in Stage 6, the reason for choosing a zero threshold is to reduce the number of false positives. But due to this zero threshold, Jail-Phish encounters some limitations, which are discussed below.

Limitations: Firstly, due to the use of zero as the similarity threshold, Jail-Phish might lose some phishing sites which reuse web resources of the compromised server.

Secondly, the performance of the proposed method is affected by the services provided by the search engine, since Jail-Phish completely relies on the search engine. For example, as shown in Fig. 7, if the search engine indexed the PSHCS, then it is directly classified as a legitimate site, leading to an increase in FNR.

Thirdly, it suffers from response delay due to the lag involved in querying the search engine and computing the similarity between the matched page and the suspicious page. Of course, the delay is unacceptable for some users, but others wanting zero-day protection will be willing to wait a few extra seconds to get the status of the website.

Finally, the legitimate sites which are hosted on free website builders or free domain hosting providers are classified as phishing. However, free hosting servers are never used by genuine corporations for their business transactions.
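The page-level similarity idea discussed above can be sketched as a Jaccard-style overlap of resource references (favicons, images, CSS, Javascript, anchor links) between the suspicious page and the matched SERP page. This is a minimal illustration, assuming resource extraction happens elsewhere; the sets and the exact overlap measure here are illustrative, not the paper's precise feature set:

```python
# Minimal sketch of the zero-threshold similarity decision (Section 5.2).

def resource_similarity(suspicious: set, matched: set) -> float:
    """Jaccard overlap of the two pages' resource references."""
    if not suspicious or not matched:
        return 0.0
    return len(suspicious & matched) / len(suspicious | matched)

def classify(suspicious: set, matched: set, threshold: float = 0.0) -> str:
    # Zero threshold: any shared resource at all is taken as evidence that
    # the suspicious page genuinely belongs to the matched website.
    return "legitimate" if resource_similarity(suspicious, matched) > threshold else "phishing"

# A phishing kit dropped on a compromised server usually shares nothing
# with the host site's own pages:
kit_page = {"/kit/login.css", "/kit/brand_logo.png", "/kit/submit.js"}
host_page = {"/favicon.ico", "/css/site.css", "/js/main.js"}
print(classify(kit_page, host_page))    # phishing

# An inner page of the real site shares its favicon and stylesheet:
inner_page = {"/favicon.ico", "/css/site.css", "/js/cart.js"}
print(classify(inner_page, host_page))  # legitimate
```

The zero threshold also makes the first limitation above concrete: if the kit page deliberately reuses even one resource of the compromised host, the overlap becomes non-zero and the page slips through.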
6. Conclusion

In this paper, we proposed a heuristic technique which uses search engine results and similarity-based features to detect phishing sites. We developed our proposed technique as an application named Jail-Phish, which takes a URL as input and displays the output as legitimate or phishing. The main advantage of our application is that it not only detects phishing sites with malicious registrations but also detects phishing sites which are hosted on compromised sites or free website builders. Our application, Jail-Phish, is language independent and does not depend on prior access to a database of target website resources or web history.

We evaluated our application with a popular legitimate set and an unpopular legitimate set and observed a TNR of 99.36%. Similarly, our application was also tested on an old phishing set and a new phishing set and observed a TPR of 97.77%. These results indicate that our application is adaptable to all kinds of legitimate and phishing sites; specifically, it performs best with long-lived phishing sites. The increase in true negative rate is achieved through the variable search query, whereas the true positive rate is achieved through the similarity-based mechanism.

In the future, we intend to include additional similarity-based features, and machine learning algorithms will be attempted to improve the accuracy of the application. Also, the response time for the more frequently visited legitimate sites can be reduced by adding the browsing history of URLs to the whitelist database. We also intend to include non-third-party based features for the detection of phishing sites hosted on compromised servers.

Conflicts of interest

The authors declare that they have no competing interests.

Acknowledgment

The authors would like to thank the Ministry of Electronics & Information Technology (MeitY), Government of India for their support of part of the research.

R E F E R E N C E S

Adebowale M, Lwin K, Sánchez E, Hossain M. Intelligent web-phishing detection and protection scheme using integrated features of images, frames and text. Expert Syst Appl 2018.
APWG. Phishing attack trends reports, fourth quarter 2016. http://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf, Accessed: 2017-03-03; 2016.
APWG. Phishing attack trends reports, first half 2017. 2017. http://docs.apwg.org/reports/apwg_trends_report_h1_2017.pdf, Accessed: 2018-01-01.
Atighetchi M, Pal P. Attribute-based prevention of phishing attacks. In: Proceedings of the eighth IEEE international symposium on network computing and applications, NCA 2009. IEEE; 2009. p. 266-9.
Basnet RB, Sung AH, Liu Q. Rule-based phishing attack detection. Proceedings of international conference on security and management, SAM 2011, Las Vegas, NV, 2011.
Britt J, Wardman B, Sprague A, Warner G. Clustering potential phishing websites using deepmd5. Proceedings of workshop on large-scale exploits and emergent threats, LEET, 2012.
Cao Y, Han W, Le Y. Anti-phishing based on automated individual white-list. In: Proceedings of the 4th ACM workshop on digital identity management. ACM; 2008. p. 51-60.
Chang EH, Chiew KL, Tiong WK, et al. Phishing detection via identification of website identity. In: Proceedings of the 2013 international conference on IT convergence and security, ICITCS. IEEE; 2013. p. 1-4.
Chiew KL, Chang EH, Tiong WK, et al. Utilisation of website logo for phishing detection. Comput Secur 2015;54:16-26. doi:10.1016/j.cose.2015.07.006.
Chiew KL, Choo JSF, Sze SN, Yong KS. Leverage website favicon to detect phishing websites. Secur Commun Netw 2018;2018. doi:10.1155/2018/7251750.
Choi H, Zhu BB, Lee H. Detecting malicious web links and identifying their attack types. WebApps 2011;11:11.
Cui Q, Jourdan GV, Bochmann GV, Couturier R, Onut IV. Tracking phishing attacks over time. In: Proceedings of the 26th international conference on world wide web. International World Wide Web Conferences Steering Committee; 2017. p. 667-76.
Drew J, Moore T. Automatic identification of replicated criminal websites using combined clustering. In: Proceedings of the 2014 IEEE security and privacy workshops, SPW. IEEE; 2014. p. 116-23.
Dunlop M, Groat S, Shelly D. Goldphish: using images for content-based phishing analysis. In: Proceedings of the 2010 fifth international conference on internet monitoring and protection, ICIMP. IEEE; 2010. p. 123-8.
Fu AY, Wenyin L, Deng X. Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD). IEEE Trans Depend Secure Comput 2006;3(4):301-11.
Garera S, Provos N, Chew M, Rubin AD. A framework for detection and measurement of phishing attacks. In: Proceedings of the 2007 ACM workshop on recurring malcode. ACM; 2007. p. 1-8.
Gastellier-Prevost S, Granadillo GG, Laurent M. Decisive heuristics to differentiate legitimate from phishing sites. In: Proceedings of the 2011 conference on network and information systems security, SAR-SSI. IEEE; 2011. p. 1-9.
Gowtham R, Krishnamurthi I. A comprehensive and efficacious architecture for detecting phishing webpages. Comput Secur 2014;40:23-37. doi:10.1016/j.cose.2013.10.004.
Hara M, Yamada A, Miyake Y. Visual similarity-based phishing detection without victim site information. In: Proceedings of the IEEE symposium on computational intelligence in cyber security, CICS'09. IEEE; 2009. p. 30-6. doi:10.1109/CICYBS.2009.4925087.
He M, Horng SJ, Fan P, Khan MK, Run RS, Lai JL, Chen RJ, Sutanto A. An efficient phishing webpage detector. Expert Syst Appl 2011;38(10):12018-27. doi:10.1016/j.eswa.2011.01.046.
Huh JH, Kim H. Phishing detection with popular search engines: simple and effective. In: Proceedings of international symposium on foundations and practice of security. Springer; 2011. p. 194-207. doi:10.1007/978-3-642-27901-0_15.
Jain AK, Gupta BB. Two-level authentication approach to protect from phishing attacks in real time. J Ambient Intell Hum Comput 2017. doi:10.1007/s12652-017-0616-z.
Lee S, Kim J. Warningbird: a near real-time detection system for suspicious urls in twitter stream. IEEE Trans Depend Secure Comput 2013;10(3):183-95.
Li Y, Yang Z, Chen X, Yuan H, Liu W. A stacking model using url and html features for phishing webpage detection. Futur Gener Comput Syst 2018;94:27-39.
Liu W, Deng X, Huang G, Fu AY. An antiphishing strategy based on visual similarity assessment. IEEE Internet Comput 2006;10(2):58.
Ma J, Saul LK, Savage S, Voelker GM. Beyond blacklists: learning to detect malicious web sites from suspicious urls. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 1245-54.
Mao J, Tian W, Li P, Wei T, Liang Z. Phishing-alarm: robust and efficient phishing detection via page component similarity. IEEE Access 2017;5:17020-30.
Marchal S, Armano G, Gröndahl T, Saari K, Singh N, Asokan N. Off-the-hook: an efficient and usable client-side phishing prevention application. IEEE Trans Comput 2017;66(10):1717-33.
Marchal S, Saari K, Singh N, Asokan N. Know your phish: novel techniques for detecting phishing sites and their targets. In: Proceedings of the 2016 IEEE 36th international conference on distributed computing systems, ICDCS. IEEE; 2016. p. 323-33.
Moghimi M, Varjani AY. New rule-based phishing detection method. Expert Syst Appl 2016;53:231-42. doi:10.1016/j.eswa.2016.01.028.
Mohammad RM, Thabtah F, McCluskey L. An assessment of features related to phishing websites using an automated technique. In: Proceedings of the 2012 international conference for internet technology and secured transactions. IEEE; 2012. p. 492-7.
Mohammad RM, Thabtah F, McCluskey L. Intelligent rule-based phishing websites classification. IET Inf Secur 2014;8(3):153-60.
Moore T, Clayton R. Examining the impact of website take-down on phishing. In: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit. ACM; 2007. p. 1-13.
Nguyen LAT, To BL, Nguyen HK, Nguyen MH. A novel approach for phishing detection using url-based heuristic. In: Proceedings of the 2014 international conference on computing, management and telecommunications, ComManTel. IEEE; 2014. p. 298-303.
Pao HK, Chou YL, Lee YJ. Malicious url detection based on kolmogorov complexity estimation, 01. IEEE Computer Society; 2012. p. 380-7.
Prakash P, Kumar M, Kompella RR, Gupta M. Phishnet: predictive blacklisting to detect phishing attacks. In: Proceedings IEEE INFOCOM. IEEE; 2010. p. 1-5. doi:10.1109/INFCOM.2010.5462216.
Ramesh G, Krishnamurthi I, Kumar KSS. An efficacious method for detecting phishing webpages through target domain identification. Decis Support Syst 2014;61:12-22. doi:10.1016/j.dss.2014.01.002.
Rao RS, Ali ST. A computer vision technique to detect phishing attacks. In: Proceedings of the 2015 fifth international conference on communication systems and network technologies, CSNT. IEEE; 2015. p. 596-601. doi:10.1109/CSNT.2015.68.
Rao RS, Pais AR. An enhanced blacklist method to detect phishing websites. In: Proceedings of the international conference on information systems security. Springer; 2017. p. 323-33.
Rao RS, Pais AR. Detection of phishing websites using an efficient feature-based machine learning framework. Neural Comput Appl 2018. doi:10.1007/s00521-017-3305-0.
Rosiello AP, Kirda E, Ferrandi F, et al. A layout-similarity-based approach for detecting phishing pages. In: Proceedings of the third international conference on security and privacy in communications networks and the workshops, SecureComm 2007. IEEE; 2007. p. 454-63.
RSA. Rsa fraud report. 2013. https://www.emc.com/collateral/fraud-report/rsa-online-fraud-report-012014.pdf, Accessed: 2016-07-15.
Sahingoz OK, Buber E, Demir O, Diri B. Machine learning based phishing detection from urls. Expert Syst Appl 2018;117:345-57.
Sheng S, Wardman B, Warner G, Cranor LF, Hong J, Zhang C. An empirical analysis of phishing blacklists. 2009.
Shirazi H, Bezawada B, Ray I. Kn0w thy doma1n name: unbiased phishing detection using domain name based features. In: Proceedings of the 23rd ACM symposium on access control models and technologies. ACM; 2018. p. 69-75.
Su KW, Wu KP, Lee HM, Wei TE. Suspicious url filtering based on logistic regression with multi-view analysis. In: Proceedings of the 2013 eighth Asia joint conference on information security, Asia JCIS. IEEE; 2013. p. 77-84.
Tan CL, Chiew KL, Wong K, Sze SN. Phishwho: phishing webpage detection via identity keywords extraction and target domain name finder. Decis Support Syst 2016;88:18-27. doi:10.1016/j.dss.2016.05.005.
Thomas K, Grier C, Ma J, Paxson V, Song D. Design and evaluation of a real-time url spam filtering service. In: Proceedings of the 2011 IEEE symposium on security and privacy, SP. IEEE; 2011. p. 447-62.
Tout H, Hafner W. Phishpin: an identity-based anti-phishing approach, 3. IEEE; 2009. p. 347-52.
Varshney G, Misra M, Atrey PK. Improving the accuracy of search engine based anti-phishing solutions using lightweight features. In: Proceedings of the 2016 11th international conference for internet technology and secured transactions, ICITST. IEEE; 2016a. p. 365-70.
Varshney G, Misra M, Atrey PK. A phish detector using lightweight search features. Comput Secur 2016b;62:213-28. doi:10.1016/j.cose.2016.08.003.
Verma R, Dyer K. On the character of phishing urls: accurate and robust statistical learning classifiers. In: Proceedings of the 5th ACM conference on data and application security and privacy. ACM; 2015. p. 111-22.
Verma R, Shashidhar N, Hossain N. Detecting phishing emails the natural language way. In: Proceedings of the European symposium on research in computer security. Springer; 2012a. p. 824-41.
Verma R, Shashidhar N, Hossain N. Two-pronged phish snagging. In: Proceedings of the 2012 seventh international conference on availability, reliability and security, ARES. IEEE; 2012b. p. 174-9.
Wang W, Shirley K. Breaking bad: detecting malicious domains using word segmentation. arXiv preprint arXiv:1506.04111; 2015.
Whittaker C, Ryner B, Nazif M. Large-scale automatic classification of phishing pages. Proceedings of network and distributed system security symposium, NDSS '10, 2010.
Xiang G, Hong J, Rose CP, Cranor L. Cantina+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur 2011;14(2):21. doi:10.1145/2019599.2019606.
Xiang G, Hong JI. A hybrid phish detection approach by identity discovery and keywords retrieval. In: Proceedings of the 18th international conference on world wide web. ACM; 2009. p. 571-80.
Xu L, Zhan Z, Xu S, Ye K. Cross-layer detection of malicious websites. In: Proceedings of the third ACM conference on data and application security and privacy. ACM; 2013. p. 141-52.
Zhang J, Porras PA, Ullrich J. Highly predictive blacklisting. In: Proceedings of the USENIX security symposium; 2008. p. 107-22.
Zhang W, Jiang Q, Chen L, Li C. Two-stage elm for phishing web pages detection using hybrid features. World Wide Web 2017;20(4):797-813. doi:10.1007/s11280-016-0418-9.
Zhang Y, Hong JI, Cranor LF. Cantina: a content-based approach to detecting phishing web sites. In: Proceedings of the 16th international conference on world wide web. ACM; 2007. p. 639-48. doi:10.1145/1242572.1242659.
Zhao P, Hoi SC. Cost-sensitive online active learning with application to malicious url detection. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2013. p. 919-27.
Zhou Y, Zhang Y, Xiao J, Wang Y, Lin W. Visual similarity based anti-phishing with the combination of local and global features. In: Proceedings of the 2014 IEEE 13th international conference on trust, security and privacy in computing and communications, TrustCom. IEEE; 2014. p. 189-96.

Routhu Srinivasa Rao is a Research Scholar in the Department of Computer Science and Engineering, NITK Surathkal, India. He completed his B.Tech. (Computer Science and Engg.) from SRKR Engineering College, Andhra University, India and M.Tech. (Computer Science and Engg.) from NIT Kurukshetra, Haryana, India. His areas of interest include Information Security, Cyber Security and Phishing.

Alwyn Roshan Pais is an Assistant Professor in the Department of Computer Science and Engineering, NITK Surathkal, India. He completed his B.Tech. (CSE) from Mangalore University, India, M.Tech. (CSE) from IIT Bombay, India and Ph.D. from NITK, India. His areas of interest include Information Security, Image Processing and Computer Vision.