Professional Documents
Culture Documents
ABSTRACT As a crime of employing technical means to steal sensitive information of users, phishing
is currently a critical threat facing the Internet, and losses due to phishing are growing steadily. Feature
engineering is important in phishing website detection solutions, but the accuracy of detection critically
depends on prior knowledge of features. Moreover, although features extracted from different dimensions
are more comprehensive, a drawback is that extracting these features requires a large amount of time.
To address these limitations, we propose a multidimensional feature phishing detection approach based on
a fast detection method by using deep learning. In the first step, character sequence features of the given
URL are extracted and used for quick classification by deep learning, and this step does not require third-
party assistance or any prior knowledge about phishing. In the second step, we combine URL statistical
features, webpage code features, webpage text features, and the quick classification result of deep learning
into multidimensional features. The approach can reduce the detection time for setting a threshold. Testing
on a dataset containing millions of phishing URLs and legitimate URLs, the accuracy reaches 98.99%, and
the false positive rate is only 0.59%. By reasonably adjusting the threshold, the experimental results show
that the detection efficiency can be improved.
INDEX TERMS Phishing website detection, convolutional neural network, long short-term memory
network, semantic feature, machine learning.
2169-3536
2019 IEEE. Translations and content mining are permitted for academic research only.
15196 Personal use is also permitted, but republication/redistribution requires IEEE permission. VOLUME 7, 2019
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
P. Yang et al.: Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning
Blacklists and whitelists are widely used in phishing method has higher accuracy, it requires extracting features
website detection. The current common browsers integrate from different aspects, resulting in longer detection time.
blacklists and whitelists to protect users from phishing In contrast, the method for the URL character sequences
attacks. Google provides a blacklist of malicious websites only needs to process the URL, and the detection time is
that is continuously updated. Users can check the security of short. To balance the contradiction between detection time
URL links through Google Safe Browsing APIs [6]. Phishing and accuracy, we improve the output judgment condition of
website detection based on blacklists and whitelists is easy the softmax classifier in the deep learning process by setting
to implement with high running speed and a low false pos- a threshold to reduce the detection time. If the result of deep
itive rate. However, according to statistics [7], 47%-83% of learning is not less than the specified threshold, the detection
phishing websites are added to blacklists after 12 hours, and result is directly output; otherwise, go to the second step of
63% of phishing websites have a lifespan of only 2 hours; detection.
thus, the updating of the blacklist is far behind the gen- In particular, our key contributions in this work are listed
eration of phishing websites. In addition to blacklist and as follows:
whitelist, machine learning methods are widely used in phish- • With the phishing website detection as a two-category
ing website detection [8], [9]. The reason is that malicious processing model, we formally define the problem of
URLs or phishing webpages have some characteristics that phishing detection and give a specific formal description
can be distinguished from legitimate websites, and machine of the MFPD approach.
learning can be effective in this regard for processing. • We build a real dataset by crawling a total of 1 021
Current mainstream machine learning methods of phishing 758 phishing URLs as positive samples from phish-
website detection extract statistical features from the URL tank.com, and a total of 989 021 legitimate URLs as
and the host [10] or extract relevant features of the webpage, negative samples from dmoztools.net.
such as the layout, CSS, text [11], [12], and then classify • The process of phishing website detection using MFPD
these features. However, these methods only analyze the is explained, and an extensive experiment on the dataset
URL or extract features from a single perspective, which we built is conducted. The results show that our pro-
makes it difficult to extract the complete attributes of phishing posed approach exhibits good performance in terms of
websites. Moreover, some unreasonable features may reduce accuracy, false positive rate, and speed.
the accuracy of detection. The character sequence of the • A dynamic category decision algorithm (DCDA) is pro-
URL is natural, automatically generated feature that avoids posed. By revising the output judgment conditions of
the subjectivity of artificially selected features. In addition, the softmax classifier in the deep learning process and
it does not require third-party assistance and any prior knowl- setting a threshold, the detection time can be reduced.
edge about phishing. However, in the process of character The paper is organized as follows. In Section II,
sequencing, the difficulty is to effectively extract association we present related work on phishing website detection.
and semantic information. Then, in Section III, we introduce the framework of MFPD.
To address these problems, we propose a multidimensional In Section IV, we describe the detailed process of the MFPD,
feature phishing detection approach based on a fast detection which includes the CNN-LSTM and multidimensional fea-
method by using deep learning (MFPD). In the first step, tures. The performance of the proposed approach is evaluated
character sequence features of the given URL are extracted in Section V. Finally, in Section VI, we conclude the paper
and used for quick classification by deep learning. Specially, and discuss future work.
the CNN (convolutional neural network) is used to extract
local correlation features through a convolutional layer. In a II. RELATED WORK
URL, each character may be related to nearby characters. In this section, we describe the phishing website detection
Generally speaking, a phishing website is likely to mimic method based on machine learning, including traditional
the URL of a legitimate website by changing or adding some methods and deep learning methods.
characters. This can cause the sequential dependency of the The phishing website detection based on machine learn-
phishing URL to be different from the phishing URL. The ing is a hotspot of current phishing website detection
LSTM network can effectively learn the sequential depen- research. The results of machine learning methods usually
dency from character sequences. Therefore, the LSTM (long depend on the quality of the extracted features. The focus of
short-term memory) network is employed to capture con- current research is on how to extract and select more effective
text semantic and dependency features of URL character features before processing them.
sequences, and at finally softmax is used to classify the Resources on the Internet are addressed by URLs, which
extracted features. We call the first step CNN-LSTM. From consist of the Hostname and FreeURL. The typical URL
a comprehensive perspective, in the second step, we combine structure is shown in Fig. 1.
URL statistical features, webpage code features, webpage Considering a phishing URL that imitates PayPal ‘‘http://
text features and the classification result of deep learning cancellation-paypal.us-com.15ffe4fd8f.com/signin/’’ as an
into multidimensional features, which are then classified by example. The structure is as follows:
XGBoost. Although the multidimensional feature detection • Protocol: http
neural network, and its neurons are sparsely connected, which describe phishing website detection as:
reduces the complexity of the network model and improves
∀ui ∈ D, Ci0 = t(ui ) (1)
training performance.
Traditional neural network models are not suitable for The objective function is:
processing time series problems, while RNN (Recent Neural n n
X X
Network) is good at dealing with time series problems of O(u) = max(n − (t(ui ) ⊕ Ci )/ tci ) (2)
data with learning the previous information [30]. The cur- i=1 i=1
rent output is not only related to the current input but also
The essence of solving the phishing website detection
related to the previous output. The LSTM (long short-term
problem is to find a suitable function t such that the objective
memory) network is an improved version of the traditional
function obtains the maximum value.
RNN model, which can solve the long-distance dependence
problem caused by ‘‘gradient dispersion’’ in the RNN.
B. THE FRAMEWORK OF MFPD
At present, there are some studies on phishing website
detection based on deep learning. Selvaganapathy et al. [31] In this paper, we built the framework of the proposed
proposed a phishing URL detection algorithm using stacked approach, referred to as MFPD. MFPD can be described by
restricted Boltzmann machine for feature selection and deep the following four definitions.
neural networks as classifiers. Then, multiple detections were Definition 2 (Character Embedding of ui ): Let the fixed
constructed using IBK-kNN, Binary Relevance, and Label length of the URL ui be L; then, m = 97 × L according to
Powerset with SVM. This model improves the accuracy of Table 1. For ui , the length of the URL character sequence is
detection by combining the recognition results of multiple unified based on the formula (5)ei = URLF(ui ) and encoded
classifiers. Bahnsen et al. [32] extracted the syntax and sta- based on Table 1, gi = Asc(ei ). Then, regarding gi as a vector
and −→ j j
tistical characteristics of the URL, and then classified the gi = (g1i , g2i , . . . , gi )T , gi indicates the j-th elements
character sequence of the URL using LSTM. By comparing in the vector, 1 ≤ j ≤ m. All URLs form a matrix G,
with RF, experiments showed that LSTM was better than RF. G = Gm×n = (− →
g1 , − →
g2 , . . . , −
→
gn ). Finally, the embedding
Based on the above analysis, we regard the URL strings network is used to reduce the sparsity of G. Letting the
as URL character sequences, which are natural features that network weight be V , V ∈ Rp×m , the result of character
do not require prior knowledge about phishing. In the pro- embedding is:
v11 . . . v1m g11 . . . g1n
cessing of URL character sequences, we refer the idea of
the literature [33] to treat the URL as a sequence of text S = VG = ... ..
.
.. × ..
. .
..
.
..
.
string and quantize the URL at the character level. Therefore,
we take advantages of CNN to extract the local features of the vp1 · · · vpm gm1 ··· gmn
sequence, and then use take advantages of LSTM to extract = ( s1 , s2 , . . . , −
−
→ −
→ →
sn ) (3)
the context semantic features of the sequence, and finally the
and S ∈ Rp×n . The process of character embedding is
extracted features are classified by softmax.
T : U → S.
A. PROBLEM STATEMENT
Suppose we are given a set U consists of all URLs.
U = {u|u = xi , xi ∈ url, i ∈ N + }, |U | = n. Let Cp as a Definition 3 (CNN-LSTM): Let W be the weight of
set indicating phishing, Cp = {c|c = p, p ∈ phishing}, Cl as the CNN-LSTM network. C 00 is the result of CNN-LSTM
a set indicating legitimate, Cl = {c|c = l, l ∈ legitimate}, processing.
and C = Cp ∪ Cl , ui is a suspicious URL. Formally, phishing For the set U , U = Utrain ∪ Utest , and Utrain ∩ Utest = ∅.
website detection problem can be defined as follows: The training and testing processes for this network are:
Definition 1 (Phishing Detection of ui ): Let P(C) as the
power set of C, the time cost of detecting ui is tci . Defining Train(∪i∈t ui , ∪i∈t Ci ) → W , ui ∈ Utrain , |Utrain | = t,
function t : U → P(C), t as a mapping relationship needs Test(∪j∈h uj , W ) → C , uj ∈ Utest , |Utest | = h.
00
to be found to implement the detection of ui . Let Ci0 as the
Ci0 , Ci as the category to
calculated result by t, and C 0 = ∪i∈n( Therefore, the CNN-LSTM can be described as follows:
0 legitimate Strain = T (Utrain ), Stest = T (Utest ),
the URL ui , Ci ∈ C, obviously Ci = . We can
1 phishing C = Test(Stest , Train(Strain , ∪j∈h Cj )),
00
|Strain | = h.
Definition 4 (Multidimensional Features): Let the URL time cost. Therefore, to achieve real-time detection, at the last
statistical features be defined as set Fu, the webpage code module DCDA, we improve the classification output of the
features be set Fc, and the webpage text features be set Ft. softmax classifier. The threshold is used to determine whether
For C 00 obtained from CNN-LSTM, the multidimensional to focus on accuracy or real-time detection. Moreover, when
features are: the URL is unreachable, the output of softmax is directly used
as the result of the detection.
F = C 00 ∪ Fu ∪ Fc ∪ Ft, C 0 = XGBoost(F)
Definition 5 (DCDA): We revised the softmax in the IV. OVERVIEW OF ALGORITHM
CNN-LSTM. Let p0 be the probability of a legitimate website In this section, we introduce in detail the process of data
output by the softmax, while p1 is the probability of a preprocessing and the construction and training of the
phishing website output, and let α be the threshold. CNN-LSTM algorithm, the phishing detection method based
max(p0 , p1 ) on multidimensional features, and the optimization strategy
So = , p1 = 1 − p0 (4) DCDA for balancing accuracy and speed.
min(p0 , p1 )
if So > α or di ∈ unaccessible, C 0 = arg max(p0 , p1 );
otherwise, go to the Multidimensional Features. A. THE CNN-LSTM ALGORITHM
The framework of our proposed approach is divided into The CNN-LSTM algorithm contains three parts as shown
three modules, as shown in Fig. 2. The first is the CNN- in Fig. 3: URL embedded representation, feature extraction,
LSTM module, which contains data preprocessing, feature classification.
extraction and classification. The data consists of a large At the embedded representation stage of the URL, the URL
number of legitimate and phishing URLs collected from the character sequence is normalized to a fixed-size sequence by
Internet. The URL character sequences are preprocessed, intercepting or zero-filling, and then, the normalized string is
which includes length normalization, uniform encoding, and converted into a one-hot code sequence according to Table 1.
using an embedding layer to reduce the sparsity of the data. Then, the sparse one-hot matrix is converted into a dense
In feature extraction, the CNN is used to extract local features, character embedding matrix through the embedding layer.
and LSTM is used to extract context dependency. We use In the feature extraction stage, the local deep correlation
softmax to classify the clean-up features. The second module feature is extracted from the embedding matrix through the
defines the Multidimensional Features, which are based on convolutional layer and maximum pooling of the CNN. Then,
URL statistical features, webpage code features, webpage the result of the pooling is input to the LSTM neural network
text features. The result of the CNN-LSTM is then used by to capture the context of the URL sequence. In the classifica-
XGBoost for classification. The second module has greater tion stage, the output of the last moment of the LSTM neural
accuracy than the CNN-LSTM module but also has more network is input to the softmax unit. To prevent overfitting,
Algorithm 1 The CNN-LSTM Algorithm TABLE 2. URL and webpage code features.
Input: The training set Utrain , the testing set Utest , ui ∈
Strain , u0i = Stest .
Output: The probability of phishing P(u0i ).
1: K = num of sliding-window, k = size of sliding step,
p = size of pooling-window, β = threshold of the loss
function L(x, y). Weightembedding = V , V ∈ Rp×m , G ∈
Rm×n .
2: U = Utrain ∪ Utest , l = |U | , G = ∅, N = dK /pe,
3: for i in l do
4: ui ∈ U , ei = URLF(ui )
5: gi = Asc(ei )
6: G = G ∪ gi
7: end for
8: S = VG = (− →s1 , −
→
s2 , . . . , −
→
sn )
9: for f in Q do
10: for i in C do
hi = σ (Wf · (− →si , − →, . . . , −−−→) + b )
f
11: si+1 si+k−1 f
12: end for
13: end for
14: for f in Q do
15: for k in N do
f f f f
16: pj = Max(h(j−1)p , h(j−1)p+1 , . . . , hjp−1 )
17: end for
18: end for− → − → −
→ → −
− →
19: Hp = ( p1 , p2 , . . . , pj , . . . , ps )T , pj ∈ RN ×1
20: H = (h1 , h2 , . . . , hN ) = lstm(Hp )
21: C 00 = softmax(hN )
22: while L(C 00 , C) > β
23: W = Train(ui , Ci )
24: end while
25: P(di0 ) = Test(u0i , W )
26: return P(u0 )
Algorithm 2 The Multidimensional Feature Algorithm FIGURE 8. Dynamic adjustment of the threshold.
The remaining data in DATA are built into the dataset DATA2, negative rate and the detection cost and improves the accuracy
which is DATA1 ∩ DATA2 = ∅. DATA1 is used to verify rate and the detection speed.
the effectiveness of the multidimensional feature algorithm
and DCDA, and DATA2 is used to verify the effectiveness of B. EXPERIMENT ON THE CNN-LSTM
the deep learning algorithm CNN-LSTM. The distribution of This experiment is performed on DATA2 with 5-fold
data is shown in Table 3. cross-validation. Four sets are used as training sets, the
remaining set is used as a test set. First, the parameters of the
TABLE 3. Data distribution. CNN-LSTM algorithm need to be adjusted. The experiment
finds that the average length of legitimate website samples in
dataset DATA is 34.7, the average length of phishing website
samples is 87.3, the average length of all the data is 61.5, and
the length of URLs exceeding 96.3% is below 200.
Therefore, we set the URL fixed length L = 200. The
Let Np indicate the total number of phishing websites, average training curve of cross-validation in the CNN-LSTM
Nn indicate the total number of legitimate websites, Np→n is shown in Fig. 9.
indicate the phishing websites that were misclassified as
legitimate, Nn→p indicate the legitimate websites that were
misclassified as phishing, Np→p indicate the phishing web-
sites that were classified as phishing, and Nn→n indicate the
legitimate websites that were classified as legitimate. We use
accuracy, FPR (false positive rate), FNR (false negative rate),
cost, and detection time to evaluate the effectiveness of the
detection approach we proposed.
Accuracy measures the proportion of all correctly classi-
fied data out of the total data.
Np→p + Nn→n
Accuracy = (11)
Np + Nn
FPR (false positive rate) measures the rate of all legitimate
websites that were misclassified as phishing out of the total
legitimate websites.
FIGURE 9. CNN-LSTM average training curve.
Nn→p
FPR = (12)
Nn When the number of training epochs reaches 20, the accu-
FNR (false negative rate) measures the rate of all phishing racy of the test set is nearly stable; thus, in order to reduce the
websites that were misclassified as phishing out of the total training time and prevent overfitting, we set epochs = 20.
phishing websites. To verify the effect of the CNN-LSTM algorithm, three
Np→n classical deep neural networks, CNN, RNN and LSTM,
FNR = (13) are compared in this experiment. The structure of the
Np
CNN-LSTM algorithm is Input->Conv->Maxpool->
Although misjudging a phishing website as a legitimate LSTM->Softmax. For fairness of the experimental compari-
website may bring security risks, the number of phishing son, the network structures of CNN-CNN, RNN-RNN and
websites in the real world is far less than the number of LSTM-LSTM are compared, whose structure are Input->
legitimate websites, and the risk can be reduced by impart- Conv->Maxpool->Conv->GlobalMaxpool->Softmax,
ing phishing knowledge to users. In addition, very small Input->RNN1->RNN2->Softmax, Input->LSTM1->
false positive rates may also cause high misidentification. LSTM2-> Softmax, respectively.
Moreover, a legitimate website misjudged as a phishing We perform calculations on a high-performance server
website may instill inconvenience and trust issues to the with 64G of memory, a E5-2683 v3 CPU, and GTX 1080ti
operators of the website. Therefore, we hold the view that the GPUs, ensuring that deep learning models can be iterated
consequences are more serious than the former. We propose quickly in dealing with large data volumes. The average
the detection cost to comprehensively evaluate FPR and FNR, experimental results for DATA2 are shown in Table 4:
as follows: From Table 4, Fig. 10 and Fig. 11, the following conclu-
sions can be drawn:
Cost = FNR + λ × FPR, λ>1 (14)
The CNN runs at the highest speed, but the detection effect
We set λ = 5 in the experiment. is the worst; the CNN-CNN is slightly slower, and the detec-
The goal of phishing website detection is to find the max tion effect is mediocre. The RNN and the RNN-RNN have
value of O(u), which reduces the false positive rate, the false medium speed, the detection effect is mediocre. The detection
TABLE 4. Comparison of different models on DATA2. of the network. Since the number of samples in DATA1 is
small, the batchsize value is 64.
The effect of the two experiments is shown in Fig. 12. It can
be seen that the CNN-LSTM model trained in the DATA2
has better effect on DATA1, which confirms that huge data
samples are required for deep learning. However, the effect
is worse than the 5-fold cross-validation of the CNN-LSTM
algorithm on DATA2, as shown in Table 4, because the sam-
ples in DATA1 are currently accessible URLs, whereas the
phishing website URLs in DATA2 are historical data and
currently inaccessible. Phishing website URLs have different
characteristics in different periods due to the different target
websites they imitate. Note that the CNN-LSTM model used
in the following experiments is trained with DATA2.
FIGURE 10. Accuracy and training time per epoch of different models on
DATA2.
FIGURE 15. Four ensemble learning algorithms for classification using TABLE 5. Comparison of MFPD and other approaches.
multidimensional features.
REFERENCES
[1] (2018). Phishing Attack Trends Re-Port-1Q. Accessed: May 5, 2018.
FIGURE 20. Detection number curve under different thresholds. [Online]. Available: https://apwg.org/resources/apwg-reports/
[2] (2017). Kaspersky Security Bulletin: Overall Statisticals For. Accessed: [27] A. K. Jain and B. B. Gupta, ‘‘Towards detection of phishing websites on
Jul. 12, 2018. [Online]. Available: https://securelist.com/ksb-overall- client-side using machine learning based approach,’’ Telecommun. Syst.,
statistics-2017/83453/ vol. 68, no. 4, pp. 687–700, Aug. 2018.
[3] A. Y. Ahmad, M. Selvakumar, A. Mohammed, and A.-S. Samer, ‘‘TrustQR: [28] J. Zhang, Y. Ou, D. Li, and Y. Xin, ‘‘A prior-based transfer learning
A new technique for the detection of phishing attacks on QR code,’’ Adv. method for the phishing detection,’’ J. Netw., vol. 7, no. 8, pp. 1201–1207,
Sci. Lett., vol. 22, no. 10, pp. 2905–2909, Oct. 2016. Aug. 2012.
[4] C. C. Inez and F. Baruch, ‘‘Setting priorities in behavioral interventions: [29] L. Chang et al., ‘‘Convolutional neural networks in image understanding,’’
An application to reducing phishing risk,’’ Risk Anal., vol. 38, no. 4, Acta. Automatica Sinica, vol. 42, no. 9, pp. 1300–1312, Jul. 2016.
pp. 826–838, Apr. 2018. [30] Z. C. Lipton, J. Berkowitz, and C. Elkan. (Oct. 2015). ‘‘A critical review
[5] G. Diksha and J. A. Kumar, ‘‘Mobile phishing attacks and defence mecha- of recurrent neural networks for sequence learning.’’ [Online]. Available:
nisms: State of art and open research challenges,’’ Comput. Secur., vol. 73, https://arxiv.org/abs/1506.00019
pp. 519–544, Mar. 2018. [31] S. G. Selvaganapathy, M. Nivaashini, and H. P. Natarajan, ‘‘Deep belief
[6] Google Safe Browsing APIs. Accessed: Oct. 1, 2018. [Online]. Available: network based detection and categorization of malicious URLs,’’ Inf. Secur.
https://developers.google.com/safe-browsing/v4/ J., Global Perspective, vol. 27, no. 3, pp. 145–161, Apr. 2018.
[7] S. Sheng, B. Wardman, G. Warner, L. Cranor, J. Hong, and C. Zhang, [32] A. C. Bahnse, E. C. Bohorquez, S. Villegas, J. Vargas, and F. A. Gonález,
‘‘An empirical analysis of phishing blacklists,’’ in Proc. 6th Conf. Email ‘‘Classifying phishing URLs using recurrent neural networks,’’ in Proc.
Anti-Spam (CEAS), Jul. 2009, pp. 59–78. IEEE APWG Symp. Electron. Res. (eCrime), Apr. 2017, pp. 1–8.
[8] A. K. Jain and B. B. Gupta, ‘‘A novel approach to protect against phishing [33] X. Zhang, J. Zhao, and Y. LeCun, ‘‘Character-level convolutional networks
attacks at client side using auto-updated white-list,’’ EURASIP J. Inf. for text classification,’’ in Proc. 28th Int. Conf. Neural Inf. Process. Syst.
Secur., vol. 2016, no. 1, Dec. 2016, Art. no. 34. (NIPS), Dec. 2015, pp. 649–657.
[9] M. Zouina and B. Outtaj, ‘‘A novel lightweight URL phishing detection [34] Y. Xiao and K. Cho. (Feb. 2016). ‘‘Efficient character-level document
system using SVM and similarity index,’’ Hum.-Centric Comput. Inf. Sci., classification by combining convolution and recurrent layers.’’ [Online].
vol. 7, no. 1, p. 17, Jun. 2017. Available: https://arxiv.org/abs/1602.00367
[10] E. Buber, Ö. Demir, and O. K. Sahingoz, ‘‘Feature selections for the [35] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and
machine learning based detection of phishing websites,’’ in Proc. IEEE R. Salakhutdinov. (Jul. 2012). ‘‘Improving neural networks by
Int. Artif. Intell. Data Process. Symp. (IDAP), Sep. 2017, pp. 1–5. preventing co-adaptation of feature detectors.’’ [Online]. Available:
[11] J. Mao et al., ‘‘Detecting phishing websites via aggregation analysis of https://arxiv.org/abs/1207.0580
page layouts,’’ Procedia Comput. Sci., vol. 129, pp. 224–230, Jan. 2018. [36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
[12] J. Mao, W. Tian, P. Li, T. Wei, and Z. Liang, ‘‘Phishing-alarm: Robust and R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks
efficient phishing detection via page component similarity,’’ IEEE Access, from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958,
vol. 5, no. 99, pp. 17020–17030, Aug. 2017. Jun. 2014.
[13] J. Cao, D. Dong, B. Mao, and T. Wang, ‘‘Phishing detection method [37] D. P. Kingma and J. Ba. (Dec. 2014). ‘‘Adam: A method for stochastic
based on URL features,’’ J. Southeast Univ.-Engl. Ed., vol. 29, no. 2, optimization.’’ [Online]. Available: https://arxiv.org/abs/1412.6980
pp. 134–138, Jun. 2013. [38] A Open-Content Directory of World Wide Web Links. Accessed:
[14] S. C. Jeeva and E. B. Rajsingh, ‘‘Phishing URL detection-based feature Oct. 12, 2018. [Online]. Available: https://dmoztools.net/
selection to classifiers,’’ Int. J. Electron. Secur. Digit. Forensics, vol. 9,
no. 2, pp. 116–131, Jan. 2017.
PENG YANG received the Ph.D. degree from
[15] A. Le, A. Markopoulou, and M. Faloutsos, ‘‘PhishDef: URL names say it
all,’’ in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), Sep. 2010,
Southeast University, in 2006. He was with
pp. 191–195. CERN as a Research Scientist to take part in the
[16] R. Verma and K. Dyer, ‘‘On the character of phishing URLs: Accurate and Alpha Magnetic Spectrometer Experiment from
robust statisticalal learning classifiers,’’ in Proc. 5th ACM Conf. Data Appl. 2007 to 2009, which was led by Nobel Laureate
Secur. Priv. (ACM CODASPY), Mar. 2015, pp. 111–122. Prof. S. Ting. He is currently an Associate Pro-
[17] Y. Li, S. Chu, and R. Xiao, ‘‘A pharming attack hybrid detection fessor with the School of Computer Science and
model based on IP addresses and Web content,’’ Optik, vol. 126, no. 2, Engineering, Southeast University. His research
pp. 234–239, Nov. 2014. interests include new generation Internet architec-
[18] G. G. Xiang and J. Hong, ‘‘A hybrid phish detection approach by identity ture, network security, and cyber content gover-
discovery and keywords retrieval,’’ in Proc. Int. Conf. World Wide Web nance. He is the Deputy Director of the Future Network Research Center,
(WWW), Oct. 2009, pp. 571–580 Southeast University, and a member of the National Technical Committee of
[19] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, ‘‘CANTINA+: A feature- Standardization Administration of China.
rich machine learning framework for detecting phishing Web sites,’’ ACM
Trans. Inf. Syst. Secur., vol. 14, no. 2, p. 21, Sep. 2011.
[20] S. Marchal, K. Saari, N. Singh, and N. Asokan, ‘‘Know your phish: Novel GUANGZHEN ZHAO received the B.S. and M.S.
techniques for detecting phishing sites and their targets,’’ in Proc. IEEE degrees from the School of Information Science
36th Int. Conf. Distrib. Comput. Syst. (ICDCS), Jun. 2016, pp. 323–333. and Technology, Shijiazhuang Tiedao University,
[21] R. Patil, B. D. Dhamdhere, K. S. Dhonde, and R. G. Chinchwade, China, in 2015 and 2018, respectively. He is cur-
‘‘A hybrid model to detect phishing-sites using clustering and Bayesian rently pursuing the Ph.D. degree with the School
approach,’’ in Proc. IEEE Int. Conf. Converg. Technol. (I2CT), Apr. 2014, of Computer Science and Engineering, Southeast
pp. 1–5. University. His current research interests include
[22] M. Arab and M. K. Sohrabi, ‘‘Proposing a new clustering method to detect network security, situation awareness, and new
phishing websites,’’ Turkish J. Electr. Eng. Comput. Sci., vol. 15, no. 1, generation Internet architecture.
pp. 92–95, Jun. 2015.
[23] A. Shinde, A. Pandey, R. Pawar, and V. Gangule, ‘‘Clustering and Bayesian
approach-based model for detection of phishing,’’ Int. J. Comput. Appl.,
vol. 118, no. 24, pp. 30–33, May 2015. PENG ZENG received the B.S. degree from the
[24] X. Zhang, Z. Yan, H. Li, and G. Geng, ‘‘Research of phishing detection Wuhan University of Science and Technology,
technology,’’ Chin. J. Netw. Inf. Secur., vol. 3, no. 7, pp. 7–24, Jul. 2017. Wuhan, China, in 2015, and the M.S. degree from
[25] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, ‘‘Beyond blacklists: Southeast University, Nanjing, China, in 2018.
Learning to detect malicious web sites from suspicious URLs,’’ in Proc. His research interests include Web security,
15th ACM SIGKDD Int. Conf. Knowl. Discover Data Mining (KDD), cloud computing, and new generation Internet
Jan. 2009, pp. 1245–1254. architecture.
[26] R. M. Mohammad, F. Thabtah, and L. Mccluskey, ‘‘Predicting phishing
websites based on self-structuring neural network,’’ Neural Comput. Appl.,
vol. 25, no. 2, pp. 443–458, Aug. 2014.