

Anti-phishing System using LSTM and CNN

Conference Paper · November 2020


DOI: 10.1109/INOCON50539.2020.9298298



2020 IEEE International Conference for Innovation in Technology (INOCON)
Bengaluru, India. Nov 6-8, 2020

Anti-phishing System using LSTM and CNN


Yazhmozhi. V.M
Dept. of Computer Applications
National Institute of Technology, Tiruchirappalli, India
205219027@nitt.edu

B. Janet
Dept. of Computer Applications
National Institute of Technology, Tiruchirappalli, India
janet@nitt.edu

Srinivasulu Reddy
Dept. of Computer Applications
National Institute of Technology, Tiruchirappalli, India
usreddy@nitt.edu

Abstract— Users increasingly prefer e-banking and e-shopping because of the exponential growth of the internet. Because of this paradigm shift, hackers are finding umpteen ways to steal personal information and critical details, such as debit and credit card details, by disguising themselves as reputed websites, often just by changing the spelling or making minor modifications to the URL. Identifying whether a URL is benign or malicious is a challenging job, because phishing exploits the weakness of the user. While several works have been carried out to detect phishing websites, they use only heuristic methods and list based techniques and therefore could not prevent phishing effectively. In this paper, an anti-phishing system is proposed to protect users. It uses an ensemble model that combines LSTM and CNN, trained on a massive, balanced data set containing nearly 2,00,000 URLs. After analyzing the accuracy of different existing approaches, it was found that the ensemble model that uses both LSTM and CNN performed better, with an accuracy of 96% and a precision of 97%, which is far better than the existing solutions.

Keywords— Phishing, CNN, RNN, Deep Learning, Classification, Security

I. INTRODUCTION
Because of the internet's tremendous growth, many users have shifted to electronic modes of shopping and banking. However, the internet system has little control over this activity, and that has led to many different types of attacks. Hackers have started to fool users by sending, through SMS or mail, a website URL that looks similar to that of a banking or shopping site. When the user unknowingly clicks on it and carries out transactions, the hacker can steal a lot of critical information and can even go as far as installing malware.

Phishing is a cybercrime in which hackers attempt to steal sensitive information such as user names, passwords and debit/credit card information by sending users a mail or SMS containing a URL that looks similar to the URL of an organization or bank. Users may be fooled by it and submit all their important details on the website, thinking that it is legitimate. Phishing has been a threat all over the world, leading to the loss of billions of dollars.

Initially, many works used a separate black list to store phishing URLs: whenever a user tries to open a URL that is in the black list, it is blocked by the system. Then several rule based approaches were used to detect malicious websites. Machine learning methods were used after list based and rule based approaches to improve the efficiency and performance of detecting malicious websites. Many works have used machine learning algorithms such as SVM, decision tree, and random forest classifiers, varying the features used for model training. Among machine learning techniques, natural language based features extracted from the URL gave better accuracy than other features. But feature extraction requires a lot of domain knowledge and is highly time consuming.

In this paper, a deep learning solution is proposed to detect phishing websites. The URLs are preprocessed and then fed into the input layer of the model. The dimension of an input instance is 75x32. The final solution was obtained by training the model while tuning the hyper parameters and choosing the best performing model. The model was trained for 200 epochs, as the validation error increases after that. So, the training was stopped after 200 epochs and the model was tested. The achieved accuracy is 96% and the precision for the legitimate and phishing classes is 95% and 97% respectively, which is far better than the existing solutions. Section III provides a detailed description of the proposed work and the model architecture. The proposed model gave good precision, recall and accuracy. The precision in identifying malicious websites was 97%.

II. RELATED WORK

A. Black list approaches:
List based systems for detecting malicious URLs use a white list, which is a collection of valid URLs, and a black list, which is a collection of malicious URLs. When a user enters a URL, the system compares the entered URL with the preset URL list. If the entered URL is available in the list of valid URLs, the user is allowed to access the page. Many systems that use a list based approach have been suggested, using a number of different methods to recognize malicious URLs. While this method effectively recognizes many malicious URLs, it has one big downside: the list needs to be updated frequently. [1] To overcome this downside, they constantly revised the lists of URLs. The system proposed in [1] also functions efficiently for blind users, and allows them to access sites for currency transactions etc.

That framework therefore instantiates a white list, which at first has no URLs in it. It subsequently continues to update the list, by doing thorough verification of all the specifications of the filter.

B. Image based techniques
Malicious websites have also been identified with the aid of visual correlations. This is accomplished by breaking the URL into various parts, with the help of visual indicators. The visual similarity is measured at different levels, such as block level, form level, and layouts. [2]

C. Machine learning techniques
Buber et al. suggested a method that encompasses nearly two hundred word vectors and seventeen features that are based on natural language. However, this framework did not give a generalised approach. [3] CANTINA is a

978-1-7281-9744-9/20/$31.00 ©2020 IEEE

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on August 06,2021 at 06:26:34 UTC from IEEE Xplore. Restrictions apply.
system that is used to detect malicious websites by making use of the TF-IDF technique for feature extraction. The extracted terms are used to construct a Google query. If the particular URL is present in the search results, then the URL is legitimate; otherwise it is malicious. [4] An upgraded version of CANTINA is CANTINA+. It contained 15 attributes which were based on HTML. An accuracy of 92 percent was obtained. [5]

Numerous systems that detect malicious URLs using machine learning algorithms have been proposed. PILFER is a model for identifying malicious URLs in which ten features, primarily those used to confuse and cheat users, were selected. The gathered data had only 7,860 URLs. They used the SVM algorithm to detect malicious URLs. As the data set was comparatively small and only one algorithm was tried, the model's accuracy was 92%. [6]

Moghimi et al. proposed a system that also used an SVM model for the classification task. This also uses SVM DT to figure out which of the concealed phishing details really is in the URL. They used a very big collection of data to train the model. They reached a 99 percent true positive rate (TPR) and a 0.1 percent false negative rate. The only downside of this method is its assumption that only legitimate websites can take the contents of a phishing page, which is not always accurate. [7]

Chiew et al. developed a feature selection function called HEFS, which stands for hybrid ensemble feature selection. HEFS uses a function approximation for the cumulative distribution of gradients to identify primary features. Subsequently, it uses an ensemble technique to identify secondary features. They then effectively improved the efficiency of HEFS by combining it with a Random Forest classifier. [8]

D. Deep learning techniques

A system called URLNet was proposed by Le et al. It applied CNNs to character level and word level embeddings of the URL. For the word level CNN, features were used. The data set used was massive: 5 million URLs for training and 10 million URLs for testing. They found that the concatenated character and word level CNN performed better than the character level or word level CNN alone [9].

Yi et al. used DBN (Deep Belief Networks) to detect phishing websites. They showed that the TPR (True Positive Rate) increases as the number of hidden units increases. After many comparisons, they used a DBN with 40 hidden units, with which they achieved a TPR of 90% [10].

Nivaashini used auto encoders to detect phishing websites. They found a way to represent the URL using auto encoders. Their results show a lower false positive rate than the other works using auto encoders [11].

To detect phishing URLs using a combination of CNN and LSTM, Adebowale et al. used the image, frame and text content of the website. This is the first work that uses a deep learning algorithm to find the best combined text, image and frame-based solution. Using LSTM on text and frame content and CNN on images, the proposed framework uses two DL layers to identify phishing websites. The model can thus easily explore the diversity of the terms contained in the Universal Resource Locator (URL) of the website, and even the pictures on the website. The system's accuracy is 93.28% [12].

Zhao et al. reported that, with no need for manual feature development, gated recurrent neural networks consistently outperformed the random forest approach with well-selected features. They conclude that it is an effective and proactive identification method that is more sophisticated in the cyber era. [13]

In another approach, the title tag information of the website is entered into the Baidu search engine as search terms, and the webpage is deemed legitimate if the website domain matches the domain name of one of the first ten search results; otherwise additional analysis is carried out. Finally, logistic regression is used to evaluate the leftover pages to improve the robustness and accuracy of the identification process. The test findings show that the system can filter 61.9 percent of genuine web pages and classify 22.9 percent of phishing web pages based on the URL's lexical information. [14]

III. PROPOSED SYSTEM
The proposed work was broken down into 6 phases:
1. Data set construction
2. URL pre-processing
3. Deciding the model architecture
4. Training the model
5. Testing the model
6. Model deployment

A. Data set Construction
The performance of deep learning models is highly dependent on the size of the data set. We collected phishing URLs from PhishTank and VirusTotal. The legitimate URLs were collected from various sources using the Yandex search API. We collected around 97,400 phishing URLs and 97,400 legitimate URLs, so the final data set had 1,94,800 URLs. To keep the data set balanced, nearly equal numbers of phishing and legitimate URLs were collected. The plot in Fig. 1 shows the number of phishing and legitimate URLs in the data set. The label 0 indicates phishing URLs and 1 indicates legitimate URLs.

Fig. 1. Data distribution plot

B. URL pre-processing
The collected URLs cannot be fed directly as inputs to the deep learning model, so pre-processing the URLs is an essential step. The number of characters in the URL is limited to 75. If the URL length is more than 75, the excess characters are removed. And, if the length of the URL is less than 75, then the required number of zeros are

padded to it to make the length 75. So, now we have an input matrix X of dimension 1,94,800x75.

C. Model Architecture
The proposed model is an ensemble model of both CNN and RNN. The performance of a neural network model depends on the size of the data set and on hyper parameters such as batch size, number of hidden layers, activations used, and drop out to prevent over fitting. The model architecture, visualized using Graphviz [15], an open source graph visualization software, is shown in Fig. 2. The preprocessed input URL is passed to an embedding layer, which is the first hidden layer in the model. In particular, we use an embedding layer to reduce the feature space. The output dimension of the embedding layer is 75x32. Then a dropout of 0.25 was applied. Then three convolution layers with 256 filters and kernel sizes of 5, 6 and 7 were introduced. The activation function used was ReLU (Rectified Linear Unit). The pooling technique used was max pooling with a size of 4. Then, again to avoid over fitting, dropout was introduced at the end of those layers. Next, an LSTM (Long Short Term Memory) layer was introduced, which is where the concept of ensemble modeling comes into effect. The LSTM layer had 32 neurons and used the tanh activation function. The last layer was a fully connected layer with a sigmoid activation function, as we have a binary classification problem. Finally, the Adam optimizer was used with a learning rate of 1e-4 and the model was compiled. The cross entropy loss function was used, as it is well suited to binary classification problems.
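To make the output layer and loss in Section III-C concrete, the following is a minimal sketch of a sigmoid output and the binary cross entropy loss for a single example. The raw score of 2.0 is a made-up value used only for illustration, and the function names are hypothetical; this is not the authors' code.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true: int, p: float, eps: float = 1e-12) -> float:
    """Loss for one example; y_true is 0 (phishing) or 1 (legitimate)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so that log(0) never occurs
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


score = 2.0                   # hypothetical raw output of the final dense unit
p_legitimate = sigmoid(score)  # probability assigned to label 1

# The loss is small when the true label agrees with the prediction
# and large when it does not.
loss_if_legitimate = binary_cross_entropy(1, p_legitimate)
loss_if_phishing = binary_cross_entropy(0, p_legitimate)
```

With the label convention of Section III-A (0 for phishing, 1 for legitimate), a confident prediction of "legitimate" is penalized lightly when correct and heavily when the URL is in fact phishing.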

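The pre-processing in Section III-B (truncate each URL to 75 characters, zero-pad shorter ones) can be sketched as below. The paper does not state how characters are mapped to integers or on which side the zeros are added, so the character vocabulary, the right-padding, and the names `build_vocab` and `encode_url` are assumptions made for this illustration.

```python
MAX_LEN = 75  # URL length limit stated in Section III-B


def build_vocab(urls):
    """Map each distinct character to an integer id starting at 1.

    Id 0 is reserved for padding (assumed encoding, not from the paper).
    """
    chars = sorted({ch for url in urls for ch in url})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode_url(url, vocab, max_len=MAX_LEN):
    """Truncate to max_len characters, map to ids, then right-pad with zeros."""
    # Characters missing from the vocabulary fall back to 0 as a simplification.
    ids = [vocab.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Hypothetical URLs: one shorter and one longer than 75 characters.
urls = ["http://example.com/login", "http://example.com/" + "a" * 100]
vocab = build_vocab(urls)
X = [encode_url(u, vocab) for u in urls]  # every row has length 75
```

Applied to the full data set, this would yield the 1,94,800x75 input matrix X described above, with one row per URL.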
D. Training the model


The constructed data set is split into training and test sets using an 80/20 train/test split. The validation split was chosen as 0.2. So, the training set had 1,55,840 URLs and the test set had 38,960 URLs. Hyper parameters such as the learning rate, the total number of hidden layers, drop out, the number of epochs and the batch size are tuned, and the model is trained using the training set. The number of epochs is set to 200 by observing the point up to which the validation error decreases. Fig. 3 plots the validation accuracy of the model during training and testing over the epochs, and Fig. 4 illustrates the corresponding validation loss.
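The split sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes that the 0.2 validation split is carved out of the training set (the paper does not say so explicitly) and uses the batch size of 32 reported for model fitting; the variable names are illustrative.

```python
import math

total_urls = 194_800      # 97,400 phishing + 97,400 legitimate URLs
test_percent = 20         # 80/20 train/test split
validation_percent = 20   # assumed to be taken from the training set
batch_size = 32           # batch size reported for fitting the model

test_size = total_urls * test_percent // 100
train_size = total_urls - test_size

val_size = train_size * validation_percent // 100
fit_size = train_size - val_size  # URLs actually used for weight updates

steps_per_epoch = math.ceil(fit_size / batch_size)

print(train_size, test_size, val_size, fit_size, steps_per_epoch)
```

Under these assumptions, each of the 200 epochs would process 1,24,672 URLs for fitting and hold out 31,168 for validation.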

Fig. 3. Number of epochs vs. accuracy

Fig. 2. Model diagram


Fig. 4. Number of epochs vs. loss
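The epoch count in Section III-D was chosen by observing where the validation error stops decreasing. One common way to automate that kind of inspection is sketched below; the patience-based rule and the loss values are illustrative assumptions, not the procedure or the numbers reported by the authors.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; if that never happens, the last epoch is returned.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses: improving for four epochs, then degrading.
history = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.41]
chosen = best_epoch(history)
stopped = early_stop_epoch(history, patience=3)
```

Keeping the weights from `chosen` rather than from `stopped` avoids retaining a model that has already started to over fit.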

It is clear from Fig. 3 and Fig. 4 that the validation accuracy increases and the loss decreases after each epoch. Therefore, the training was stopped after 200 epochs and the final cross-validation accuracy was 96.02%. The batch size used to fit the model was 32.

E. Testing the model
The trained model was tested with the test set consisting of 38,960 URLs and the predictions were made. The classification chart showing the accuracy, precision, recall and F1 score is shown in Fig. 5. It can be seen that the precision in finding phishing URLs was 97% and the accuracy was 96%. The model has a high recall and high precision, and hence it proves to be a very good classifier.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = 2 * ((Precision * Recall) / (Precision + Recall))
where,
TP - True positives
FP - False positives
FN - False negatives

Fig. 5. Classification chart

The confusion matrix of the classifier is visualized in Fig. 6. It was observed that the proposed ensemble model outperformed the model which combines CNN and RNN with images and text discussed in related work [12].

F. Model deployment
The ensemble model with a precision of 97% has been deployed with a simple user interface, wherein if we type a URL and hit the enter button, a pop up window appears stating whether the entered URL is benign or malicious/phishing. Sample screenshots of the deployed model's UI while checking a benign and a malicious URL are shown in Fig. 7 and Fig. 8.

Fig. 7. UI showing benign URL

Fig. 8. UI showing phishing URL
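The evaluation metrics defined in Section III-E can be computed from true and predicted labels as sketched below. The label convention (0 for phishing, 1 for legitimate) follows Section III-A; the ten example labels are made up for illustration and are not the paper's test results, and accuracy is added using its standard definition.

```python
PHISHING, LEGITIMATE = 0, 1  # label convention from Section III-A


def confusion_counts(y_true, y_pred, positive=PHISHING):
    """Count TP, FP, FN, TN, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy; assumes non-zero denominators."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical labels for ten URLs.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
precision, recall, f1, accuracy = scores(*counts)
```

Calling `confusion_counts` with `positive=LEGITIMATE` gives the per-class figures for the legitimate class, which is how separate precision values for the two classes can be reported.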

Fig. 6. Confusion matrix

IV. CONCLUSION AND FUTURE WORK

The proposed system, which is an ensemble model of RNN and CNN, performs well with very high precision, recall and accuracy. The model was also deployed with a simple UI, and it can also be used as a plug-in component in web browsers. The model was trained for 200 epochs, so training takes a lot of time. It was observed that this ensemble model outperforms many of the existing works done using machine learning techniques. One more advantage is that the time consuming feature extraction process was not needed at all. This work can be extended by using bidirectional LSTM and CNN or trying other types of ensemble models.

REFERENCES
[1] Y. Cao, W. Han, and Y. Le, "Anti-phishing based on automated individual white-list," Proceedings of the 4th ACM Workshop on Digital Identity Management (DIM '08), pp. 51-60, ACM, 2008.
[2] T. Moore, R. Clayton, and H. Stern, "Temporal correlations between spam and phishing websites," Proceedings of the 2nd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More (LEET'09), USENIX Association, Berkeley, CA, USA, pp. 5-5, 2009.
[3] E. Buber, B. Dırı, and O. K. Sahingoz, "Detecting phishing attacks from URL by using NLP techniques," International Conference on Computer Science and Engineering (UBMK), pp. 337-342.
[4] Y. Zhang, J. Hong, and L. Cranor, "CANTINA: A Content-Based Approach to Detecting Phishing Web Sites," Proceedings of the 16th International Conference on World Wide Web, pp. 639-648.
[5] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, "CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites," ACM Transactions on Information and System Security (TISSEC), vol. 14, no. 2, Article No. 21, September 2011.
[6] N. Sadeh, A. Tomasic, and I. Fette, "Learning to detect phishing emails," Proceedings of the 16th International Conference on World Wide Web, pp. 649-656, 2007.
[7] M. Moghimi and A. Y. Varjani, "New rule-based phishing detection method," Expert Systems with Applications, vol. 53, pp. 231-242, 2016.
[8] K. L. Chiew, C. L. Tan, K. S. Wong, S. C. Y. Kelvin, and K. T. Wei, "A new hybrid ensemble feature selection framework for machine learning-based phishing detection system," Information Sciences, vol. 484, pp. 153-166, 2019.
[9] H. Le, Q. Pham, D. Sahoo, and S. C. Hoi, "URLnet: Learning a URL representation with deep learning for malicious URL detection," arXiv preprint arXiv:1802.03162, 2 March 2018.
[10] P. Yi, Y. Guan, F. Zou, Y. Yao, W. Wang, and T. Zhu, "Web phishing detection using a deep learning framework," Wireless Communications and Mobile Computing, vol. 2018, pp. 1-9, 2018.
[11] M. Nivaashini and R. S. Soundariya, "Deep stacked autoencoder based feature representation for phishing URLs detection," Journal of Advanced Research in Dynamical and Control Systems, vol. 9, no. 6, pp. 904-916, 2017.
[12] M. A. Adebowale, K. T. Lwin, and M. A. Hossain, "Deep Learning with Convolutional Neural Network and Long Short-Term Memory for Phishing Detection," 2019 13th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), Island of Ulkulhas, Maldives, 2019, pp. 1-8.
[13] J. Zhao, N. Wang, Q. Ma, and Z. Cheng, "Classifying Malicious URLs Using Gated Recurrent Neural Networks," in L. Barolli, F. Xhafa, N. Javaid, and T. Enokido (eds), Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS 2018), Advances in Intelligent Systems and Computing, vol. 773, Springer, Cham, 2019.
[14] Y. Ding et al., "A keyword-based combination approach for detecting phishing webpages," Comput. Secur., vol. 84, pp. 256-275, 2019.
[15] Graphviz, https://www.graphviz.org/download/
