
Phishing URL Detection using ML

PROJECT REPORT

Submitted by

18BCE0256 - Y. Krishna Chaitanya


18BCE0257 - M. Raja Sekhar
19BCE0512 - S. Venkata Phani Kumar

SUBMITTED FOR THE COURSE

CSE4003 Cyber Security


(Slot E1)

Computer Science and Engineering

Under the guidance of

Prof. Manikandan. K
TABLE OF CONTENTS

CHAPTER NO. TITLE

1 Abstract
2 Introduction
3 Literature Review
4 Problem Formulation
5 Objectives and Algorithm
6 Methodology
7 Flow Diagrams
8 Implementation
9 Results and Discussion
10 Conclusion and Future Work
11 Code Snippets
12 Appendix
13 References

1. Abstract :

Phishing is one of the major threats of the internet era. Phishing is a deceptive process in which a legitimate website is cloned and victims are lured to the fake website to provide their personal and confidential information, which sometimes proves costly. Although most websites display disclaimer warnings about phishing, users tend to ignore them, and beyond such warnings there is little the websites themselves can do. Since phishing has persisted for a long time, many approaches have been proposed to detect phishing websites, but very few of them accurately identify the target websites of these phishing attacks.

In our proposed method we identify phishing websites using a combined approach: constructing Resource Description Framework (RDF) models and applying ensemble learning algorithms to classify websites. Our approach uses supervised learning techniques to train the system, and it achieves a promising true positive rate of 98.8%. Because we use a random forest classifier, which can handle missing values in the dataset, we were able to reduce the system's false positive rate to 1.5%. As our system exploits the strengths of RDF and ensemble learning together, it achieves a highly promising accuracy of 98.68%.

2. Introduction:

As COVID-19 spreads around the world, it is clear that the use of the web and online services is accelerating, confirming the importance of this technology in our modern world.

One of the most widely recognized online security threats is the phishing attack. The purpose of this fraud is to imitate a real website, for example internet banking, e-commerce, or social networking, in order to acquire confidential data such as usernames, passwords, and financial and health-related information from potential victims.

3. Literature Review :

1. Detection of phishing URLs using Machine Learning techniques

Objectives: Classification models that detect phishing websites through the analysis of lexical and host-based features of URLs. The authors evaluated different classification algorithms in the Waikato Environment for Knowledge Analysis (WEKA) workbench and in MATLAB.

Limitations: The primary advantage of blacklists is that querying is a low-overhead operation: the lists of malicious sites are precompiled, so the only computational cost of deployed blacklists is the lookup overhead. However, the need to construct these lists in advance means that blacklists become stale. Network administrators block existing malicious sites, and enforcement efforts take down the criminal enterprises behind them, so there is constant pressure on criminals to construct new sites and find new hosting infrastructure. As a result, new malicious URLs keep being introduced and blacklist providers must update their lists yet again. In this process the criminals stay ahead, because website construction is inexpensive; free blog services (e.g., Blogger) and personal hosting (e.g., Google Sites, Microsoft Live Spaces) provide another cheap source of disposable sites.

2. Detection and Prevention of Phishing Websites using Machine Learning

Objectives: The paper discusses three approaches for detecting phishing websites: first, analysing various features of the URL; second, checking the legitimacy of a website by determining where it is hosted and who manages it; third, visual-appearance-based analysis to check the genuineness of the website. Machine learning techniques and algorithms are used to evaluate these different URL and website features.

Limitations: The system produces some false positive and false negative results. These drawbacks could be eliminated by feeding much richer features to the machine learning algorithm, which would yield much higher accuracy.

3. Phishing Detection: A Recent Intelligent Machine Learning Comparison based on Models Content and Features

Objectives: The paper critically analyses recent studies on phishing detection based on ML techniques, showing how these ML approaches derive their classification models along with their advantages and disadvantages. More importantly, it investigates eight ML techniques in depth on real phishing datasets and performs thorough comparisons, aiming to determine a suitable anti-phishing approach based on both model content and detection rate.

Limitations: Decision trees, Bayes Net, and SVM achieved good detection rates. However, the models extracted by decision trees contain very large amounts of information, which may overwhelm novice users and security experts and thus be hard to manage or understand. Moreover, Bayes Net and SVM showed good performance with respect to accuracy, yet their models are hard for end-users to interpret.

4. Problem Formulation :

Most of the previously existing methodologies implement a blacklisting and whitelisting process, in which a URL is compared against existing lists of URLs and a decision is taken based on the list the URL belongs to.

In our project we extract the URL's features and build a model using ML; using that model we predict the input URL's legitimacy.

5. Objectives and Algorithm :

The main objective of this project is to predict whether a given URL is phishing or safe.

5.1. Decision Trees (DTs) :


Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

Merits :
● Simple to understand and to interpret. Trees can be visualised.

● Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values. Note, however, that many decision-tree implementations do not support missing values.
● DT can handle both numerical and categorical data.
● Decision trees provide a clear indication of which fields are most
important for prediction or classification.

Demerits :

● Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
● Decision trees can be unstable because small variations in the data
might result in a completely different tree being generated.
● Decision trees can be computationally expensive to train.
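A minimal decision-tree sketch, assuming scikit-learn and a toy two-feature matrix (url_length, digit_count are illustrative stand-ins, not the project's actual dataset). It shows the max_depth control mentioned above and how the learned rules can be printed as readable text:

```python
# Toy decision-tree example (illustrative features, not the Phishcoop data).
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [url_length, digit_count]; label 1 = phishing.
X = [[75, 12], [80, 9], [20, 0], [25, 1], [90, 15], [18, 0]]
y = [1, 1, 0, 0, 1, 0]

# max_depth caps tree growth, one of the pruning-style controls for
# avoiding the overfitting described above.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# The tree's decision rules can be exported as human-readable text.
rules = export_text(clf, feature_names=["url_length", "digit_count"])
print(rules)
print(clf.predict([[85, 10]]))  # a long, digit-heavy URL → phishing class
```

On this separable toy data the tree learns a single length/digit threshold, which illustrates both the interpretability merit and the instability demerit (a few changed samples would move the threshold).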

5.2. Logistic Regression :

Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used to predict a categorical dependent variable from a given set of independent variables. Because logistic regression predicts a categorical dependent variable, the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.

On the basis of the categories, Logistic Regression can be classified into


three types:

● Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.

● Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".

● Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
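A short binomial logistic-regression sketch, assuming scikit-learn and a hypothetical one-feature toy dataset (the "suspicion score" feature is invented for illustration). It shows both the discrete class output and the probabilistic value between 0 and 1 described above:

```python
# Binomial logistic regression on toy data (illustrative, not project code).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.2], [0.8], [0.9], [0.15], [0.85]])  # hypothetical score
y = np.array([0, 0, 1, 1, 0, 1])                            # 1 = phishing

model = LogisticRegression()
model.fit(X, y)

# predict() returns the discrete class; predict_proba() returns the
# probabilities between 0 and 1 discussed above.
print(model.predict([[0.75]]))        # class label
print(model.predict_proba([[0.75]]))  # [P(class 0), P(class 1)]
```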

5.3. SVM - Support Vector Machine :

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms and is used for both classification and regression problems. However, it is primarily used for classification problems in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, hence the name Support Vector Machine.

SVM is classified into two types:

● Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a linear SVM classifier.

● Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a non-linear SVM classifier.
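A sketch of both SVM variants, assuming scikit-learn and toy data invented for illustration: a linearly separable 1-D set for the linear kernel, and an XOR-style pattern (which no single straight line can separate) for the RBF kernel:

```python
# Linear vs non-linear SVM on toy data (illustrative, not project code).
from sklearn.svm import SVC

# Linearly separable 1-D data: one threshold splits the classes.
X_lin = [[0], [1], [2], [8], [9], [10]]
y_lin = [0, 0, 0, 1, 1, 1]
linear = SVC(kernel="linear").fit(X_lin, y_lin)
print(linear.predict([[7]]))  # falls on the class-1 side of the boundary

# XOR pattern: not separable by a straight line, so use the RBF kernel.
X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_xor = [0, 1, 1, 0]
rbf = SVC(kernel="rbf", gamma=2.0).fit(X_xor, y_xor)
print(rbf.predict(X_xor))  # the non-linear kernel fits the XOR pattern
```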

5.4. Random Forest :

Random forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the performance of the model.

A random forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and combines their outputs to improve predictive accuracy. Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, produces the final output.

A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting.

Assumptions for Random Forest:

Since the random forest combines multiple trees to predict the class of the dataset, some decision trees may predict the correct output while others may not; together, however, the trees predict the correct output. Two assumptions therefore underlie a good random forest classifier:

● The feature variables of the dataset should contain real signal, so that the classifier can predict accurate results rather than guesses.

● The predictions from the individual trees must have very low correlations with one another.

Advantages of Random Forest:

● Random Forest is capable of performing both Classification and

Regression tasks.

● It is capable of handling large datasets with high dimensionality.

● It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest:

● Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

● The computational cost of training and prediction can be high.
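A random-forest sketch, assuming scikit-learn and a synthetic stand-in for the URL-feature data (make_classification is used here only because the real Phishcoop dataset is not reproduced in this section). It shows the n_estimators vote count, held-out accuracy, and the feature-importance ranking mentioned under the decision-tree merits:

```python
# Random forest on synthetic data (stand-in for the project's dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 8 hypothetical features, 4 of them informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of trees whose majority vote gives the prediction.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)

print(rf.score(X_te, y_te))              # held-out accuracy
print(rf.feature_importances_.argmax())  # index of the most useful feature
```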

6. Methodology :

Phishing Domain Detection using Feature Engineering. There are many algorithms and a wide variety of data types for phishing detection in the academic literature and in commercial products. A phishing URL and the corresponding page have several features that differentiate them from legitimate ones. For example, an attacker can register a long and confusing domain to hide the actual domain name (cybersquatting, typosquatting). In some cases attackers use direct IP addresses instead of a domain name; this type of event is out of our scope, though it can be used for the same purpose. Attackers can also use short domain names that are irrelevant to legitimate brand names and lack any FreeUrl addition, but such websites are also out of our scope, because they are more relevant to fraudulent domains than to phishing domains.

Besides URL-based features, other kinds of features used by machine learning algorithms in the detection process of academic studies are also employed. The features collected from academic studies for phishing domain detection with machine learning techniques are grouped as given below.

1. URL-Based Features
2. Domain-Based Features
3. Page-Based Features
4. Content-Based Features

URL-Based Features. The URL is the first thing to analyse when deciding whether a website is phishing or not. As mentioned before, the URLs of phishing domains have some distinctive points, and features related to these points are obtained when the URL is processed. Some URL-based features are given below.
1. Digit count in the URL
2. Total length of URL
3. Checking whether the URL is Typosquatting or not. (google.com
→ goggle.com)
4. Checking whether it includes a legitimate brand name or not
(apple-icloud-login.com)
5. Number of subdomains in URL
6. Is the Top-Level Domain (TLD) one of the commonly used ones?
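A few of the listed features can be sketched with only the standard library (the TLD shortlist and function name below are illustrative assumptions, not the project's actual extraction code; typosquatting detection would additionally need an edit-distance check against known brands):

```python
# Sketch of some URL-based features from the list above (stdlib only).
from urllib.parse import urlparse

COMMON_TLDS = {"com", "org", "net", "edu", "gov"}  # assumed shortlist

def url_features(url: str) -> dict:
    host = urlparse(url).netloc or url
    labels = host.split(".")
    return {
        "digit_count": sum(c.isdigit() for c in url),     # feature 1
        "url_length": len(url),                           # feature 2
        "num_subdomains": max(len(labels) - 2, 0),        # feature 5
        "common_tld": labels[-1].lower() in COMMON_TLDS,  # feature 6
    }

print(url_features("http://apple-icloud-login.com/verify?id=123"))
```

Note how the example URL embeds a legitimate brand name (feature 4) even though its registered domain is unrelated to the brand.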

There are many machine learning algorithms, and each algorithm has its own working mechanism. In this project we have explained the Decision Tree algorithm in detail because it is simple and powerful.

Modules included:
1. Data training: We used the Random Forest algorithm to train our dataset.
2. Front end and server maintenance: A localhost server is created and all the required HTML files are hosted there. This module takes care of the flow of data among the program files.
3. Extracting URL features: This module takes the URL, passes it through various filters, and extracts features such as the domain, protocol, subdomains, SSL certificate, etc.
4. Predicting the URL type: This module takes the output of the previous module, processes it, and assigns a flag value to the URL, which later helps in identifying its safety.
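The flow through modules 3 and 4 can be sketched as follows, under heavy assumptions: the two features, the training URLs, and the flag values are toy stand-ins, not the real pipeline or its feature set:

```python
# End-to-end sketch: extract features → trained model → flag value.
from sklearn.ensemble import RandomForestClassifier

def extract(url: str) -> list:
    # Two illustrative features: URL length and digit count.
    return [len(url), sum(c.isdigit() for c in url)]

# Train a stand-in model on hypothetical labelled URLs (1 = phishing).
train_urls = ["http://paypa1-secure-login.example123.com/x9",
              "http://verify-account-4421.biz/login",
              "https://github.com", "https://wikipedia.org"]
labels = [1, 1, 0, 0]
model = RandomForestClassifier(n_estimators=25, random_state=0)
model.fit([extract(u) for u in train_urls], labels)

def flag(url: str) -> str:
    # The flag value is what the front end uses to report safety.
    return "PHISHING" if model.predict([extract(url)])[0] == 1 else "SAFE"

print(flag("http://login-update-3311.example.com/acc7"))
```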

7. Flow Diagrams :

Fig1. Module flow diagram

Fig2. Model selection flow diagram

8. Implementation :

9. Results and Discussion :

● System info :

○ Hardware specifications :
- Intel Core processor (i5 recommended)
- Memory: 2 GB (4 GB recommended)
- Disk space: 1 GB (>1 GB recommended)

○ Software specifications :
- Windows OS
- Python 3 with required modules
- Jupyter Notebook (Google Colab recommended)
- IDE for the front-end code (VS Code recommended)
- Browser (Chrome recommended)
- Modules supporting server maintenance

● Dataset :
- Phishcoop.csv (taken from the UCI repository)
- Contains 11055 entries, each with 32 attributes
- No null entries
- 6157 positive examples, 4898 negative examples
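The dataset checks above can be sketched with pandas. Since Phishcoop.csv is not bundled with this report, a tiny stand-in DataFrame mirrors its shape (the column names and the 1/-1 label encoding are assumptions based on the UCI phishing dataset's conventions):

```python
# Stand-in for loading Phishcoop.csv and checking its basic properties.
import pandas as pd

df = pd.DataFrame({
    "having_IP_Address": [1, -1, 1, -1],
    "URL_Length":        [1, -1, 1, -1],
    "Result":            [1, -1, 1, -1],   # 1 = phishing, -1 = legitimate
})
# With the real file this would be: df = pd.read_csv("Phishcoop.csv")

print(df.shape)                     # (entries, attributes)
print(df.isnull().sum().sum())      # 0 → no null entries
print(df["Result"].value_counts())  # class balance (positive vs negative)
```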

● Input type :
- URL of a site to be verified

● We used different algorithms to generate our model:
○ Decision Tree
○ Logistic Regression
○ Support Vector Machine
○ Random Forest
Of these, Random Forest gives the highest accuracy score of 98.6%, so we generated finalised_model.pkl, which predicts the input URLs.

● RF achieved an accuracy of 98.6% during testing; after storing the predicted values and testing again, it gave an accuracy of 99.4%.
● The model generated by RF is saved and passed to a validation.py file to predict the input URL's legitimacy.
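The save-then-validate step can be sketched with the standard pickle module (the toy one-feature model below is a stand-in; the real finalised_model.pkl holds the trained random forest):

```python
# Sketch of persisting the finalised model and reloading it for validation.
import pickle
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the finalised classifier.
model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# Serialise the trained model to disk, as done for finalised_model.pkl.
with open("finalised_model.pkl", "wb") as f:
    pickle.dump(model, f)

# validation.py would load the model back and predict the legitimacy
# of an input URL's feature vector.
with open("finalised_model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict([[2.5]]))  # same prediction as the original model
```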

10. Conclusion and Future work :

This project helps in the detection of phishing attacks as they are carried out against individuals and organizations. It proposes using both technological factors and the human factor to counter the threat posed by phishing attacks, and we believe it can effectively help reduce their impact on individuals.

Because of the threat posed by phishing attacks, more research should be carried out to build on existing solutions, as hackers keep creating new ways to exploit human trust. More adequate model-testing techniques should also be considered, to better validate a model before its deployment in the real world.

Future work :
Our project has some limitations in checking multiple domains and IP addresses, which we plan to overcome in the future. We then intend to package the detector as a Chrome extension and deploy it for real use.

11. Code Snippets :

Fig3. Loading dataset

Fig4. Decision tree rules generation

Fig5. Decision tree accuracy score

Fig6. Logistic regression correlation among features

Fig7. Logistic regression accuracy score

Fig8. SVM accuracy score

Fig9. Random Forest implementation

Fig10. Random Forest accuracy score

Fig11. Accuracy scores

Fig12. Accuracy comparison graph

12. Appendix :

● Colab file :
https://colab.research.google.com/drive/1ehQDur3iPhPpa2r2GArdtF5Qv
6DtRPjR?usp=sharing

● Project files :
https://github.com/Krishnachaitanya-learn/Phishing_detection4QgbIchQf
xkmOCllw4CX4X_GV?usp=sharing

13. References :

1. James, J., Sandhya, L., & Thomas, C. (2013, December). Detection of phishing URLs using machine learning techniques. In 2013 International Conference on Control Communication and Computing (ICCC) (pp. 304-309). IEEE.

2. Patil, V., Thakkar, P., Shah, C., Bhat, T., & Godse, S. P. (2018, August). Detection and prevention of phishing websites using machine learning approach. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (pp. 1-5). IEEE.

3. Abdelhamid, N., Thabtah, F., & Abdel-jaber, H. (2017, July). Phishing detection: A recent intelligent machine learning comparison based on models content and features. In 2017 IEEE International Conference on Intelligence and Security Informatics (ISI) (pp. 72-77). IEEE.
