PROJECT REPORT
Submitted by
Prof. Manikandan. K
TABLE OF CONTENTS
1 Abstract
2 Introduction
3 Literature Review
4 Problem Formulation
6 Methodology
7 Flow Diagrams
8 Implementation
11 Code Snippets
12 Appendix
13 References
1. Abstract :
2. Introduction:
3. Literature Review :
2. Detection and Prevention of Phishing Websites using Machine
Summary: In this paper, the authors discuss three approaches for
detecting phishing websites. The first analyses various features of the
URL. The second checks the legitimacy of a website by finding out where
it is hosted and who manages it. The third uses visual-appearance-based
analysis to check the genuineness of the website. Machine learning
techniques and algorithms are used to evaluate these different features
of URLs and websites.
Drawbacks: The system still produces a small number of false positive
and false negative results. These drawbacks can be eliminated by feeding
much richer features to the machine learning algorithm, which would
result in much higher accuracy.
3. Phishing Detection: A Recent Intelligent Machine Learning Comparison
based on Models Content and Features
Summary: In this paper, the authors critically analyse recent studies
on phishing in the research literature that are based on ML techniques.
They show how these ML approaches derive their classification models,
together with their advantages and disadvantages. More importantly, they
investigate eight ML techniques in depth on real datasets related to
phishing and perform thorough comparisons of these techniques. The aim of
the comparisons is to determine a suitable approach that may serve as an
anti-phishing tool, based on the model content as well as the detection
rate of phishing activities.
Drawbacks: Decision trees, Bayes Net and SVM achieved good detection
rates. However, the models extracted by decision trees contained very
large amounts of information, which may overwhelm novice users and
security experts and thus be hard to manage or understand. Moreover,
Bayes Net and SVM showed good performance with respect to accuracy, yet
their models are hard for end-users to understand.
4. Problem Formulation :
Merits :
● Simple to understand and to interpret. Trees can be visualised.
● Requires little data preparation. Other techniques often require data
normalisation, the creation of dummy variables and the removal of blank
values. Note however that this module does not
support missing values.
● Decision trees can handle both numerical and categorical data.
● Decision trees provide a clear indication of which fields are most
important for prediction or classification.
Demerits :
False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
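The probabilistic output described above can be illustrated with the logistic (sigmoid) function. The sketch below is a minimal illustration and not the project's code: the weights, the bias and the 0.5 threshold are hypothetical values chosen for the example, not values learned from the project's dataset.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a hard 0/1 label using an assumed threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical weights for two features; not learned from any real data.
weights = [1.5, -2.0]
bias = 0.25
sample = [1.0, 0.5]

p = predict_proba(sample, weights, bias)
label = predict_label(sample, weights, bias)
print(round(p, 3), label)
```

The model itself only produces the probability; the hard 0 or 1 label appears only once a threshold is applied to it.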
The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the
algorithm is called a Support Vector Machine.
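As a rough illustration of the hyperplane and support-vector idea, the sketch below classifies points by the side of a hyperplane they fall on and finds the points closest to it. The hyperplane x1 + x2 - 3 = 0 and the four points are hand-picked assumptions, not a model fitted on the project's data. In a trained maximum-margin SVM, the training points closest to the boundary play the role of support vectors.

```python
import math

def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1

def distance_to_hyperplane(x, w, b):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(x, w, b)) / norm

# Hand-picked hyperplane x1 + x2 - 3 = 0 (an assumption, not a fitted model).
w, b = [1.0, 1.0], -3.0
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (4.0, 3.0)]

labels = [classify(p, w, b) for p in points]
distances = [distance_to_hyperplane(p, w, b) for p in points]

# Points at the minimum distance are the ones that "support" the boundary.
min_dist = min(distances)
closest = [p for p, d in zip(points, distances) if math.isclose(d, min_dist)]
print(labels, closest)
```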
using a straight line, then such data is termed as non-linear data and
A greater number of trees in the forest generally leads to higher
accuracy and reduces the risk of overfitting.
Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct
output, while others may not. But together, all the trees predict the correct
output. Therefore, below are two assumptions for a better Random forest
classifier:
dataset so that the classifier can predict accurate results rather than
a guessed result.
● The predictions from each tree must have very low correlations.
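The benefit of combining many weakly correlated trees can be sketched with a simple calculation. The example below assumes an idealised case that real forests only approximate: every tree is correct with the same probability (a hypothetical 0.7) and the trees err independently of one another. Under these assumptions, the probability that the majority vote is correct follows the binomial distribution.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent voters are correct."""
    needed = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Odd tree counts avoid ties; 0.7 is an assumed per-tree accuracy.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

In this idealised setting, the majority vote becomes more reliable as trees are added. When the trees' errors are strongly correlated, the gain is smaller, which is why low correlation between trees matters.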
Regression tasks.
● It enhances the accuracy of the model and reduces the risk of overfitting.
● Although random forest can be used for both classification and regression
6. Methodology :
1. URL-Based Features
2. Domain-Based Features
3. Page-Based Features
4. Content-Based Features
phishing domains have some distinctive points. Features which are
related to these points are obtained when the URL is processed.
Some URL-based features are given below.
1. Digit count in the URL
2. Total length of URL
3. Checking whether the URL is a typosquatted domain (google.com
→ goggle.com)
4. Checking whether it includes a legitimate brand name
(apple-icloud-login.com)
5. Number of subdomains in URL
6. Whether the top-level domain (TLD) is one of the commonly used ones
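The six features listed above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the project's extraction module: the function names, the short `COMMON_TLDS` and `KNOWN_BRANDS` lists, the edit-distance threshold of 1 for typosquatting and the naive subdomain split are all hypothetical choices made for illustration.

```python
from urllib.parse import urlparse

# Assumed example lists; a real system would need far more complete ones.
COMMON_TLDS = {"com", "org", "net", "edu", "gov"}
KNOWN_BRANDS = {"google", "apple", "paypal"}

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def extract_url_features(url):
    """Compute the six URL-based features for one URL (scheme required)."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    name = labels[-2] if len(labels) >= 2 else ""
    # Naive split: everything left of "name.tld" counts as a subdomain,
    # so multi-part suffixes such as "co.uk" are not handled.
    subdomains = labels[:-2]
    return {
        "digit_count": sum(ch.isdigit() for ch in url),
        "url_length": len(url),
        # One edit away from a known brand, but not the brand itself.
        "is_typosquat": any(0 < edit_distance(name, b) <= 1 for b in KNOWN_BRANDS),
        # Brand appears in the host although the domain is not the brand's own.
        "has_brand_name": any(b in host and name != b for b in KNOWN_BRANDS),
        "subdomain_count": len(subdomains),
        "common_tld": tld in COMMON_TLDS,
    }

print(extract_url_features("http://goggle.com/login"))
print(extract_url_features("https://apple-icloud-login.com/"))
```

Each URL becomes a small dictionary of numeric and boolean values, which can then be turned into a feature vector for a classifier.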
machine learning algorithms and each algorithm has its own working
mechanism. In this project, we have explained the Decision Tree
algorithm because we consider it simple and powerful.
Modules included:
1. Data training : We used the Random Forest algorithm to train the
model on our dataset.
2. Front end and server maintenance : A localhost server is created
and all the required HTML files are hosted there. This module
handles the flow of data among the program files.
3. Extracting URL features : This module takes the URL, passes it
through various filters and extracts features such as the domain,
protocol, sub-domain and SSL certificate.
4. Predicting the URL type : This module takes the output of the
previous module, processes it and assigns a flag value to the
URL, which later helps in identifying whether the URL is safe.
7. Flow Diagrams :
8. Implementation :
9. Results and Discussion :
● System info :
○ Hardware specifications :
- Intel Core processor (i5 recommended)
- Memory : 2GB (4GB recommended)
- Disk space : 1GB (>1GB recommended)
○ Software specification :
- Windows OS
- Python3 with required modules
- Jupyter notebook (Google Colab recommended)
- IDE to code the front end (VS Code recommended)
- Browser (Chrome recommended)
- Modules supporting server maintenance
● Dataset :
- Phishcoop.csv (taken from the UCI repository)
- Contains 11055 entries, each with 32 attributes
- No null entries
- 6157 positive examples, 4898 negative examples
● Input type :
- URL of a site to be verified
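The class counts reported for the dataset give a simple reference point for judging accuracy scores. The short calculation below uses only the figures stated above. The "majority-class baseline" is an assumed comparison introduced here for illustration: it is the accuracy of a trivial classifier that always predicts the larger class.

```python
total = 11055
positive = 6157
negative = 4898

# The two class counts should cover every entry in the dataset.
assert positive + negative == total

positive_share = positive / total
negative_share = negative / total

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(positive, negative) / total

print(f"positive: {positive_share:.1%}, negative: {negative_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The split is roughly 56% to 44%, so the classes are fairly balanced. A model therefore has to score well above about 56% before its accuracy says anything about what it has learned.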
○ Decision Tree
○ Logistic Regression
○ Support-Vector Machine
○ Random Forest
Of these, Random Forest gives the highest accuracy score, 98.6%.
We therefore saved the trained model as finalised_model.pkl, which is used
to classify the input URLs.
Because of the threat posed by phishing attacks, more research
should be carried out to extend the existing knowledge and solutions.
Attackers keep devising new ways to exploit human trust.
A more rigorous technique for model testing should also be considered, so
that a model can be validated more thoroughly before its deployment in the
real world.
Future work :
Our project has some limitations in checking multiple domains and IP
addresses. We plan to overcome those limitations in the
future. We will then try to package it as a Chrome extension and
deploy it for real use.
Fig4. Decision tree rules generation
Fig6. Logistic regression correlation among features
Fig8. SVM accuracy score
Fig10. Random Forest accuracy score
Fig12. Accuracy comparison graph
12. Appendix :
● Colab file :
https://colab.research.google.com/drive/1ehQDur3iPhPpa2r2GArdtF5Qv6DtRPjR?usp=sharing
● Project files :
https://github.com/Krishnachaitanya-learn/Phishing_detection4QgbIchQf
xkmOCllw4CX4X_GV?usp=sharing
13. References :
3. Abdelhamid, N., Thabtah, F., & Abdel-jaber, H. (2017, July).
Phishing detection: A recent intelligent machine learning comparison
based on models content and features. In 2017 IEEE international
conference on intelligence and security informatics (ISI) (pp. 72-77).
IEEE.