
Detecting Phishing Website

Using
Machine Learning

INDEX
CONTENTS

1. INTRODUCTION
1.1 Overview
1.2 Purpose

2. LITERATURE SURVEY
2.1 Existing System
2.2 Proposed System

3. THEORETICAL ANALYSIS
3.1 Block Diagram
3.2 Software / Hardware

4. EXPERIMENTAL INVESTIGATIONS

5. FLOW CHART

6. RESULT

7. ADVANTAGES AND DISADVANTAGES

8. APPLICATIONS

9. CONCLUSION

10. FUTURE SCOPE

11. BIBLIOGRAPHY

12. APPENDIX
1. INTRODUCTION

1.1 Overview :

Many users purchase products online and make payments through e-banking. Some e-banking websites ask users to provide sensitive data such as a username, password or credit card details, often for malicious reasons. This type of e-banking website is known as a phishing website. The web service is one of the key communication software services for the Internet.

Web phishing is one of many security threats to web services on the Internet. Phishing is a form of identity theft that occurs when a malicious web site impersonates a legitimate one in order to acquire sensitive information such as passwords, account details, or credit card numbers. It is a deception technique that uses a combination of social engineering and technology to gather sensitive and personal information, such as passwords and credit card details, by masquerading as a trustworthy person or business in an electronic communication. Phishing makes use of spoofed emails that are made to look authentic and purport to come from legitimate sources such as financial institutions and e-commerce sites, to lure users into visiting fraudulent websites through links provided in the phishing email.

It can lead to information disclosure and property damage. This paper mainly focuses on applying machine learning algorithms to detect phishing websites.
1.2 Purpose :

Phishing is a website forgery with the intention to track and steal the sensitive information of online users. The attacker fools the user with social engineering techniques delivered through SMS, voice, email, websites and malware.

Typically a victim receives a message that appears to have been sent by a known contact or organization. The message contains malicious software targeting the user's computer, or has links to direct victims to malicious websites in order to trick them into divulging personal and financial information, such as passwords, account IDs or credit card details. To counter this we implemented an application which concentrates on the URL and website content of the phishing page.

2. LITERATURE SURVEY

2.1 Existing Problem:

In the existing system there is a process called fuzzification. In this step, linguistic descriptors such as High, Medium and Low are assigned to a range of values for each key phishing characteristic indicator. Valid ranges of the inputs are considered and divided into classes, or fuzzy sets. For example, the length of a URL address can range from 'low' to 'high' with other values in between.

We cannot specify clear boundaries between classes. The degree of belongingness of the values of the variables to any selected class is called the degree of membership. A membership function is designed for each phishing characteristic indicator; it is a curve that defines how each point in the input space is mapped to a membership value between 0 and 1.

Linguistic values are assigned for each phishing indicator as Low, Moderate and High, while the e-banking phishing website risk rate is labelled Very Legitimate, Legitimate, Suspicious, Phishy or Very Phishy.
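As a sketch of the membership functions described above, a triangular curve is one common choice. The breakpoints below are assumed for illustration only and are not taken from this report:

```python
def triangular_membership(x, a, b, c):
    """Degree of membership in a fuzzy set with a triangular curve.

    a, b and c are the left foot, peak and right foot of the triangle;
    the result is always a value in [0, 1].
    """
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Illustrative fuzzy sets for URL length (thresholds are assumed values):
low = triangular_membership(40, 0, 20, 54)      # how "Low" a 40-char URL is
high = triangular_membership(80, 54, 100, 150)  # how "High" an 80-char URL is
```

A point can belong partly to two overlapping sets at once, which is exactly why no clear boundary between classes is needed.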
2.2 PROPOSED SOLUTION :
In order to detect and predict e-banking phishing websites, we propose an intelligent, flexible and effective system that is based on classification algorithms. We implemented classification algorithms and techniques to extract the phishing dataset criteria in order to classify their legitimacy. An e-banking phishing website can be detected based on some important characteristics, such as URL and domain identity, and security and encryption criteria, in the final phishing detection rate. Once a user makes a payment through an e-banking website, our system will use a data mining algorithm to detect whether that website is a phishing website or not.

3. THEORETICAL ANALYSIS

3.1 Block Diagram :

The work consists of host-based, page-based and lexical feature extraction of collected URLs, followed by analysis. The first step is the collection of phishing and benign URLs. Host-based, popularity-based and lexical feature extraction is applied to form a database of feature values. The database is then mined using different machine learning algorithms.

After collecting the URLs, host-based features explain "where" phishing sites are hosted, "who" they are managed by, and "how" they are administered. We use these features because phishing web sites may be hosted in less reputable hosting centres, on machines that are not usual web hosts, or through less reputable registrars.

A WHOIS record gives details about the dates of registration, update and expiry, and identifies the registrar and the registrant. If phishing sites are taken down frequently, their registration dates will be newer than those of legitimate sites. A large number of phishing websites contain an IP address in their hostname, so getting the details of such hostnames, which can be obtained from the WHOIS record, helps in efforts to point to phishing sites.
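The lexical side of this pipeline can be sketched in a few lines of Python. The feature names and -1/1 coding mirror the dataset columns described later in this report, but the function and its simplified rules are an illustrative assumption, not the project's actual implementation:

```python
from urllib.parse import urlparse
import re

def lexical_features(url):
    """Extract a few lexical features discussed in this report.

    Returns a dict where 1 marks a phishing-style indicator and -1 a
    legitimate-style one, following the dataset's -1/1 coding.
    """
    host = urlparse(url).netloc
    return {
        # A bare IPv4 address as the hostname is a strong phishing signal
        "having_IP_Address": 1 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) else -1,
        # URLs of 54 characters or more are classified as phishing below
        "URL_Length": 1 if len(url) >= 54 else -1,
        # "@" makes the browser ignore everything before it
        "Having_At_Symbol": 1 if "@" in url else -1,
        # A dash in the domain suggests a spoofed prefix/suffix
        "Prefix_Suffix": 1 if "-" in host else -1,
        # A "//" appearing after the protocol part signals a redirect
        "Double_Slash_Redirecting": 1 if url.rfind("//") > 7 else -1,
    }

features = lexical_features("http://125.98.3.123/fake.html")
```

Host-based features (WHOIS dates, registrar) would be filled in separately, since they require a network lookup rather than string inspection.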

3.2 Hardware Requirements :

Processor: 1 gigahertz (GHz) or faster, or System on a Chip (SoC)
RAM: 1 GB for 32-bit or 2 GB for 64-bit
Hard drive space: 16 GB for 32-bit OS; 32 GB for 64-bit OS
Graphics card: DirectX 9 or later with WDDM 1.0 driver
Display: 15.6 inches

3.3 Software Requirements :

Windows 10
Anaconda
Spyder
Jupyter
Python IDLE
Visual Studio Code

4. EXPERIMENTAL INVESTIGATIONS
When you tag a face in a Facebook photo, it is AI that is running behind the scenes and identifying faces in a picture. Face tagging is now omnipresent in several applications that display pictures with human faces. And not just human faces: there are several applications that detect objects such as cats, dogs, bottles, cars, etc. We have autonomous cars running on our roads that detect objects in real time to steer the car. When you travel, you use Google Directions to learn the real-time traffic situation and follow the best path suggested by Google at that point of time. This is yet another implementation of object detection techniques in real time. Another example is the Google Translate application that we typically use while visiting foreign countries.

Returning to Google Directions, you can imagine the complexity involved in developing this kind of application, considering that there are multiple paths to your destination and the application has to judge the traffic situation on every possible path to give you a travel time estimate for each such path. Besides, consider the fact that Google Directions covers the entire globe. Undoubtedly, lots of AI and machine learning techniques are in use under the hood of such applications.

Statistical Techniques

The development of today's AI applications started with the age-old traditional statistical techniques. You must have used straight-line interpolation in school to predict a future value. Some examples of statistical techniques that were used for developing AI applications in those days, and are still in practice, are listed here:

- Classification
- Clustering
- Probability Theories
- Decision Trees

Categories of Machine Learning

Machine learning evolved through several stages. Initially, researchers started out with supervised learning; this is the case of the housing price prediction discussed earlier.

This was followed by unsupervised learning, where the machine is made to learn on its own without any supervision.

Scientists discovered further that it may be a good idea to reward the machine when it does the job the expected way, and there came reinforcement learning.

Very soon, the data available became so humongous that the conventional techniques developed so far failed to analyze this big data and provide us with predictions. The machine now learns on its own using the high computing power and huge memory resources that are available today.

Machine Learning :

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

Supervised Learning :

Supervised learning is analogous to teaching a child to walk. You will hold the child's hand, show him how to take his foot forward, walk yourself for a demonstration, and so on, until the child learns to walk on his own.
Classification :

You may also use machine learning techniques for classification problems. In classification problems, you classify objects of a similar nature into a single group.

For example, in a set of 100 students, you may like to group them into three groups based on their heights: short, medium and tall. Measuring the height of each student, you will place each one in the proper group.

Now, when a new student comes in, you will put him in the appropriate group by measuring his height. By following the principles of regression training, you will train the machine to classify a student based on his feature, the height. When the machine learns how the groups are formed, it will be able to classify any unknown new student correctly.

Once again, you would use the test data to verify that the machine has learned your technique of classification before putting the developed model into production.
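The grouping rule described above can be sketched as a tiny function. The boundary heights are assumed values here; a trained classifier would instead learn them from the labelled training examples:

```python
def assign_group(height_cm, boundaries=(160, 175)):
    """Place a student into short / medium / tall by height (cm).

    The boundary values are assumed for illustration; a trained
    classifier would learn them from the labelled training examples.
    """
    short_max, medium_max = boundaries
    if height_cm < short_max:
        return "short"
    if height_cm < medium_max:
        return "medium"
    return "tall"

# Classifying a new student is just measuring and applying the learned rule
group = assign_group(172)
```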

Supervised learning is where AI really began its journey. This technique was applied successfully in several cases. You have used this model while doing handwriting recognition on your machine. Several algorithms have been developed for supervised learning.

Unsupervised Learning :

In unsupervised learning, we do not specify a target variable to the machine; rather, we ask the machine "What can you tell me about X?". More specifically, we may ask questions such as, given a huge data set X, "What are the five best groups we can make out of X?" or "What features occur together most frequently in X?". To arrive at the answers to such questions, you can understand that the number of data points the machine would require to deduce a strategy would be very large. In the case of supervised learning, the machine can be trained with even a few thousand data points. However, in the case of unsupervised learning, the number of data points that is reasonably accepted for learning starts at a few million. These days, data is generally abundantly available.

The data ideally requires curating; however, with the amount of data continuously flowing through a social network, in most cases data curation is an impossible task. Unsupervised machine learning can determine the boundary between two classes of points (for example, between yellow and red dots in a scatter plot) and then classify new, unseen points with fairly good accuracy.

Unsupervised learning has shown great success in many modern AI applications, such as face detection, object detection, and so on.

Reinforcement Learning :

Consider training a pet dog: we train our pet to bring a ball to us. We throw the ball a certain distance and ask the dog to fetch it back to us. Every time the dog does this right, we reward the dog. Slowly, the dog learns that doing the job right earns a reward, and then the dog starts doing the job the right way every time in the future. Exactly this concept is applied in the "reinforcement" type of learning. The technique was initially developed for machines to play games.

The machine is given an algorithm to analyze all possible moves at each stage of the game. The machine may select one of the moves at random. If the move is right, the machine is rewarded; otherwise it may be penalized. Slowly, the machine will start differentiating between right and wrong moves and, after several iterations, will learn to solve the game puzzle with better accuracy. The accuracy of winning the game improves as the machine plays more and more games.

This technique of machine learning differs from supervised learning in that you need not supply labelled input/output pairs. The focus is on finding the balance between exploring new solutions and exploiting the learned solutions.

Decision Tree:

Decision trees are an important type of algorithm for predictive modeling in machine learning. The classical decision tree algorithms have been around for decades, and modern variations like random forest are among the most powerful techniques available. This chapter covers the decision tree algorithm by its more modern name, CART, which stands for Classification And Regression Trees. After reading this chapter, you will know:

The many names used to describe the CART algorithm for machine learning. The representation used by learned CART models that is actually stored on disk. How a CART model can be learned from training data. How a learned CART model can be used to make predictions on unseen data. Additional resources that you can use to learn more about CART and related algorithms.

If you have taken an algorithms and data structures course, it might be hard to hold you back from implementing this simple and powerful algorithm. And from there, you are a small step away from your own implementation of random forests.

Classification and Regression Trees, or CART for short, is a term introduced by Leo Breiman to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems.

Classically, this algorithm is referred to as "decision trees", but on some platforms like R it is referred to by the more modern term CART. The CART algorithm provides the foundation for important algorithms like bagged decision trees, random forest and boosted decision trees.
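The core of CART learning is a greedy search for the split that best separates the classes, commonly scored with Gini impurity. The following is a simplified sketch of that search (binary 0/1 labels, numeric features; the toy rows are invented for illustration):

```python
def gini(groups):
    """Gini impurity of a candidate split; groups is a list of lists of
    0/1 class labels, one list per branch."""
    n = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        p1 = g.count(1) / len(g)
        score += (1.0 - (p1 ** 2 + (1.0 - p1) ** 2)) * (len(g) / n)
    return score

def best_split(rows):
    """Greedy CART-style search: try every value of every feature as a
    split threshold and keep the one with the lowest Gini impurity."""
    best_feature, best_threshold, best_gini = None, None, 2.0
    n_features = len(rows[0]) - 1  # last column is the class label
    for f in range(n_features):
        for row in rows:
            left = [r[-1] for r in rows if r[f] < row[f]]
            right = [r[-1] for r in rows if r[f] >= row[f]]
            g = gini([left, right])
            if g < best_gini:
                best_feature, best_threshold, best_gini = f, row[f], g
    return best_feature, best_threshold, best_gini

# Toy rows: [URL_length, has_at_symbol, label] with label 1 = phishing
data = [[30, 0, 0], [25, 0, 0], [70, 1, 1], [90, 1, 1]]
feature, threshold, impurity = best_split(data)
```

A full CART implementation applies this search recursively to each branch until a stopping criterion is met; the greediness of this per-split choice is exactly what random forest later compensates for.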

Random Forest :

Random forests are an improvement over bagged decision trees. A problem with decision trees like CART is that they are greedy. As such, even with bagging, the decision trees can have a lot of structural similarities, which in turn results in high correlation in their predictions. Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated, or at best weakly correlated. Random forest changes the way that the sub-trees are learned so that the resulting predictions from all of the subtrees have less correlation. It is a simple tweak.

In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features through which to search. The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm.
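The tweak itself is tiny. A sketch, with the common default of m ≈ √p for classification noted in a comment (the specific numbers are illustrative assumptions):

```python
import math
import random

def split_candidates(n_features, m, rng):
    """The random forest tweak: at each split point the learner may only
    search a random sample of m features, where CART would search all."""
    return rng.sample(range(n_features), m)

n_features = 9
m = int(math.sqrt(n_features))  # a common default for classification
rng = random.Random(0)

# Each split sees a different random subset of the features
first = split_candidates(n_features, m, rng)
second = split_candidates(n_features, m, rng)
```

Because different trees are forced to consider different features, their predictions decorrelate, which is what makes averaging them effective.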

Naive Bayes :

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling. After reading this chapter, you will know: the naive Bayes algorithm for classification; the representation used by naive Bayes that is actually stored when a model is written to a file; how a learned model can be used to make predictions; how you can learn a naive Bayes model from training data; how to best prepare your data for the naive Bayes algorithm; and where to go for more information on naive Bayes.

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. The technique is easiest to understand when described using binary or categorical input values. It is called naive Bayes (or "idiot Bayes") because the calculation of the probabilities for each hypothesis is simplified to make the computation tractable. Rather than attempting to calculate the joint probability of all attribute values P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value and calculated as P(d1|h) × P(d2|h) and so on.

This is a very strong assumption that is most unlikely to hold in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.

Predictions select the maximum a posteriori (MAP) hypothesis: MAP(h) = max(P(d|h) × P(h)), which simplifies to MAP(h) = max(P(d|h)) when the class priors are uniform.

Representation Used:

The representation for naive Bayes is probabilities. A list of probabilities is stored to file for a learned naive Bayes model. This includes:

Class Probabilities: the probability of each class in the training dataset.

Conditional Probabilities: the conditional probability of each input value given each class value.
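Both probability tables can be computed directly from the training data. A small pure-Python sketch with invented two-feature examples using the dataset's -1/1 coding (function names are mine, not from the report):

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Compute exactly what the text describes: class probabilities and
    the conditional probability of each input value given each class."""
    n = len(y)
    class_probs = {c: cnt / n for c, cnt in Counter(y).items()}
    cond_probs = defaultdict(lambda: defaultdict(float))
    for xi, yi in zip(X, y):
        for f, v in enumerate(xi):
            cond_probs[yi][(f, v)] += 1
    for c, counts in cond_probs.items():
        class_count = sum(1 for yi in y if yi == c)
        for k in counts:
            counts[k] /= class_count
    return class_probs, cond_probs

def predict(x, class_probs, cond_probs):
    """MAP decision: pick the class maximising P(class) * prod P(value|class)."""
    scores = {}
    for c, pc in class_probs.items():
        score = pc
        for f, v in enumerate(x):
            score *= cond_probs[c][(f, v)]
        scores[c] = score
    return max(scores, key=scores.get)

# Invented toy data: two -1/1 features per URL
X = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
y = ["phishing", "phishing", "legitimate", "legitimate"]
class_probs, cond_probs = train_naive_bayes(X, y)
```

In practice a smoothing term is usually added so that an unseen feature value does not zero out an entire class score; it is omitted here to keep the representation visible.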

Logistic Regression :

Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values). After reading this chapter you will know:

The many names and terms used when describing logistic regression (like log odds and logit). The representation used for a logistic regression model. Techniques used to learn the coefficients of a logistic regression model from data. How to actually make predictions using a learned logistic regression model. Where to go for more information if you want to dig a little deeper.

Logistic regression uses an equation as its representation, very much like linear regression. Input values (x) are combined linearly using weights or coefficient values to predict an output value (y). A key difference from linear regression is that the output value being modelled is a binary value (0 or 1) rather than a numeric value. Below is an example logistic regression equation:

y = e^(B0 + B1×x) / (1 + e^(B0 + B1×x))

where y is the predicted output, B0 is the bias or intercept term, and B1 is the coefficient for the single input value (x). Each column in your input data has an associated B coefficient (a constant real value) that must be learned from your training data. The actual representation of the model that you would store in memory or in a file is the coefficients in the equation (the beta values, or B's).
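The prediction step is just that equation evaluated with learned coefficients. A sketch with hypothetical coefficient values (a real model would learn b0 and b1 from the training data):

```python
import math

def predict_probability(x, b0, b1):
    """Logistic regression prediction for one input value:
    y = e^(B0 + B1*x) / (1 + e^(B0 + B1*x)), always between 0 and 1."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients, just to show the shape of the output
p_mid = predict_probability(0.0, b0=0.0, b1=1.0)    # input at the decision boundary
p_high = predict_probability(10.0, b0=0.0, b1=1.0)  # far on the positive side
```

The output is interpreted as the probability of the positive class; thresholding it at 0.5 turns the probability into a 0/1 prediction.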

Support Vector Machine (SVM) :

SVM can be used for both regression and classification tasks. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

Support vectors are the data points that are closest to the hyperplane and influence its position and orientation. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors would change the position of the hyperplane. These are the points that help us build our SVM.

An SVM model can be built from scratch using the numpy library, or by using the scikit-learn library and simply calling the related functions.
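At prediction time, a learned linear SVM reduces to the sign of the hyperplane function w·x + b. A sketch with hypothetical learned weights (with scikit-learn, `sklearn.svm.SVC` would fit w and b for you):

```python
def svm_decision(x, w, b):
    """Classify by the sign of the hyperplane function w·x + b; the weights
    w and bias b would normally be learned from the support vectors."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical learned hyperplane over two features
w, b = [0.8, -0.5], 0.1
label_pos = svm_decision([1.0, 0.2], w, b)
label_neg = svm_decision([-1.0, 0.5], w, b)
```

The magnitude of the score also tells you how far a point lies from the hyperplane, which is why points near the margin are the informative ones.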

K-Means :

All the algorithms discussed so far are supervised; that is, they assume that labelled training data is available. In many applications this is too much to hope for: labelling may be expensive, error prone, or sometimes impossible. For instance, it is very easy to crawl and collect every page within the www.purdue.edu domain, but rather time consuming to assign a topic to each page based on its contents.

In such cases, one has to resort to unsupervised learning. A prototypical unsupervised learning algorithm is K-means, which is a clustering algorithm. Given X = {x1, . . . , xm}, the goal of K-means is to partition it into k clusters such that each point in a cluster is more similar to points from its own cluster than to points from any other cluster.

Basic Algorithm

Towards this end, define prototype vectors µ1, . . . , µk and an indicator variable rij which is 1 if, and only if, xi is assigned to cluster j. To cluster our dataset we minimize the following distortion measure, which minimizes the distance of each point from its prototype vector:

J = Σi Σj rij ‖xi − µj‖²
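Minimizing the distortion measure alternates between assigning each point to its nearest prototype and moving each prototype to the mean of its cluster. A one-dimensional sketch (the toy points and starting prototypes are invented):

```python
def kmeans(points, centroids, iters=10):
    """Plain K-means (Lloyd's algorithm) on 1-D points: assign each point
    to its nearest prototype, then move each prototype to the mean of its
    cluster; each step decreases the distortion measure."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

# Two obvious clusters around 1 and 10; the starting prototypes are arbitrary
centers = kmeans([0.9, 1.1, 1.0, 9.9, 10.1, 10.0], centroids=[0.0, 5.0])
```

The result depends on the initial prototypes, which is why K-means is usually restarted several times from random initializations.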

ADABOOST:

AdaBoost is best used to boost the performance of decision trees on binary classification problems. AdaBoost was originally called AdaBoost.M1 by the authors of the technique, Freund and Schapire. More recently it may be referred to as discrete AdaBoost because it is used for classification rather than regression.

AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners: models that achieve an accuracy just above random chance on a classification problem. The most suited, and therefore most common, algorithm used with AdaBoost is the decision tree with one level. Because these trees are so short and contain only one decision for classification, they are often called decision stumps.

Each instance in the training dataset is weighted. The initial weight is set to:

weight(xi) = 1/n

where xi is the i'th training instance and n is the number of training instances.
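A sketch of the instance weighting: initial weights of 1/n, followed by one common form of the update that increases the weight of misclassified instances after each boosting stage (the toy error pattern and stage value are invented for illustration):

```python
import math

def initial_weights(n):
    """Each training instance starts with the same weight 1/n."""
    return [1.0 / n] * n

def reweight(weights, misclassified, stage):
    """One common form of the AdaBoost update: misclassified instances get
    their weight multiplied by e^stage, correct ones by e^-stage, and the
    weights are renormalised so the next weak learner focuses on hard cases."""
    w = [wi * math.exp(stage if miss else -stage)
         for wi, miss in zip(weights, misclassified)]
    total = sum(w)
    return [wi / total for wi in w]

w = initial_weights(4)  # [0.25, 0.25, 0.25, 0.25]
w = reweight(w, [True, False, False, False], stage=0.5)
```

After the update, the one misclassified instance carries more weight than each correctly classified one, which is what pushes the next stump to focus on it.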

Data Preparation for AdaBoost: This section lists some heuristics for best preparing your data for AdaBoost.

Quality Data: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of high quality.

Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct unrealistic cases. These could be removed from the training dataset.

Noisy Data: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean it from your training dataset.

Features Description :
Index → It is just like a serial number.

having_IP_Address → An IP address, or "IP", is a unique address that identifies a device on the Internet or a local network. It allows a system to be recognized by other systems connected via the Internet protocol. There are two primary types of IP address formats used today: IPv4 and IPv6.

A domain name is the address where Internet users can access your website. A domain name is used for finding and identifying computers on the Internet. ... Because of this, domain names were developed and used to identify entities on the Internet rather than IP addresses.

Getting a domain name involves registering the name you want, say "example.com", with a registrar and paying a registration fee of around US$10 to US$35 for that name.

If an IP address is used as an alternative to the domain name in the URL, such as "http://125.98.3.123/fake.html", users can be sure that someone is trying to steal their personal information.

Sometimes, the IP address is even transformed into hexadecimal code, as in the following link: "http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html". Using the IP address instead of the domain name in the URL can fool users and hence steal sensitive information.
Count of -1 in dataset → 7262
Count of 1 in dataset → 3793
URL_Length → Phishers can use a long URL to hide the doubtful part in the address bar.
For example:
http://federmacedoadv.com.br/3f/aze/ab51e2e319e51502f416dbe46b773a5e/?cmd=_home&dispatch=11004d58f5b74f8dc1e7c2e8dd4105e811004d58f5b74f8dc1e7c2e8dd4105e8@phishing.website.html
To ensure the accuracy of our study, we calculated the length of the URLs in the dataset and produced an average URL length. The results showed that if the length of the URL is greater than or equal to 54 characters, the URL is classified as phishing. By reviewing our dataset we were able to find 1220 URLs with lengths equal to 54 or more, which constitute 48.8% of the total dataset size.
Count of 1 in dataset → 1960
Count of -1 in dataset → 8960
Shortining_Service → URL shortening is a method on the World Wide Web in which a URL may be made considerably smaller in length and still lead to the required webpage. This is accomplished by means of an HTTP redirect on a domain name that is short, which links to the webpage that has a long URL. For example, the URL "http://portal.hud.ac.uk/" can be shortened to "bit.ly/19DXSk4".
Count of 1 in dataset → 9611
Count of -1 in dataset → 1441

Having_At_Symbol → Using the "@" symbol in the URL leads the browser to ignore everything preceding the "@" symbol, and the real address often follows the "@" symbol. Simply put, the browser will ignore the text preceding the @ symbol in the URL. This can also be used to hide the suspicious part.
Count of 1 in dataset → 9400
Count of -1 in dataset → 1655

Double_Slash_Redirecting → The existence of "//" within the URL path means that the user will be redirected to another website. An example of such a URL is: "http://www.legitimate.com//http://www.phishing.com". We examine the location where the "//" appears: if the URL starts with "HTTP", the "//" should appear in the sixth position; however, if the URL employs "HTTPS", then the "//" should appear in the seventh position.
Count of 1 in dataset → 9626
Count of -1 in dataset → 1429
Prefix_Suffix → The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes, separated by (-), to the domain name so that users feel that they are dealing with a legitimate webpage.
For example: http://www.Confirme-paypal.com/
Count of 1 in dataset → 1465
Count of -1 in dataset → 9590

Having_Sub_Domain → A sub-domain is a domain that is part of a larger domain; the only domain that is not also a sub-domain is the root domain. For example, west.example.com and east.example.com are sub-domains of the example.com domain, which in turn is a sub-domain of the com top-level domain (TLD).
Let us assume we have the following link: http://www.hud.ac.uk/students/. A domain name might include the country-code top-level domain (ccTLD), which in our example is "uk". The "ac" part is shorthand for "academic", the combined "ac.uk" is called a second-level domain (SLD), and "hud" is the actual name of the domain. To produce a rule for extracting this feature, we first have to omit the "www." from the URL, which is in fact a sub-domain in itself. Then, we have to remove the ccTLD.
Count of -1 in dataset → 3363
Count of 0 in dataset → 3622
Count of 1 in dataset → 4070

SSLfinal_State → The Secure Sockets Layer / Transport Layer Security system that underpins secure connections on the Web does more than just scramble information. It also checks the identities of the sites to which you securely connect, to ensure that they are who they say they are. Those proofs of identity, called certificates, get stored in your computer's memory until you restart it or clear the SSL state. An SSL certificate will be stored in your computer's cache until you have restarted your computer. This means that as long as you use your computer and do not shut it down, the credentials stored in the cache remain there, making it possible for the SSL program to access the website more easily.
Note, however, that SSL by itself does not detect phishing websites. SSL (whose newer version is called TLS) simply ensures a safe, encrypted connection between the client (your browser) and the server (the website you are visiting).
Count of -1 in dataset → 3557
Count of 0 in dataset → 1167
Count of 1 in dataset → 6331

Domain_registration_length → Based on the fact that a phishing website lives for a short period of time, it was believed that trustworthy domains are regularly paid for several years in advance. But in the dataset, we find that the longest-lived fraudulent domains have been used for one year only.
Count of 1 in dataset → 3666
Count of -1 in dataset → 7389

Favicon → A favicon is a graphic image (icon) associated with a specific webpage. Many existing user agents, such as graphical browsers and newsreaders, show the favicon as a visual reminder of the website identity in the address bar. If the favicon is loaded from a domain other than the one shown in the address bar, then the webpage is likely to be a phishing attempt.
Count of 1 in dataset → 9002
Count of -1 in dataset → 2053

Port → This feature is useful in validating whether a particular service (e.g. HTTP) is up or down on a specific server. With the aim of controlling intrusions, it is much better to open only the ports that you need. Several firewalls, proxy and Network Address Translation (NAT) servers will, by default, block all or most of the ports and only open the ones selected. If all ports are open, phishers can run almost any service they want and, as a result, user information is threatened.
Count of -1 in dataset → 1502
Count of 1 in dataset → 9553
HTTPS_token → The phishers may add the "HTTPS" token to the domain part of a URL in order to trick users.
For example:
http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/
Count of -1 in dataset → 1796
Count of 1 in dataset → 9259

Request_URL → Request URL examines whether the external objects contained within a webpage, such as images, videos and sounds, are loaded from another domain. In legitimate webpages, the webpage address and most of the objects embedded within the webpage share the same domain.
Count of -1 in dataset → 4495
Count of 1 in dataset → 6560
URL_of_Anchor → An anchor is an element defined by the <a> tag. If the URL of the anchor has a maximum number of links then it is considered a phishing URL. This feature is treated exactly as "Request URL"; however, for this feature we examine:
If the <a> tags and the website have different domain names (this is similar to the Request URL feature).
If the anchor does not link to any webpage, for example:
1. <a href="#">
2. <a href="#content">
3. <a href="#skip">
4. <a href="JavaScript::void(0)">
Count of -1 in dataset → 3282
Count of 0 in dataset → 5337
Count of 1 in dataset → 2436
Links_in_tags → Given that our investigation covers all the angles likely to be used in the webpage source code, we find that it is common for legitimate websites to use tags to offer metadata about the HTML document.
Count of -1 in dataset → 3956
Count of 0 in dataset → 4449
Count of 1 in dataset → 2650

SFH → Server Form Handlers (SFH) that contain an empty string
or “about:blank” are considered doubtful, because an action should
be taken upon the submitted information. In addition, if the domain
name in the SFH is different from the domain name of the webpage,
this reveals that the webpage is suspicious, because submitted
information is rarely handled by external domains.
Count of -1 in dataset → 8440
Count of 0 in dataset → 761
Count of 1 in dataset → 1854
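The SFH rule maps directly onto a small classification function. A minimal sketch, assuming the form's `action` attribute and the page's domain are already extracted (the `sfh_feature` name is illustrative):

```python
from urllib.parse import urlparse

def sfh_feature(form_action, page_domain):
    """Server Form Handler feature: -1 phishing, 0 suspicious,
    1 legitimate. Empty or "about:blank" handlers indicate phishing;
    handlers on a foreign domain are suspicious."""
    action = (form_action or "").strip().lower()
    if action in ("", "about:blank"):
        return -1
    netloc = urlparse(action).netloc
    if netloc and netloc != page_domain:
        return 0
    return 1

print(sfh_feature("about:blank", "shop.example"))                     # -1
print(sfh_feature("https://collector.example/post", "shop.example"))  # 0
print(sfh_feature("/login", "shop.example"))                          # 1
```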
Submitting_to_email → A web form allows a user to submit
personal information that is directed to a server for processing. A
phisher might instead redirect the user's information to his personal
email. To that end, a server-side scripting language might be used,
such as the “mail()” function in PHP; a client-side function that
might be used for the same purpose is “mailto:”.
Count of -1 in dataset → 2014
Count of 1 in dataset → 9041
Abnormal_URL → Abnormal usage of a website can be detected
from its URL traffic, raising an alert for software service users:
we calculate the number of URLs loaded by any single user,
compare this to the number of URLs with sensitive information
(price pages in our case), and raise an alert if this exceeds the
threshold. This feature can be extracted from the WHOIS database.
For a legitimate website, identity is typically part of its URL.
WHOIS (pronounced "who is") is an Internet service used to
look up information about a domain name.
Count of -1 in dataset → 1629
Count of 1 in dataset → 9426
Redirect → The fine line that distinguishes phishing websites
from legitimate ones is how many times a website has been
redirected. URL redirect (also referred to as URL forwarding) is a
technique used to send a domain's visitors to a different URL; you
can forward your domain name to any website or webpage that is
available online. Redirects use status codes defined within the HTTP
protocol. In our dataset, we find that legitimate websites have been
redirected at most once. On the other hand, phishing websites
containing this feature have been redirected at least 4 times.
Count of 0 in dataset → 9776
Count of 1 in dataset → 1259
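Following the observation above (at most one redirect for legitimate sites, at least four for phishing), the check can be sketched as a simple rule. Note this three-way rule is illustrative: the value for counts of 2-3 is an assumption filling the gap the text leaves, and the dataset itself encodes this feature with only 0 and 1:

```python
def redirect_feature(redirect_count):
    """1 if the page redirects at most once (legitimate pattern),
    -1 if it redirects 4 or more times (phishing pattern),
    0 for the in-between range (assumed 'suspicious' bucket)."""
    if redirect_count <= 1:
        return 1
    if redirect_count >= 4:
        return -1
    return 0

print(redirect_feature(0))  # 1
print(redirect_feature(5))  # -1
```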
On_mouseover → Phishers may use JavaScript to show a fake
URL in the status bar to users. To extract this feature, we must dig
out the webpage source code, particularly the “onMouseOver”
event, and check whether it makes any changes to the status bar.
Count of -1 in dataset → 1315
Count of 1 in dataset → 9740
Right_Click → Phishers use JavaScript to disable the right-click
function, so that users cannot view and save the webpage source
code. This feature is treated exactly as “Using onMouseOver to
hide the Link”. Nonetheless, for this feature we search for the
event “event.button==2” in the webpage source code and check
whether the right click is disabled.
Count of -1 in dataset → 476
Count of 1 in dataset → 10579
Pop_Up_Window → It is unusual for a legitimate website to ask
users to submit their personal information through a pop-up
window. On the other hand, this feature has been used in some
legitimate websites, mainly to warn users about fraudulent
activities or broadcast a welcome announcement, though no
personal information is asked for through these pop-up windows.
Count of -1 in dataset → 2137
Count of 1 in dataset → 8918
Iframe → IFrame is an HTML tag used to display an additional
webpage inside the one that is currently shown. Phishers can make
use of the “iframe” tag and make it invisible, i.e. without frame
borders. In this regard, phishers make use of the “frameBorder”
attribute, which controls whether the browser renders a visual
delineation.
Count of -1 in dataset → 1012
Count of 1 in dataset → 10043
Age_of_domain → In simple terms, “domain age” refers to the
amount of time during which a domain name has existed, i.e. how
old a domain name is. For example, if a domain name was
registered in 2010, the domain age will be 10 years by 2020.
Most phishing websites live for a short period of time. By
reviewing our dataset, we find that the minimum age of a
legitimate domain is 6 months.
Count of -1 in dataset → 5189
Count of 1 in dataset → 5866
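Using the 6-month minimum observed above, the feature reduces to a date comparison. A minimal sketch, assuming the registration date has already been obtained (in practice it would come from a WHOIS lookup; the function name is illustrative):

```python
from datetime import date

def age_of_domain_feature(registration_date, today=None):
    """-1 (phishing indicator) if the domain is younger than 6 months,
    else 1, matching the minimum legitimate age observed above."""
    today = today or date.today()
    return 1 if (today - registration_date).days >= 183 else -1  # ~6 months

# The example from the text: registered in 2010, checked in 2020.
print(age_of_domain_feature(date(2010, 1, 1), today=date(2020, 1, 1)))   # 1
print(age_of_domain_feature(date(2019, 12, 1), today=date(2020, 1, 1)))  # -1
```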
DNS_Record → A records are the most basic type of DNS record
and are used to point a domain or sub-domain to an IP address.
Assigning a value to an A record is as simple as providing your
DNS management panel with the IP address to which the domain
or sub-domain should point, and a TTL. The TTL ('Time To Live')
value indicates the amount of time the record is cached by a DNS
server, such as your Internet service provider's. The default (and
lowest accepted) value is 14400 seconds (4 hours). For phishing
websites, either the claimed identity is not recognized by the
WHOIS database (WHOIS 2005) or no record is found for the
hostname (Pan and Ding 2006). If the DNS record is empty or not
found, the website is classified as “Phishing”; otherwise it is
classified as “Legitimate”.
Count of -1 in dataset → 3443
Count of 1 in dataset → 7612
Web_traffic → This feature measures the popularity of the
website by determining the number of visitors and the number of
pages they visit. However, since phishing websites live for a short
period of time, they may not be recognized by the Alexa database
(Alexa the Web Information Company, 1996). By reviewing our
dataset, we find that, in the worst cases, legitimate websites rank
among the top 100,000. Furthermore, if the domain has no traffic
or is not recognized by the Alexa database, it is classified as
“Phishing”; otherwise, it is classified as “Suspicious”.
Count of -1 in dataset → 2655
Count of 1 in dataset → 5831
Count of 0 in dataset → 2569
Page_Rank → PageRank is a value ranging from “0” to “1” that
aims to measure how important a webpage is on the Internet: the
greater the PageRank value, the more important the webpage. In
our datasets, we find that about 95% of phishing web pages have
no PageRank. Moreover, the remaining 5% of phishing web pages
may reach a PageRank value of up to “0.2”.
Count of -1 in dataset → 8201
Count of 1 in dataset → 2854
Google_Index → This feature examines whether a website is in
Google's index or not. When a site is indexed by Google, it is
displayed in search results (Webmaster resources, 2014). Usually,
phishing web pages are accessible only for a short period and, as
a result, many phishing web pages may not be found in the
Google index.
Count of -1 in dataset → 1539
Count of 1 in dataset → 9516
Links_pointing_to_page → The number of links pointing to a
webpage indicates its legitimacy level, even if some links are from
the same domain (Dean, 2014). In our datasets, due to their short
life span, we find that 98% of phishing dataset items have no links
pointing to them. On the other hand, legitimate websites have at
least 2 external links pointing to them.
Count of -1 in dataset → 1539
Count of 1 in dataset → 9516
Statistical_report → Several parties, such as PhishTank (PhishTank
Stats, 2010-2012) and StopBadware (StopBadware, 2010-2012),
formulate numerous statistical reports on phishing websites at
regular intervals; some are monthly and others quarterly. In our
research, we used 2 forms of the top-ten statistics from PhishTank,
“Top 10 Domains” and “Top 10 IPs”, according to statistical reports
published over the last three years, from January 2010 to November
2012, whereas for StopBadware we used the “Top 50” IP addresses.
Count of -1 in dataset → 1550
Count of 1 in dataset → 9505
Result → In the dataset, the target values are:
1 indicates Legitimate
0 indicates Suspicious
-1 indicates Phishing
Count of -1 in dataset → 4898
Count of 1 in dataset → 6157
5. FLOWCHART

Detecting phishing domains is a classification problem, so we
need labelled data, with samples of phishing domains and
legitimate domains, for the training phase. The dataset used in
the training phase is a very important point in building a
successful detection mechanism. We have to use samples whose
classes are precisely known: samples labelled as phishing must
truly be phishing, and samples labelled as legitimate must truly
be legitimate. The system will not work correctly if we use
samples that we are not sure about.
Collecting legitimate domains is another problem. For this
purpose, site reputation services are commonly used. These
services analyse and rank available websites; the ranking may be
global or country-based, and the ranking mechanism depends on
a wide variety of features. Websites with high rank scores are
legitimate sites that are used very frequently. One of the
well-known reputation ranking services is Alexa, and researchers
use Alexa's top lists for legitimate sites.
When we have raw data for phishing and legitimate sites, the
next step is to process these data and extract meaningful
information from them to detect fraudulent domains. The dataset
to be used for machine learning must actually contain these
features. So we must process the raw data collected from Alexa,
PhishTank or other data sources, and create a new dataset to
train our system with machine learning algorithms. The feature
values should be selected according to our needs and purposes
and calculated for every sample.
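The processing step described above — turning raw URLs from PhishTank and Alexa into a labelled feature matrix — can be sketched as follows. The helper name, the two toy feature functions and the URL-length threshold are illustrative placeholders, not the project's actual extraction code:

```python
def build_dataset(urls_with_labels, feature_funcs):
    """Turn raw (url, label) pairs into labelled feature vectors.
    feature_funcs is a list of functions url -> {-1, 0, 1}; labels are
    1 (legitimate, e.g. from Alexa top lists) or -1 (phishing, e.g.
    from PhishTank)."""
    X, y = [], []
    for url, label in urls_with_labels:
        X.append([f(url) for f in feature_funcs])
        y.append(label)
    return X, y

# Two toy features standing in for the ones described above.
has_ip = lambda u: -1 if u.split("//")[-1].split("/")[0].replace(".", "").isdigit() else 1
long_url = lambda u: -1 if len(u) > 75 else 1

X, y = build_dataset(
    [("http://192.168.0.1/login", -1), ("https://example.com", 1)],
    [has_ip, long_url],
)
print(X, y)  # [[-1, 1], [1, 1]] [-1, 1]
```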
6. RESULT

Model                 Split    Accuracy   Precision   AUC      Recall
KNN                   80/20    0.6441     0.6371      0.6400   0.6790
Logistic Regression   80/20    0.9209     0.9026      0.9200   0.9275
SVM classifier        80/20    0.5536     1.0         -        0.5536
Naive Bayes           80/20    0.8702     0.7928      0.8741   0.9475
Decision Tree         80/20    0.9525     0.9375      0.9516   0.9605
Random Forest         80/20    0.9720     0.9635      0.9715   0.9755
AdaBoost              80/20    0.9086     0.8869      0.9075   0.9182
KNN                   70/30    0.6186     0.6106      0.6143   0.6643
Logistic Regression   70/30    0.9207     0.9073      0.9202   0.9239
SVM classifier        70/30    0.5598     1.0         -        0.5598
Naive Bayes           70/30    0.8722     0.7994      0.8747   0.9475
Decision Tree         70/30    0.9512     0.9398      0.9506   0.9554
Random Forest         70/30    0.9674     0.9626      0.9678   0.9654
AdaBoost              70/30    0.9096     0.8967      0.9092   0.9117
KNN                   90/10    0.9132     0.8917      0.9123   0.9211
Logistic Regression   90/10    0.6700     0.6413      0.6674   0.7056
SVM classifier        90/10    0.5497     1.0         -        0.5497
Naive Bayes           90/10    0.8553     0.7658      0.8622   0.9427
Decision Tree         90/10    0.9575     0.9429      0.9567   0.9652
Random Forest         90/10    0.8987     0.8714      0.8974   0.9120

(Scores are rounded to four decimal places; "Split" is the
train/test percentage. No AUC score was reported for the SVM runs.)
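The accuracy, precision and recall scores in the table above are standard classification metrics. As a reminder of how they are computed for this dataset's -1/1 labelling, here is a small pure-Python sketch; in practice scikit-learn's metrics module would be used, and AUC is omitted because it requires ranked scores rather than hard labels:

```python
def classification_scores(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary {-1, 1} labelling,
    as reported in the results table."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, 1]
print(classification_scores(y_true, y_pred))  # (0.6666666666666666, 0.75, 0.75)
```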
7. ADVANTAGES & DISADVANTAGES

7.1 ADVANTAGES:

• This system can be used by many e-commerce, e-banking and
other websites in order to maintain a good customer relationship.
• Users can make online payments securely.
• The data mining algorithm used in this system provides better
performance compared to traditional classification algorithms.
• With the help of this system, users can also purchase products
online without any hesitation.
• Builds a secure connection between the user's Mail Transfer
Agent (MTA) and Mail User Agent (MUA).
7.2 DISADVANTAGES:

• Rules require manual adjustment and do not look at page content.
• False positives can be hidden from the user.
• Content from the body of the email is not used, making the
approach susceptible to short-lived phishing domains.
• Users often do not pay attention to warnings, and not all email
clients are browser-based.
• Presentation time is high, the approach is susceptible to screen
resolution, and some tools are specific to particular websites.
8. APPLICATIONS

The popularity of applications on social networking websites has
increased a great deal this year. This has led to a new wave of
phishing attacks targeting the users of these applications.
Symantec has examined phishing websites exploiting three
major social networking brands. The fake websites display
attractive offers on the social networking applications to lure end
users. Some of the applications that the phishing sites were
based on are:

1. Social networking on mobile – Due to the rise in the number of
users accessing the Internet through smart phones, social
networking websites have expanded their services on smart phones,
including messaging, chatting, photo viewing, etc. This increase in
users has opened more doors to attackers because there are now
more potential victims. Hence, attackers have created phishing
websites on social networking brands claiming to provide these
services on smart phones.

2. Live chat – In November, Symantec observed that five percent of
the targeted applications were on live chat, and among them adult
sex chat was the most common target. The phishing attacks show
fake offers of free sex chat to lure end users into entering their login
credentials.

3. Blogging – Phishing websites that attacked blogging in social
networking comprised 23 percent of all targeted applications.
Various attractive blog topics are used in the login pages of the
phishing site as a means to con end users. Pornographic material is
one of the most common topics observed in these phishing
attempts.

4. Gaming – In 2009, gaming became an increasingly popular
aspect of social networking. Symantec evaluated gaming and found
that it comprised 13 percent of the targeted applications. Gaming
applications in social networking generally require various kinds of
credit points to progress to higher levels of the game. Some of these
credit points typically require online payment. The phishing
websites trick users by providing fake offers of free credit points on
these gaming applications.

Internet users are advised to follow best practices to avoid phishing
attacks. Here are some basic tips on avoiding online scams:
• Do not click on suspicious links from emails.
• Check the URL of the website and make sure that it belongs to
the brand.
• Type the domain name of your brand directly in your browser
rather than following any link.
• Frequently update your security software, such as Norton
Internet Security 2009, which protects you from online phishing.
9. CONCLUSION
The main objective of this study is to help the users to
differentiate between the phishing and legitimate URLs by
inspecting the URLs based on particular unique characteristics.
This research demonstrates the capability to recognize fake web
pages based on their URLs. In order to protect the victim from
phishing attacks, educational awareness programs must be
conducted.
All internet users must follow the security tips given by
educational experts during the awareness programs. Users
should also be very well trained to recognize a fake website so
that they would not enter their personal details believing it is a
legitimate website. It is mandatory to inspect the URL link before
entering any website.

The most important way to protect the user from phishing attack
is the education awareness. Internet users must be aware of all
security tips which are given by experts.

This problem can be addressed by using machine learning
algorithms with a classifier. Existing classifiers already give a
good prediction rate for phishing, but our survey suggests it
would be better to use a hybrid approach for prediction and
further improve the accuracy of phishing-website detection. We
have seen that the existing system gives less accuracy, so we
proposed a new phishing-detection method that employs
URL-based features, and we generated classifiers through several
machine learning algorithms.

10. FUTURE SCOPE


In future work, automatic detection of web pages and the web
browser extension can be done. Further work can also be done by
adding several other features to distinguish the fake web page from
a legitimate web page.

There are many features that can be improved in this work for
various other issues. The heuristics can be further developed to
detect phishing attacks in the presence of embedded objects like
Flash. Identity extraction is an important operation, and it can be
improved with an Optical Character Recognition (OCR) system to
extract the text contained in images.

11. BIBLIOGRAPHY
• https://www.ijert.org/detection-of-url-based-phishing-attacks-using-machine-learning
• https://www.ijeat.org/wp-content/uploads/papers/v8i2s/B11031282S18.pdf
• https://towardsdatascience.com/phishing-domain-detection-with-ml-5be9c99293e5
• https://www.hindawi.com/journals/scn/2017/5421046/
• http://phishtank.com
• https://blogger.com
• https://www.alexa.com/
• https://en.wikipedia.org/wiki/Phishing
• Almomani, Ammar, et al. "Evolving fuzzy neural network for
phishing emails detection." Journal of Computer Science 8.7
(2012): 1099.
• Sananse, Bhagyashree E., and Tanuja K. Sarode. "Phishing URL
Detection: A Machine Learning and Web..." Signals, Controls and
Computation (EPSCICON), 2012 International Conference on.
IEEE, 2012.
• Jain, A. K., & Gupta, B. B. (2016, March). Comparative analysis
of features based machine learning approaches for phishing
detection. In 2016 3rd International Conference on Computing
for Sustainable Global Development (INDIACom) (pp. 2125-2130).
IEEE.

APPENDIX

Algorithms applied:

Logistic Regression
K-Nearest Neighbours
Decision Tree
Random Forest
SVM
AdaBoost
Naïve Bayes

Among all the algorithms, the RANDOM FOREST classifier is best
suited for the detection of phishing websites using machine
learning.