
PHISHING DETECTION SYSTEM

THROUGH HYBRID MACHINE LEARNING BASED ON URL

Department of Computer Science and Engineering, DNR College of Engineering and Technology,
Bhimavaram.
---------------------------------------------------------------------***--------------------------------------------------------------------
ABSTRACT
Online phishing is one of the most common attacks on the modern internet. The goal of
phishing uniform resource locators (URLs) is to steal personal data, including login
credentials and credit card numbers. As technology keeps growing, phishing strategies
have developed rapidly.
Machine learning provides an effective means of detecting phishing attacks. In this project,
we have built a phishing website detection application using FastAPI. We have used several
libraries and two algorithms, Logistic Regression and Multinomial Naive Bayes (Multinomial NB).
The purpose of this project is to classify website URLs as good or bad.
We gathered data to create a dataset of malicious links and curated it for the machine
learning model.
Keywords—Phishing; URL features; machine learning; phishing detection

INTRODUCTION
In the modern era, phishing has become a main area of concern for security researchers because it
is not difficult to create a fake website that looks very close to a legitimate one. Experts can
identify fake websites, but not all users can, and such users become victims of phishing attacks.
The main purpose of the attacker is to steal bank account credentials. Attackers typically begin by
sending spam email. The email may claim, for example, that your university network password will
expire in 24 hours, and it provides a link to update the password and log in. When the user clicks
that link, they are redirected to a page hosted on the attacker's server, where the data they
enter can be stolen.
In our project we predict whether website uniform resource locators (URLs) are good or bad. The set
of phishing URLs is gathered from an open-source service called PhishTank. URLs with zero malicious
detections were classified as benign and URLs with no fewer than eight detections were classified
as malicious. A benign URL is labeled ‘0’ and a phishing URL is labeled ‘1’. We study several
machine learning algorithms for analysing URL characteristics in order to get a good understanding
of the structure of the URLs that spread phishing. Phishing attacks succeed because of a lack of
user awareness. Since phishing attacks exploit weaknesses found in users, they are very difficult
to mitigate, but it is vital to improve phishing detection techniques. The general technique for
detecting phishing websites is to add blacklisted URLs and Internet Protocol (IP) addresses to the
antivirus database, which is known as the “blacklist” technique. To evade blacklists, attackers use
creative techniques to fool users, modifying the URL to appear legitimate via obfuscation and many
other simple techniques such as fast-flux, in which proxies are automatically generated to host the
web page, and algorithmic generation of new URLs.
To avoid getting phished, people should understand what phishing websites are and how they look in
a web browser. Companies can also blacklist phishing websites or detect them early, using machine
learning and deep neural network algorithms to build a model that can classify URLs as phishing.
Machine learning techniques have proven to be more efficient than the other techniques.

PROPOSED SYSTEM
The suggested system will have a client-server design. On the client side, a Chrome extension will
send the Uniform Resource Locator (URL) and the page source of the web page that the user is
presently visiting to the server. On the server side, a cloud-based phishing site detection model
will be deployed, which is trained using the random forest algorithm.

Training Dataset:
The dataset needed to train the model should be large, with two distinct classes: legitimate and
phishing. Furthermore, the dataset should include a balanced mix of legitimate and phishing sites.
PhishTank will be the main source of the phishing URLs. For legitimate pages, Alexa, Statista,
and Similarweb can be used to get pages with high traffic and good ranking, as these pages have a
very low probability of being phishing web pages. This is because malicious websites have less
traffic and a lower ranking on search engines due to their limited life span. As a result, a
dataset with 40000 URLs can be formed.
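
The dataset construction described above can be sketched as follows. This is a minimal illustration that assumes the phishing and legitimate URLs have already been downloaded into two lists; the function name `build_balanced_dataset` and the example URLs are hypothetical, and the labels follow the convention of 0 for legitimate and 1 for phishing.

```python
import random


def build_balanced_dataset(phishing_urls, legitimate_urls, seed=0):
    """Return a shuffled list of (url, label) pairs with equal class sizes.

    Labels: 0 = legitimate, 1 = phishing.
    """
    rng = random.Random(seed)
    # De-duplicate each source; sorting keeps the sampling reproducible.
    phishing = sorted(set(phishing_urls))
    # A URL reported as phishing is never kept as a legitimate example.
    legitimate = sorted(set(legitimate_urls) - set(phishing))
    # Down-sample the larger class so that both classes have the same size.
    n = min(len(phishing), len(legitimate))
    rows = [(url, 1) for url in rng.sample(phishing, n)]
    rows += [(url, 0) for url in rng.sample(legitimate, n)]
    rng.shuffle(rows)
    return rows


# Hypothetical example inputs (one phishing URL is duplicated on purpose).
phish = [
    "http://paypa1-login.example/verify",
    "http://secure-update.example/a",
    "http://secure-update.example/a",
]
legit = [
    "https://www.example.com/",
    "https://www.example.org/",
    "https://www.example.net/",
]
dataset = build_balanced_dataset(phish, legit)
```

Down-sampling the larger class is only one way to obtain a balanced mix; other balancing strategies would work equally well here.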

Feature Selection:
Feature selection is the process of selecting the best set of features for model training. It is an
important step because it has a large impact on model accuracy in the real world. The proposed
system uses URL-based and content-based features.

URL based features:


These are features obtained from the URL that the user is currently visiting.
1) Protocol check: Checks whether the protocol used is "https".
2) Word count: Number of words obtained after splitting the URL on special characters.
3) Average word length: Average length of the words obtained after splitting.
4) Character count: Total number of characters present in the URL.
5) Digit count: Total number of digits present in the URL.
6) Special character count: Total number of special characters present in the URL.
7) Keyword count: Number of keywords such as login, gift, secure, etc.
8) Brand name count: Number of brand names such as facebook, gmail, etc.
9) Look-alike keyword count: Number of words that resemble keywords such as login, secure, etc.
10) Look-alike brand name count: Number of words that resemble brand names such as facebook,
instagram, etc.
11) Random words: Number of words that are neither keywords nor brand names.
12) Length of file path: The length of the path component of the URL.
13) Top-level domain check: Verifies whether the top-level domain is one of the most widely used,
such as com, edu, org, etc.
14) Occurrence of subdomains: Subdomains usually occur more often in malicious sites.
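
Most of the URL-based features listed above can be computed with the Python standard library. The sketch below is illustrative only: the function name `url_features`, the word lists, and the feature names are assumptions, since the paper does not publish its dictionaries. The look-alike and random-word counts are omitted because they depend on the word decomposer described under data pre-processing.

```python
import re
from urllib.parse import urlparse

# Illustrative word lists (assumed; not the dictionaries used in the paper).
KEYWORDS = {"login", "gift", "secure"}
BRANDS = {"facebook", "gmail", "instagram"}
COMMON_TLDS = {"com", "edu", "org"}


def url_features(url):
    """Compute a subset of the URL-based features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Split the URL into words on anything that is not a letter or digit.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url.lower()) if w]
    labels = host.split(".") if host else []
    return {
        "is_https": parsed.scheme == "https",
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "char_count": len(url),
        "digit_count": sum(c.isdigit() for c in url),
        "special_char_count": sum(not c.isalnum() for c in url),
        "keyword_count": sum(w in KEYWORDS for w in words),
        "brand_count": sum(w in BRANDS for w in words),
        "path_length": len(parsed.path),
        "common_tld": bool(labels) and labels[-1] in COMMON_TLDS,
        # Host labels left of the registered domain, assuming a one-label TLD.
        "subdomain_count": max(len(labels) - 2, 0),
    }


features = url_features(
    "http://secure-login.facebook.example.com/account/update?id=42"
)
```

For this example URL the brand name appears as a subdomain of an unrelated domain, which is the kind of pattern the brand name count and subdomain features are meant to expose.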

Content based features:


These features are derived from the source code of the page that the user wants to access.
The features are:
1) Word count: Total number of text words present on the web page.
2) Average word length: Average length of the text words present on the web page.
3) Link count: Total number of links present in the web page.
4) Iframe tag count: Total number of iframe tags present in the web page.
5) Embed tag count: Total number of embed tags present in the web page.
6) Common phishing word count: Number of words such as pay, bonus, free, access, log, etc.
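
As a rough illustration of how the content-based features above could be extracted, the sketch below uses the standard-library `html.parser` module. The class name `ContentFeatureParser`, the function `content_features`, and the sample page are hypothetical, and the phishing word list only contains the examples named in the text. Text inside `<script>` and `<style>` elements is skipped, which is an assumption about what counts as a text word.

```python
import re
from html.parser import HTMLParser

# Illustrative word list, taken from the examples given in the text.
PHISHING_WORDS = {"pay", "bonus", "free", "access", "log"}


class ContentFeatureParser(HTMLParser):
    """Collects tag counts and visible text words from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.iframes = 0
        self.embeds = 0
        self.words = []
        self._hidden_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.links += 1
        elif tag == "iframe":
            self.iframes += 1
        elif tag == "embed":
            self.embeds += 1
        elif tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.words.extend(re.findall(r"[A-Za-z]+", data.lower()))


def content_features(html):
    """Return the content-based features for one page's HTML source."""
    parser = ContentFeatureParser()
    parser.feed(html)
    parser.close()
    n = len(parser.words)
    return {
        "word_count": n,
        "avg_word_length": sum(map(len, parser.words)) / n if n else 0.0,
        "link_count": parser.links,
        "iframe_count": parser.iframes,
        "embed_count": parser.embeds,
        "phishing_word_count": sum(w in PHISHING_WORDS for w in parser.words),
    }


page = """<html><body>
<h1>Free bonus</h1>
<p>Log in to access your account and pay now.</p>
<a href="http://evil.example/login">Login</a>
<a href="/help">Help</a>
<iframe src="http://evil.example/frame"></iframe>
<script>var x = "free";</script>
</body></html>"""
content = content_features(page)
```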

Data Pre-processing:
Raw data is transformed into usable formats during data pre-processing. Decomposers can be used
to split URLs and extract the parts needed to obtain attributes such as brand name, protocol, etc.
The most well-known and frequently used brand names and keywords are gathered and checked for
their presence in URLs. To extract URL-based features, the URL visited by the user is split into
words using special characters. After that, brand name and keyword checks are performed on the
obtained words. If a split word is found in neither dictionary, it is sent to a word decomposer,
which can split two adjacent words in a string into two separate words. The word decomposer first
creates substrings of the input string. A dictionary check is then made on the obtained substrings
to identify the words present. If the string cannot be separated, the word's similarity to the
available brand names and keywords is examined. If there is still no similarity, it is treated as
a random word. Depending on the status of the word under review, the appropriate features are
incremented. For content-based features, web crawling can be done to obtain the feature values.
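
The word-handling steps above (dictionary check, decomposition, similarity check, random word) can be sketched as below. This is a simplified illustration under several assumptions: the dictionaries are tiny placeholders, the decomposer only tries to split a string into exactly two known words, and similarity is measured with `difflib.SequenceMatcher` against a threshold of 0.8. None of these choices are stated in the paper.

```python
from difflib import SequenceMatcher

# Placeholder dictionaries (assumed for illustration).
KEYWORDS = {"login", "secure", "gift"}
BRANDS = {"facebook", "gmail", "instagram"}
VOCAB = KEYWORDS | BRANDS


def decompose(word):
    """Split `word` into two adjacent dictionary words, or return None."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in VOCAB and right in VOCAB:
            return [left, right]
    return None


def closest_known_word(word, threshold=0.8):
    """Return the most similar dictionary word, or None if below threshold."""
    scored = [
        (SequenceMatcher(None, word, known).ratio(), known)
        for known in sorted(VOCAB)
    ]
    score, known = max(scored)
    return known if score >= threshold else None


def classify_word(word):
    """Return (status, words) for one word obtained from a URL."""
    word = word.lower()
    if word in BRANDS:
        return ("brand", [word])
    if word in KEYWORDS:
        return ("keyword", [word])
    parts = decompose(word)
    if parts:
        return ("decomposed", parts)
    match = closest_known_word(word)
    if match:
        return ("lookalike", [match])
    return ("random", [word])
```

For example, `classify_word("facebooklogin")` is split into the two dictionary words, `classify_word("facebo0k")` is reported as a look-alike of "facebook", and `classify_word("xqzvt")` falls through to the random-word case. The returned status is what would decide which feature counter to increment.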

Classifiers:
With the dataset created, multiple classifiers were trained. The Random Forest algorithm was the
most accurate. Random forests are an ensemble learning method for classification, regression,
and other tasks that operates by constructing a multitude of decision trees at training time. The
trained model will be deployed using cloud services in the proposed system.

Classifier                 Accuracy (%)

Naive Bayes                91.9832
Support Vector Machine     94.5901
Neural Network             96.3394
Random Forest              97.3659
K-Nearest Neighbor         97.1384


PERFORMANCE EVALUATION METRICS
To evaluate the efficiency of a system, we use certain parameters. For each machine learning
model, we calculate the Accuracy, Precision, Recall, F1 Score and ROC curve to determine its
performance. Each of these metrics is calculated based on True Positive (TP), True Negative (TN),
False Positive (FP) and False Negative (FN). In the case of URL classification, True Positive (TP)
is the number of phishing URLs that are correctly classified as phishing. True Negative (TN) is
the number of legitimate URLs that are correctly classified as legitimate. False Positive (FP) is the
number of legitimate URLs that are classified as phishing. False Negative (FN) is the number of
phishing URLs that are classified as legitimate. These values are summarized in a table called the
confusion matrix.

                     Predicted Phishing    Predicted Legitimate

Actual Phishing      TP                    FN
Actual Legitimate    FP                    TN
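
The four cells of the confusion matrix can be counted directly from the true and predicted labels. The sketch below assumes the labeling convention of 1 for phishing and 0 for legitimate; the function name `confusion_counts` and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) with phishing (label 1) as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # phishing correctly classified as phishing
        elif actual == 1 and predicted == 0:
            fn += 1  # phishing classified as legitimate
        elif actual == 0 and predicted == 1:
            fp += 1  # legitimate classified as phishing
        else:
            tn += 1  # legitimate correctly classified as legitimate
    return tp, fn, fp, tn


# Hypothetical labels for eight URLs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
```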

Precision is the number of URLs that are actually phishing out of all the URLs predicted as
phishing. It measures the classifier’s exactness. The formula to calculate precision is given by
Equation (1) below.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

Recall is the number of URLs that the classifier identified as phishing out of all the URLs that are
actually phishing. It is also called sensitivity or true positive rate. It is an important measure and
should be as high as possible. The formula to calculate recall is given by Equation (2) below.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

F1-Score is the harmonic mean of precision and recall. It is used to measure precision and recall
at the same time. The formula to calculate F1-Score is given by Equation (3) below.

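$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$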

Accuracy is the number of instances that were correctly classified out of all the instances in the
test data. The formula to calculate accuracy is given by Equation (4) below.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$
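
As a worked example of the definitions above, the sketch below computes precision, recall, F1-Score and accuracy from the four confusion matrix counts. The function name `classification_metrics` and the counts are made up for illustration and are not results from this work.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute precision, recall, F1-Score and accuracy from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts: 100 phishing URLs and 900 legitimate URLs.
metrics = classification_metrics(tp=90, fn=10, fp=30, tn=870)
```

With these counts, precision is 90/120 = 0.75, recall is 90/100 = 0.90, F1-Score is about 0.818 and accuracy is 960/1000 = 0.96. The example shows that accuracy alone can look high on imbalanced data even when a quarter of the URLs flagged as phishing are actually legitimate.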

The Receiver Operating Characteristic (ROC) curve is an important evaluation metric for binary
classification models. The curve is plotted with the True Positive Rate (TPR) on the y-axis and the
False Positive Rate (FPR) on the x-axis. The Area Under the ROC curve shows how well a classifier
is able to distinguish between phishing and legitimate URLs. The formulas to calculate FPR and TPR
are given by Equation (5) and Equation (6) respectively.

$$\text{FPR} = \frac{FP}{FP + TN} \tag{5}$$

$$\text{TPR} = \frac{TP}{TP + FN} \tag{6}$$
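
To illustrate how an ROC curve is obtained, the sketch below sweeps the decision threshold over a classifier's scores, records the (FPR, TPR) point at each threshold, and estimates the area under the curve with the trapezoidal rule. The function names `roc_points` and `area_under_curve` and the example scores are hypothetical, and the code assumes that both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    # Rank examples from the highest to the lowest phishing score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Examples with equal scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def area_under_curve(points):
    """Trapezoidal-rule area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores: higher means "more likely phishing".
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
auc = area_under_curve(points)
```

In this example one legitimate URL is scored above one phishing URL, so 8 of the 9 phishing/legitimate pairs are ranked correctly and the area under the curve is 8/9.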

CONCLUSION
In our project we used FastAPI, a Python web framework, and imported several libraries for
different purposes. We used two algorithms, Logistic Regression and Multinomial NB. Logistic
Regression predicts whether a link is good or bad, and Multinomial NB works well with natural
language processing (NLP) data. We approached the classification problem using CountVectorizer and
a tokenizer. We also produced some visualizations, which can show the hidden links in a phishing
site that redirect to another server. We also used a network library, which provides data
structures, dynamics, functions and more.
We combined three datasets collected from several sites into one data frame. The usability score
of this dataset is 10.0, which is very good. The data size is approximately 30 MB, and the data
contains more than 5 lakh (500,000) unique entries. The label column is the prediction column,
with two categories: good and bad. We then checked the target column for class imbalance.
Next, we converted the URLs into vector form. We used a regular expression tokenizer, which splits
a string using a regular expression. In our code we extract only the alphabetic strings, because
the numbers, dots, slashes, etc. that some URLs contain are not important for our data. We applied
this transformation to all rows. After converting the URLs into words, we used the Snowball
stemmer from NLTK (Natural Language Toolkit) to stem the words. It reduces English words to their
root words, meaning that related forms of a word (for example, “picture” and “pictures”) are
mapped to one root. The stemmed word lists are stored for every row, and the words of each list
are then joined into a single sentence. We also used a word cloud to display the most frequently
repeated words.
We then used the Chrome web driver, which opens a new Chrome window to which we pass the link.
Using Beautiful Soup, we gather all the HTML code from the page source and extract all the anchor
tags. This gives us all the hidden links that an attacker can use to redirect users to their
server, and we create a data frame of these links. The data frame holds two links per row: the
link we passed in and the link obtained from it.
Finally, we created a Logistic Regression object and fitted it with trainX and trainY. We then
checked the score, which was very good at 90.96. We created the confusion matrix to compare the
actual and predicted labels. Using Logistic Regression we created a pipeline, saved the pipeline
model using pickle, and checked its accuracy, which was very good.
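
The tokenization and vectorization steps described above can be illustrated with a standard-library stand-in. The sketch below is not the project's code: it imitates a regular expression tokenizer that keeps only alphabetic strings and a count vectorizer that turns each URL into a vector of word counts. The function names and example URLs are hypothetical, tokens are lowercased as an assumption, and the stemming step is omitted because it relies on NLTK.

```python
import re
from collections import Counter


def tokenize(url):
    """Keep only runs of letters, dropping digits, dots, slashes, etc."""
    return re.findall(r"[A-Za-z]+", url.lower())


def fit_vocabulary(token_lists):
    """Map every distinct token to a column index (alphabetical order)."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    return {token: index for index, token in enumerate(vocabulary)}


def count_vector(tokens, vocabulary):
    """Return the word-count vector of one tokenized URL."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]


# Hypothetical example URLs.
urls = [
    "www.example.com/login/login.php?id=7",
    "secure-example.net/pay",
]
token_lists = [tokenize(url) for url in urls]
vocabulary = fit_vocabulary(token_lists)
vectors = [count_vector(tokens, vocabulary) for tokens in token_lists]
```

Each row of `vectors` has one column per vocabulary word, and such count vectors are the kind of input that a Logistic Regression or Multinomial NB classifier is trained on.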

