Sudeep Seminar Final

A seminar report on
“Web Crawling based Phishing attack Detection”
Submitted in partial fulfilment of the requirements for the award of the Degree of
BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
Sudeep Rao 4NM18IS121
Under the Guidance of
Mr. Abhishek S Rao

Assistant Professor GD-II
Department of Information Science and Engineering
NMAM Institute of Technology, Nitte 2021– 2022

CERTIFICATE
This is to certify that Sudeep Rao, 4NM18IS121, a bonafide student of NMAM Institute of Technology, Nitte,
has submitted the seminar report for the seminar entitled "Web Crawling based Phishing attack Detection"
in partial fulfilment of the requirements for the award of Bachelor of Engineering in Information Science and
Engineering during the year 2021-22. It is verified that all corrections / suggestions indicated for internal
assessment have been incorporated in the report deposited in the departmental library. The technical seminar
report has been approved as it satisfies the academic requirements in respect of seminar work prescribed for the
Bachelor of Engineering degree.
Signature of the Guide: Mr. Abhishek S Rao
Signature of the Seminar Mentor: Dr. Ashwini B & Dr. Manjula Gururaj Rao
Signature of the HOD: Dr. Karthik Pai B H
DECLARATION
I hereby declare that the entire work embodied in this Seminar report titled “Web Crawling based Phishing
attack Detection” has been carried out by me at NMAM Institute of Technology, Nitte under the supervision
of Dr. Ashwini B and Dr. Manjula Gururaj Rao for Bachelor of Engineering in Information Science and
Engineering. This report has not been submitted to this or any other University for the award of any other
degree.
SUDEEP RAO
4NM18IS121
Department of ISE
NMAMIT
Nitte
ACKNOWLEDGEMENT
Any achievement, be it scholastic or otherwise does not depend solely on the individual efforts but on the
guidance, encouragement and cooperation of intellectuals, elders and friends. A number of personalities, in
their own capacities have helped me in carrying out this project work. I would like to take this opportunity to
thank them all.
First and foremost I would like to thank Dr. Niranjan N Chiplunkar, Principal, NMAMIT, Nitte, for his
moral support towards completing my seminar work.
I would like to thank Dr. Karthik Pai B. H, Head of the Department, Information Science & Engineering,
NMAMIT, Nitte, for his valuable suggestions and expert advice.
I also extend my cordial thanks to Seminar Coordinators, Dr. Ashwini B and Dr. Manjula Gururaj Rao for
their support and guidance.
I deeply express my sincere gratitude to my guide Mr. Abhishek S Rao, Assistant Professor Gd-II,
Department of ISE, NMAMIT, Nitte, for his able guidance, regular source of encouragement and assistance
throughout this seminar work.
I thank my Parents, and all the Faculty members of the Department of Information Science & Engineering for
their constant support and encouragement.
Last, but not the least, I would like to thank my peers and friends who provided me with valuable suggestions
to improve my seminar work.
ABSTRACT
Phishing is the practice of sending fraudulent communications that appear to come from a reputable
source. The goal is to steal sensitive data such as credit card and login information, or to install malware on the
victim’s machine. Sometimes attackers are satisfied with obtaining a victim’s credit card information or other
personal data for financial gain. At other times, phishing emails are sent to obtain employee login information or
other details for use in an advanced attack against a specific company. Cybercrime attacks such as advanced
persistent threats (APTs) and ransomware often start with phishing. Hardware-based anti-phishing approaches
are currently in wide use, but because of their cost and operational overhead, software-based approaches are
preferred. Existing phishing detection approaches fail to address problems such as zero-day
phishing website attacks. The proposed methodology, web crawling based phishing attack detection (WC-PAD),
is presented as an efficient way of tackling this kind of cyber threat.
TABLE OF CONTENTS
Acknowledgement
Abstract
Table of Contents
List of Figures
List of Tables
CHAPTERS
1. INTRODUCTION
2. LITERATURE SURVEY
3. METHODOLOGY
4. IMPLEMENTATION
4.1 DNS Blacklist
4.2 Web Crawler
4.3 Heuristic Analysis
5. RESULTS AND DISCUSSION
6. CONCLUSION
7. REFERENCES
LIST OF FIGURES
Figure 1. Architecture of proposed WC-PAD
Figure 2. Web crawler for web indexing
Figure 3. Phishing and legitimate website identification with 70:30 ratio
LIST OF TABLES
Table 1. Heuristics weight allocation
Table 2. Performance of WC-PAD
Chapter 1
INTRODUCTION
A cyber attack is an assault launched by cybercriminals using one or more computers against one or
more computers or networks. A cyber attack can maliciously disable computers, steal data, or use a
breached computer as a launch point for other attacks. Cybercriminals use a variety of methods to launch a
cyber attack, including malware, phishing, ransomware and denial of service, among others.
India recorded 50,035 cases of cybercrime in 2020, an 11.8% surge in such offences over the previous
year. Cybercrime was projected to cost the world $6 trillion by the end of 2021, and this figure is projected to
climb to $10.5 trillion by 2025. In 2020, 43% of C-suite business leaders who reported a data breach cited
human error as the second major cause, and the average cost of such data breaches was $3.33 million. The
leading cause of a data breach was deliberate theft or sabotage by external vendors. Whatever the source, it
took an average of 239 days to identify and contain such breaches.
Malware, especially ransomware, is an increasingly serious problem for organizations. In the first three
quarters of 2020, ransomware was involved in 21% of reported breaches, contributing to the exposure of
11.2% of unknown data types and 10.4% of known data types. In 2021, the ransomware industry is estimated
to be worth $14 billion.
Given the increase in remote work because of technology and the pandemic, cybersecurity breaches are on the
rise in 2021. One of the most prevalent and dangerous types of cybersecurity threat is the spear phishing
attack. Phishing attacks can affect anyone, infiltrate a business of any size and wreak havoc on a company’s
network. To stay on top of these attacks, it helps to keep the 2021 phishing attack statistics in mind.
A phishing attack is a type of social engineering attack that largely targets email accounts. Bad
actors capitalize on popular products, beliefs and ongoing trends to pull off a sophisticated phishing
campaign. The attack usually reaches internet users in the form of an email that asks the recipient to click a
link to confirm an account, fix an error on a common account, or log in to a site using their credentials.
Phishing attacks are dangerous because they look and operate much like the ordinary emails sent out by
legitimate businesses.
Department of Information Science and Engineering, NMAMIT 1

Chapter 2
LITERATURE SURVEY
Phishing is a major threat to every sector on the internet. Because many companies are establishing
their businesses online and even ordinary people’s data is available on the internet, all of them can become
victims of phishing attacks. The available statistics show that the number of phishing attacks is increasing
and that people on the internet are losing a great deal of money and sensitive data to attackers. Measures to
prevent such activities should therefore be developed. This survey of phishing attack detection collects data
and statistics on the subject from various researchers.
A study by L. Wenyin and G. Huang [1]: In today’s cyber era, Internet of Things (IoT) based products are
increasingly adopted by users for various purposes. Traditionally, Phishing attacks were targeted toward
banking and financial systems. With the rise in usage of IoT, the attack surface increases. Along with IoT
specific attacks, attackers are targeting users with Phishing to steal passwords in order to gain access to IoT
devices like security cameras.
A study by the Anti-Phishing Working Group (APWG) [2]: APWG saw 260,642 phishing attacks in July 2021,
the highest monthly total in APWG’s reporting history. The number of phishing attacks has doubled since
early 2020. The software-as-a-service and webmail sector was the most frequently victimized by phishing in
the third quarter, with 29.1% of all attacks. Attacks against financial institutions and payment providers
continued to be numerous, and were a combined 34.9% of all attacks. Phishing against cryptocurrency targets
(cryptocurrency exchanges and wallet providers) settled at 5.6% of attacks. The number of brands being
attacked has risen during 2021, from just over 400 per month to more than 700 in September.
A study by R. Dhamija and J.D Tygar [3]: Phishing pages from PhishTank and OpenPhish were used, along with
1,918 valid pages from the Alexa database and from various online payment and banking services. To detect
phishing activities, Rajab suggests utilizing Correlation Feature Set (CFS) and Information Gain (IG) to
choose the most influential features [17]. The findings on the three UCI repositories, with 30 features
provided for 11,055 samples, show that IG and CFS selected 11 and 9 features, respectively. A data mining
method called RIPPER is used to evaluate the classification performance using the selected features.

An article on phishing attacks [4]: IBM X-Force’s 2021 Threat Intelligence Index found that phishing led
to 33% of the cyber attacks organizations had to deal with. Phishing, an online threat that emerged in the
mid-1990s, continues to be a top cybercrime practice that impacts brands and companies and is a prolific
initial compromise vector in nation-state attacks. Looking at phishing kits at the code level, IBM researchers
have analyzed over 40,000 phishing kits and deconstructed them to their basic elements. Phishing itself does
not merit much more than that, since it is a very short-lived form of online threat, typically lasting an
average of 21 hours from launch to takedown.
A study “Common phishing attacks” [5]: Dropbox phishing is a type of phishing in which attackers want to
access the files of Dropbox users, so they create a fake Dropbox sign-in page, which can be hosted on
Dropbox itself, and steal the users’ credentials. Google Docs phishing is similar: the attackers aim to
access Google Drive and its documents. Such an attack happened in 2015, when Google not only hosted the fake
login page but also provided an SSL certificate, so the page was served over a secure connection.
A study by C. Pham, E. Huh and C. S. Hong [7]: Current hardware-based approaches deploy an anti-phishing
gateway in the network, which provides an additional layer of defense against phishing attacks. However,
such hardware devices are expensive and inefficient in operation due to the diversity of phishing attacks. With
the promising virtualization technologies of fog networks, an anti-phishing gateway can be implemented as
software at the edge of the network, with robust machine learning techniques embedded for phishing detection.
The approach uses uniform resource locator features and Web traffic features to detect phishing websites,
based on a purpose-designed neuro-fuzzy framework (dubbed Fi-NFN).
A study by C. N. Gutierrez et al. [8]: In this paper, the structure of emails is analyzed. Then, based on an
improved recurrent convolutional neural networks (RCNN) model with multilevel vectors and an attention
mechanism, a new phishing email detection model named THEMIS is proposed, which models emails at the
email header, the email body, the character level, and the word level simultaneously. To evaluate
the effectiveness of THEMIS, an unbalanced dataset with realistic ratios of phishing and legitimate emails
is used. The experimental results show that the overall accuracy of THEMIS reaches 99.848%, while the
false positive rate (FPR) is 0.043%. High accuracy and low FPR ensure that the filter identifies phishing
emails with high probability and filters out as few legitimate emails as possible.

Chapter 3
PROPOSED METHODOLOGY
Web Crawler based Phishing Attack Detector (WC-PAD) is a three-phase phishing attack detection approach.
The three phases of WC-PAD are 1) a DNS blacklist, 2) a web crawler based approach and 3) a heuristic based
approach. The web crawlers are used for both feature extraction and phishing attack detection. Figure 1
shows the overall architecture of the proposed WC-PAD.
Fig 1. Architecture of proposed WC-PAD
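The three-phase flow can be sketched as a short pipeline in which each phase either raises an alert or hands the URL on to the next phase. This is a minimal illustration in Python (not the Java used for WC-PAD itself); the function names and the toy checks inside each phase are assumptions made for the example and are not taken from the WC-PAD implementation.

```python
from typing import Callable, List, Optional, Tuple

# A phase takes a URL and returns an alert message, or None when it finds
# nothing suspicious. All phase functions below are hypothetical stand-ins.
Phase = Callable[[str], Optional[str]]


def run_wc_pad(url: str, phases: List[Tuple[str, Phase]]) -> str:
    """Run the phases in order and stop at the first one that raises an alert."""
    for name, phase in phases:
        alert = phase(url)
        if alert is not None:
            return f"ALERT ({name}): {alert}"
    return "no alert"


def dns_blacklist_phase(url: str) -> Optional[str]:
    # Toy check: a real phase would look the host up in a DNS blacklist.
    return "host is blacklisted" if "bad.example" in url else None


def crawler_phase(url: str) -> Optional[str]:
    # Toy check: a real phase would crawl the site and verify its links.
    return None


def heuristic_phase(url: str) -> Optional[str]:
    # Toy check: a real phase would analyse URL, content and traffic features.
    return "suspicious URL features" if "@" in url else None


PHASES = [
    ("DNS blacklist", dns_blacklist_phase),
    ("web crawler", crawler_phase),
    ("heuristic analysis", heuristic_phase),
]
```

Keeping the phases in an ordered list mirrors the architecture: the cheaper blacklist check runs first, and the later phases run only when the earlier ones find nothing.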

Chapter 4
IMPLEMENTATION
4.1 DNS BLACKLIST

The Domain Name System based blacklist, also known as a DNS Blacklist, is a way of publishing a list of
IP addresses to be avoided, in a form that programs can easily query over the internet. It is built on top of
the Internet DNS. The DNS Blacklist lists IP addresses that are involved in spam activity, and it is
updated frequently. In its first phase, WC-PAD extracts the web address and compares it with the addresses in
the DNS blacklist. If a matching IP address is found, an immediate alert is sent to the user of the webpage;
otherwise WC-PAD proceeds to its second phase.
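A DNS blacklist is conventionally queried by reversing the octets of the IPv4 address, appending the blacklist's zone name and performing an ordinary DNS lookup: an answer means the address is listed. The sketch below shows that convention in Python. The zone name dnsbl.example.org is a placeholder, and whether WC-PAD queries its blacklist in exactly this way is an assumption.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name: IPv4 octets reversed, followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_blacklisted(ip: str, zone: str = "dnsbl.example.org") -> bool:
    """Return True if the blacklist zone answers for this IP address.

    The default zone is a placeholder, not a real blacklist.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True  # any answer means the address is listed
    except socket.gaierror:
        return False  # no such name (or a failed lookup) is treated as not listed
```

For example, checking 192.0.2.99 against the placeholder zone results in a lookup of 99.2.0.192.dnsbl.example.org.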
4.2 WEB CRAWLER

Figure 2 shows the working mechanism of web crawlers.
Fig 2. Web crawler for web indexing

WC-PAD starts by crawling the website's interconnected pages and links. It crawls from one link to the
next, going through all the links until all the web indexes have been crawled. The proposed WC-PAD uses web
crawlers to crawl each web page of a website because attackers do not index all the web links in a phished
website. Crawlers are also used to extract some features from the website. The proposed WC-PAD thus
identifies faults in web indexing: if an unmatched web index is found, or any of the web pages or
links does not work, WC-PAD alerts the user. The experimental analysis indicates that web crawlers are very
effective at detecting zero-day phishing attacks.
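The crawling step can be sketched as a breadth-first traversal that follows every link on the same host and records the links that fail to load. The Python sketch below uses only the standard library and takes the page fetcher as a parameter, so it can be tried on an in-memory site. The function names are illustrative, and the actual WC-PAD crawler (built with Selenium) may work differently.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List, Optional, Set, Tuple
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start: str, fetch: Callable[[str], Optional[str]]) -> Tuple[Set[str], Set[str]]:
    """Breadth-first crawl of the pages on the start URL's host.

    fetch(url) returns the page HTML, or None when the page cannot be loaded.
    Returns (pages that loaded, links that did not load).
    """
    host = urlparse(start).netloc
    seen, ok, broken = {start}, set(), set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            broken.add(url)
            continue
        ok.add(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return ok, broken
```

A non-empty set of broken links corresponds to the "fault in web indexing" on which WC-PAD alerts the user.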
4.3 HEURISTIC ANALYSIS

Heuristic analysis takes three features: the web content feature, the URL feature and the web traffic feature.
All of these features are extracted by the web crawler. Algorithm 1 gives an overview of heuristic analysis.
Each of the three features has its own analysis phase, as follows.
URL Analysis
WC-PAD is designed to extract information from a URL and from all the URLs interconnected with it. A
URL is partitioned as follows: <protocol>://<SubDomain>.<PrimaryDomain>.<TLD>/<PathDomain>. For example,
consider the following URL: http://paypal.abc.net/index.htm
There are six elements in this URL: http is the protocol, paypal is the SubDomain, abc is the PrimaryDomain,
net is the top-level domain (TLD), abc.net is the Domain, and index.htm is the PathDomain.
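The partitioning above can be reproduced with Python's standard URL parser. This is a sketch under a simplifying assumption: the TLD is taken to be the last label of the host name, so multi-label suffixes such as co.uk are not handled. The function name partition_url is illustrative.

```python
from urllib.parse import urlparse


def partition_url(url: str) -> dict:
    """Split a URL into the six elements named in the text.

    Assumes a single-label TLD (for example "net"); suffixes such as
    "co.uk" would need a public-suffix list and are not handled here.
    """
    parts = urlparse(url)
    labels = (parts.hostname or "").split(".")
    tld = labels[-1] if len(labels) >= 2 else ""
    primary = labels[-2] if len(labels) >= 2 else labels[0]
    return {
        "protocol": parts.scheme,
        "SubDomain": ".".join(labels[:-2]),
        "PrimaryDomain": primary,
        "TLD": tld,
        "Domain": f"{primary}.{tld}" if tld else primary,
        "PathDomain": parts.path.lstrip("/"),
    }
```

Applied to http://paypal.abc.net/index.htm, it returns the same six elements as the worked example: the brand name paypal appears only as a SubDomain, while the Domain that actually owns the page is abc.net.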
The URL Analyzer checks for occurrences of '@' and '-' in the URL. An '@' in a URL means that everything to
its left can be discarded, and only 59 characters are taken from the right side; legitimate sites also do not
use '-' very often. The URL of a legitimate site does not contain many dots, whereas the URLs of phishing
websites often do, so the URL is also scanned for the number of dots. The URL Analyzer also checks the URL
against dictionary words and reports misspelled words using the Levenshtein distance (ld), which measures
the difference between two strings. If the distance between a word in the URL and a known word is small, the
word is likely a deliberate misspelling and there is a possibility of a phishing attack. The URL
Analyzer also checks whether the given URL contains an IP address and whether that IP address belongs to its
domain. Based on these heuristics, the website is classified as legitimate or phished. WC-PAD not only
extracts the URL of a website but also extracts all the interconnected URLs and crawls them to find out
whether they are valid; the URL Analyzer validates every interconnected link.
Algorithm 2 will explain the process in detail.
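Several of these URL heuristics are simple enough to sketch directly. The Python example below counts '@', '-' and dots, checks whether the host is a raw IP address, and computes the Levenshtein distance between each host label and a list of known names. The known_brands list and the feature names are assumptions made for the example, and the check that an IP address actually belongs to its domain is left out.

```python
import ipaddress
from urllib.parse import urlparse


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute
        prev = curr
    return prev[-1]


def host_is_ip(host: str) -> bool:
    """True when the host part of the URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str, known_brands=("paypal",)) -> dict:
    """Extract the URL features discussed in the text (hypothetical names)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return {
        "has_at": "@" in url,
        "hyphens": url.count("-"),
        "dots": url.count("."),
        "host_is_ip": host_is_ip(host),
        # Smallest distance from any host label to any known brand name.
        "min_brand_distance": min(
            levenshtein(label, brand) for label in labels for brand in known_brands
        ),
    }
```

A distance of 1 or 2 from a well-known name (for instance paypa1 against paypal) is the kind of near-miss spelling the URL Analyzer reports, whereas a distance of 0 is simply the name itself.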

Heuristic analysis
Web content Analysis: The proposed WC-PAD is programmed to crawl through the web page contents and
copyright notices of the website. Based on the crawled pages, the web content analyzer classifies the
contents of a web page as legal or illegal. If the contents are classified as illegal, WC-PAD sends an
alert message to the user about the suspicion.
Web traffic Analyzer: The web traffic analyzer takes parameters such as the total visits to a website, pages
per visit, average visit duration and the bounce rate. Based on these parameters, the website is classified as
a zero-day phishing website or a normal website. The web traffic analyzer also takes the Google PageRank and
Alexa Reputation of the website into account. PageRank [15,16] is the link analysis algorithm that the
Google search engine uses to compute a PageRank value for each page. A phishing website usually has a low
PageRank value, since such websites exist only for a short time. The AlexaReputation [17] value of a website
is calculated from the number of links from other webpages to it, which makes it similar to PageRank: the
AlexaReputation values of phishing websites are much lower than those of legitimate sites. Algorithm
1 will explain the process in detail.
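To illustrate why a short-lived site ends up with a low PageRank value, the sketch below runs the textbook power-iteration form of PageRank on a small link graph. The damping factor of 0.85 is the conventional choice, and both the graph and the function name are assumptions made for the example. WC-PAD itself reads published PageRank and Alexa values rather than computing them.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small baseline share...
        new = {p: (1.0 - damping) / n for p in pages}
        # ...plus a share of the rank of each page that links to it.
        for page, outs in links.items():
            targets = [t for t in outs if t in links] or pages  # dangling page: spread evenly
            share = damping * rank[page] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank
```

In a graph where established pages link to one another but nothing links to a newly created phishing page, that page keeps only the baseline share, which is the low value the web traffic analyzer looks for.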


Chapter 5
RESULTS AND DISCUSSION
The phishing URLs are collected from the website PhishTank and the legal URLs are scraped from legitimate
websites. The dataset is randomly divided in the ratio of 70:30 as legal and phished websites. The proposed
WC-PAD is programmed in Java, with Selenium WebDriver used for crawling through the pages and web links.
WC-PAD first checks whether the IP address appears in the DNS Blacklist. If it does not, WC-PAD moves to the
Web Crawler phase, where both web indexing and feature extraction are done. If WC-PAD finds any fault in web
indexing the user is alerted; otherwise WC-PAD proceeds with the third phase, Heuristic Analysis, where the
web contents are analyzed for copyrights, the URL features and web traffic features are analyzed, and an alert
is generated accordingly. Table 1 shows the allocation of weights for the URL features. The highest weight is
given to the interconnected links.
Figure 3. Phishing and legitimate website identification with 70:30 ratio
Table 1. Heuristics weight allocation

Based on the above weights, the website's state is computed and the website is classified as phished or
non-phished. Table 2 shows the accuracy, sensitivity and specificity of the proposed WC-PAD.
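The weighted classification can be sketched as a sum of the weights of the heuristics that fired, compared against a threshold. The weights, the threshold and the heuristic names in the Python sketch below are hypothetical and are not the values from Table 1; the only property taken from the report is that the interconnected links carry the highest weight.

```python
# Hypothetical weights (in percent) and threshold, chosen for illustration only.
# Interconnected links get the largest weight, as the text states.
WEIGHTS = {
    "interconnected_links_invalid": 30,
    "misspelled_brand": 20,
    "has_at_symbol": 15,
    "ip_in_url": 15,
    "has_hyphen": 10,
    "many_dots": 10,
}
THRESHOLD = 50


def phishing_score(flags: dict) -> int:
    """Sum the weights of every heuristic that fired."""
    return sum(weight for name, weight in WEIGHTS.items() if flags.get(name, False))


def classify(flags: dict) -> str:
    """Label a website from the heuristics that fired on it."""
    return "phished" if phishing_score(flags) >= THRESHOLD else "legitimate"
```

With these illustrative numbers, a site with invalid interconnected links and a misspelled brand name reaches the threshold, while a site whose only oddity is a hyphen in the URL does not.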
Accuracy, sensitivity and specificity are calculated from the numbers of correctly and incorrectly classified
websites.
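For reference, the standard definitions of the three measures in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). That the report uses exactly these forms is an assumption; the Python sketch below simply implements them.

```python
def performance(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: phishing sites flagged as phishing
    fn: phishing sites that were missed
    tn: legitimate sites passed as legitimate
    fp: legitimate sites wrongly flagged
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

As a purely illustrative calculation (these are not the counts behind Table 2), 90 true positives, 95 true negatives, 5 false positives and 10 false negatives give an accuracy of 185/200, a sensitivity of 90/100 and a specificity of 95/100.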
Table 2. Performance of WC-PAD
The proposed WC-PAD detects phishing websites accurately and separates legitimate websites from
phishing websites precisely. WC-PAD achieves a detection accuracy of 98.9%, covering both
phishing attacks and zero-day phishing attacks.

Chapter 6
CONCLUSION
WC-PAD has been proposed for detecting phishing attacks. WC-PAD performs a three-phase identification:
first, DNS Blacklist based detection; second, Web Crawler based detection; and third, Heuristics based
detection. Frequently used phishing IP addresses are easily detected in DNS blacklist testing. Zero-day
phishing website attacks are identified in the Web crawler and heuristic analysis phases. The experimental
analysis of the proposed WC-PAD shows that it detects phishing websites precisely. WC-PAD achieves a
detection accuracy of 98.9%, covering both phishing attacks and zero-day phishing attacks.
In future work, WC-PAD is planned to be optimised for better performance, and it could be implemented
on all search engines to increase efficiency. Further research could look for a way to
implement WC-PAD on smartphones and tablets.

REFERENCES
[1] L. Wenyin, G. Huang, L. Xiaoyue, X. Deng, and Z. Min, “Phishing Web page detection,” in Proc. IEEE 8th
Int. Conf. Document Anal. Recognit., Seoul, South Korea, 2005, pp. 560–564.
[2] Anti-Phishing Working Group. Accessed: Sep. 2018. [Online]. Available: http://www.antiphishing.org
[3] R. Dhamija and J. D. Tygar, “The Battle against Phishing: Dynamic Security Skins,” Proc. Symp. Usable
Privacy and Security, Mobile Marketing Statistics. Mar. 2017, pp. 77-88.
[4] Phishing Attacks. [Online]. Available: https://securityintelligence.com/
[5] https://www.tripwire.com/state-of-security/security-awareness/6-common-phishing-attacks-and-how-to-protect-against-them/
[6] A. Naga Venkata Sunil and A. Sardana, "A PageRank based detection technique for phishing web sites,"
2017 IEEE Symposium on Computers & Informatics (ISCI), Penang, 2017, pp. 58-63.
[7] C. Pham, L. A. T. Nguyen, N. H. Tran, E. Huh and C. S. Hong, "Phishing-Aware: A Neuro-Fuzzy
Approach for Anti-Phishing on Fog Networks," in IEEE Transactions on Network and Service Management,
vol. 15, no. 3, pp. 1076-1089, Sept. 2018.
[8] C. N. Gutierrez et al., "Learning from the Ones that Got Away: Detecting New Forms of Phishing
Attacks," in IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 6, pp. 988-1001, 1 Nov.-
Dec. 2018.
[9] https://www.cloudflare.com/en-in/learning/bots/what-is-a-web-crawler/
[10] Noah Ndakotsu Gana, Shafi’I Muhammad Abdulhamid, "Machine Learning Classification Algorithms for
Phishing Detection: A Comparative Appraisal and Analysis", IEEE Nigeria Computer Chapter
(NigeriaComputConf) 2019 2nd International Conference of the, pp. 1-8, 2019.
