Submitted by
Padiri Lokesh
Abhishek Dhanopia
June 2023
SCHOOL OF ELECTRICAL & ELECTRONICS ENGINEERING
THANJAVUR, TAMIL NADU, INDIA – 613 401
Bonafide Certificate
This is to certify that the thesis titled “Phishing Website Detection using machine
learning” submitted in partial fulfillment of the requirements for the award of degree B.Tech.
Electronics & Instrumentation Engineering to the SASTRA Deemed to be University, is a
bona-fide record of the work done by Mr. Padiri Lokesh (Reg. no. 123006093) and Mr. Abhishek
Dhanopia (Reg. no. 123006904) during the academic year 2022-23, in the School of Electrical &
Electronics Engineering, under my supervision. This thesis has not formed the basis for the
award of any degree, diploma, associateship, fellowship or other similar title to any candidate
of any University.
Date :
Examiner 1 Examiner 2
SCHOOL OF ELECTRICAL & ELECTRONICS ENGINEERING
THANJAVUR - 613 401
Declaration
We declare that the thesis titled “Phishing Website Detection using Machine Learning”
submitted by us is an original work done by us under the guidance of Dr. V.S.BALAJI,
Senior Asst. Professor, School of Electrical and Electronics Engineering, SASTRA
Deemed to be University during the final semester of the academic year 2022-23, in the
School of Electrical and Electronics Engineering. The work is original and wherever we
have used materials from other sources, we have given due credit and cited them in the text of
the thesis. This thesis has not formed the basis for the award of any degree, diploma,
associateship, fellowship or other similar title to any candidate of any University.
Date:
ACKNOWLEDGEMENTS
We would like to thank our Honorable Chancellor Prof. R. Sethuraman for providing us
with an opportunity and the necessary infrastructure for carrying out this project as a part of
our curriculum.
We extend our heartfelt thanks to Dr. K. Thenmozhi, Dean, School of Electrical &
Electronics Engineering and Dr. A. Krishnamoorthy, Associate Dean, Electronics and
Instrumentation Engineering.
Our guide Dr. V.S.Balaji, Senior Asst. Professor, School of Electrical & Electronics
Engineering was the driving force behind this whole idea from the start. His deep insight in
the field and invaluable suggestions helped us in making progress throughout our project
work.
We also thank the project review panel members for their valuable comments and insights
which made this project better.
We would like to extend our gratitude to all the teaching and non-teaching faculties of the
School of Electrical & Electronics Engineering who have either directly or indirectly helped
us in the completion of the project.
We gratefully acknowledge all the contributions and encouragement from our families and
friends resulting in the successful completion of this project. We thank you all for providing
us an opportunity to showcase our skills through the project.
PHISHING WEBSITE DETECTION
ABSTRACT
In the current era, most people are connected to the internet and to various social media
platforms, and keep their personal information stored on mobile phones and computers.
Attackers exploit this loophole: an intruder sends spam mail containing links, and even
educated users click those links, knowingly or unknowingly, which lets phishers steal
their information. We therefore provide phishing website detection using machine learning.
We preprocess the data and train the model on data collected from the way2 messages website
and stored in a database, so that phishing can be predicted easily, whereas conventional
data mining is unable to process such data. We also create a web page where, after logging
in, the user can run the phishing detection model and view the predicted score. Several
types of machine learning algorithms are used, and we determine which algorithm is best
suited for predicting the results.
Specific contribution:
• Collection of data
• Developed the programs for all algorithms
• Created a web interface (GUI) for the project; coding and report making
Specific Learning:
ABSTRACT
Phishing is one of the most common and dangerous attacks among cybercrimes. These
attacks aim to steal the information used by individuals and organizations. The phishing
website appears the same as the legitimate website and directs the user to a page on the
fake website where personal details are to be entered. Phishing websites contain various
hints in their contents and web browser-based information. Previous work in phishing
detection includes machine learning and fuzzy logic approaches. Machine learning
algorithms are capable of handling large datasets efficiently, and the performance of
machine learning-based techniques relies on the types of classifiers and features used. In
the proposed method, machine learning is used to implement a system that can detect
phishing websites. Thirty features of the website are considered.
A Django server is implemented to serve the API calls made by the browser extension
developed for user interaction. The user interacts with the system through the browser
extension: in the browser itself, the user gives the website URL and checks whether the
website is phishing or legitimate.
Specific Contribution:
Specific Learning:
Registration no:123006904
TABLE OF CONTENTS
TITLE
BONAFIDE CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
ABBREVIATION
1. INTRODUCTION
2. LITERATURE SURVEY
3. OBJECTIVE
4. METHODOLOGY
4.1 Implementation
4.2 Features
4.3 Address bar-based features
7 REFERENCES
7.1 Similarity Check Report
LIST OF FIGURES
FIGURE NO TITLE Page no
LIST OF TABLES
Table number    Table name
2.1             Literature survey
4.1             Address bar-based features and their conditions
4.2             Abnormal-based features and their conditions
4.3             Domain-based features and their conditions
ABBREVIATIONS
DNS - Domain Name System
CHAPTER 1
INTRODUCTION
1. INTRODUCTION TO PHISHING
Phishing is defined as the attempt to obtain sensitive information such as usernames, passwords,
and credit card details, often for malicious reasons, by masquerading as a trustworthy entity in an
electronic communication. Trying to get unsuspecting users to give up their money, credentials or
privacy is a particularly insidious form of social engineering that can have disastrous effects on
people's lives.
The word phishing is an evolution of the word fishing, coined by hackers who frequently replace
the letter 'f' with the letters 'ph' in a typed hacker dialect. The word arises from the fact that
users, or phish, are lured by the mimicked communication to a trap or hook that retrieves
their confidential information.
In the last few years, there has been an alarming increase in both the number
and sophistication of phishing attacks. As the definition suggests, phishing is a cross-breed
of social engineering and technical attacks designed to elicit confidential information
from the victim. The collected information is then used for several nefarious deeds including
fraud, identity theft, and corporate espionage. The growing frequency and success of these
attacks have led several researchers and corporations to take the problem seriously. They
have attempted to address it by considering new countermeasures and researching
novel techniques to prevent phishing.
Despite their many varieties, the common denominator of all phishing attacks is their use of
a fraudulent pretense to acquire valuables. Some major categories are shown in
Figure 1.1, and the different types of phishing attacks are described below:
Figure 1.1: Types of Phishing Attacks
Spear Phishing:
Spear phishing is a targeted phishing attack that involves highly customized lure
content. To perform spear phishing, attackers will typically do reconnaissance work,
surveying social media and other information sources about their intended target.
Spear phishing may involve users logging into fake websites and opening documents
by clicking on links that automatically install malware.
Whaling:
Whaling is a form of phishing in which the attack is directed at high-level or senior
executives within specific companies with the direct goal of gaining access to their
credentials and/or bank information. The content of the email may be written as a
legal subpoena, customer complaint, or other executive issue.
Clone phishing:
Clone phishing is a type of phishing attack whereby a legitimate, and previously delivered,
email containing an attachment or link has had its content and recipient addresses taken and
used to create an almost identical or cloned email. The attachment or link within the email is
replaced with a malicious version and then sent from an email address spoofed to appear to
come from the original sender. It may claim to be a resend of the original or an updated
version of the original. Typically this requires either the sender or recipient to have been
previously hacked for the malicious third party to obtain the legitimate email.
Link manipulation
Misspelled URLs or the use of subdomains are common tricks used by phishers.
Even digital certification does not solve this problem because it is quite possible for a phisher
to purchase a valid certificate and subsequently change content to spoof.
Filter evasion
Phishers have sometimes used images instead of text to make it harder for anti-phishing filters to
detect the text commonly used in phishing emails.
Website forgery
Some phishing scams use JavaScript commands to alter the address bar of the website. The
fraudulent website that supports the phishing email is designed to mirror the legitimate
website it is purporting to be. The fraudsters use multiple methods to do this, including using
genuine-looking images and text, disguising the URL in the address bar, or removing the
address bar altogether. The purpose of the website is to trick consumers into thinking they
are at the company's genuine website and giving their personal information to the trusted
company they think they are dealing with.
Covert redirect
Covert redirect is a subtle method to perform phishing attacks that makes links appear
legitimate but actually redirects a victim to an attacker's website. The flaw is usually masqueraded
under a log-in popup based on an affected site's domain.
Normal phishing attempts can be easy to spot because the malicious page's URL will usually
be different from the real site link. For covert redirect, an attacker could use a real website
instead by corrupting the site with a malicious login popup dialogue box.
Social engineering:
Users can be encouraged to click on various kinds of unexpected content for a variety of
technical and social reasons.
For example, a malicious attachment might masquerade as a benign linked Google Doc.
Alternatively, users might be outraged by a fake news story, click a link, and become
infected.
Voice phishing:
Not all phishing attacks require a fake website. Messages that claim to be from a bank
tell users to dial a phone number regarding problems with their bank accounts.
1.3 A TYPICAL PHISHING ATTACK:
Currently, the most common form of phishing attack includes three key components: the
lure, the hook, and the catch. They are described below.
The Lure consists of a phisher spamming a large number of users with an email message
that typically appears, in a convincing way, to be from some legitimate institution that has a
presence on the internet. The message often uses a convincing story to encourage the user
to follow a URL hyperlink encoded in the email to a website controlled by the phisher and to
provide it with certain requested information. The social engineering aspect of the attack
normally makes itself known in the lure, as the spam gives some legitimate-sounding reason
for the user to supply confidential information to the website that is hyperlinked by the spam.
The Hook typically consists of a website that mimics the look and feel of that of a
legitimate target institution. In particular, the site is designed to be as indistinguishable from
the target's as possible. The purpose of the hook is for victims to be directed to it via the lure
portion of the attack and for the victims to disclose confidential information to the site.
Examples of the type of confidential information that is often harvested include usernames,
passwords, social security numbers in the U.S. (or other national ID numbers in other parts
of the world), billing addresses, checking account numbers, and credit card numbers. The
Hook website is generally designed both to convince the victim of its legitimacy and to
encourage the victim to provide confidential information to it with as little suspicion on the
victim's part as possible.
The Catch is the third portion of the phishing attack, which some alternatively call the kill. It
involves the phisher or a cashier making use of the collected information for some nefarious
purpose such as fraud or identity theft.
1.4 EXISTING SYSTEM:
The existing (previous) system relies on manual human intervention, which is of limited
applicability and is error-prone.
Legacy and conventional data mining algorithms cannot deal with huge volumes of
data; they are slower and less accurate.
Machine learning is cutting edge and is being adopted for many diverse
applications because it can deal with large volumes of data, using refined and revised
algorithms and the heavy processing power now available in terms of GP
Architecture:
CHAPTER 2
Literature Survey
CHAPTER 3
OBJECTIVE
The objective is to predict whether a website is phishing or legitimate: given a URL, the
platform returns its status.
CHAPTER-4
METHODOLOGY
4.1 IMPLEMENTATION
To implement the phishing website detection system, a dataset is first collected. The phishing
website list is obtained from PhishTank, which yields 30,647 phishing websites. The legitimate
website list is obtained from the Alexa ranking website, which yields 58,000 legitimate websites.
The other source is the UCI repository, which contains a list of 11,045 websites covering
both legitimate and phishing websites.
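The merging of these sources into one labelled dataset can be sketched as follows. This is a minimal illustration, assuming each source has already been downloaded as a plain list of URLs; the function names `build_labelled_rows` and `rows_to_csv` and the label encoding (1 for phishing, 0 for legitimate) are assumptions made for this example and are not taken from the project code.

```python
import csv
import io


def build_labelled_rows(phishing_urls, legitimate_urls):
    """Merge two URL lists into (url, label) rows, dropping exact duplicates.

    Label encoding is an assumption: 1 = phishing, 0 = legitimate.
    """
    seen = set()
    rows = []
    for label, urls in ((1, phishing_urls), (0, legitimate_urls)):
        for url in urls:
            key = url.strip()  # remove stray whitespace from the raw lists
            if key and key not in seen:
                seen.add(key)
                rows.append((key, label))
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "label"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the label next to the URL from the start makes it straightforward to attach the extracted features to each row later.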
4.2 Features
The dataset is maintained with 30 features per website. The features
considered are classified as follows:
Address bar-based features
Abnormal-based features
HTML and JavaScript-based features
Domain-based features
4.3 Address bar-based features:
The address bar-based features can be retrieved by analyzing the URL of any website. There
are about 12 address bar-based features, which are described in this section. The basic
structure of any URL is in the format below:
protocol://subdomain.domainname.countrycode/directory/filename
The URL is the first thing to analyze when deciding whether a website is phishing or not. URLs
of phishing domains have some distinctive points. Features related to these points are
obtained when the URL is processed. Some of the URL-based features are given below.
Digit count in the URL.
The total length of the URL.
Checking whether the URL is typosquatting or not.
Checking whether it includes a legitimate brand name or not.
Using the IP address: If an IP address is used in place of the domain name, sometimes written
in hexadecimal, the URL is not legitimate. The rule for this is given in the table of conditions below.
Long URL: Attackers use long URLs to hide the suspicious part in the address bar.
Tiny URL: URL shortening is a method on the "World Wide Web" in which a URL may be
made considerably smaller in length and still lead to the required webpage. This is
accomplished by means of an "HTTP Redirect" on a domain name that is short, which links
to the webpage that has a long URL.
URLs having the "@" symbol: Using the "@" symbol in the URL leads the browser to ignore
everything preceding the "@" symbol, and the real address often follows the "@" symbol.
The occurrence of the "//" symbol, a prefix or suffix separated by the (-) symbol, the number
of dots, etc. also come under the address bar-based features.
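The simpler URL-based features listed above (digit count, total length, typosquatting and brand name checks) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the brand list `KNOWN_BRANDS`, the similarity threshold of 0.8, and the function name `lexical_features` are hypothetical choices for this example, not the values used in the project.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical brand list, for illustration only.
KNOWN_BRANDS = ("paypal", "google", "amazon")


def lexical_features(url):
    """Return simple URL-based features for one URL string."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    contains_brand = any(brand in host for brand in KNOWN_BRANDS)
    # Typosquatting heuristic (an assumption): a host label that is very
    # similar to, but not equal to, a known brand name.
    typosquatting = any(
        label != brand and SequenceMatcher(None, label, brand).ratio() >= 0.8
        for label in labels
        for brand in KNOWN_BRANDS
    )
    return {
        "url_length": len(url),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count": host.count("."),
        "contains_brand": contains_brand,
        "typosquatting": typosquatting,
    }
```

For example, `lexical_features("http://paypa1.com/login")` flags the host as a likely typosquat of "paypal", while `https://www.paypal.com/` is reported as containing a brand name without being a typosquat.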
FEATURE                                   CONDITION
Using IP address                          IF domain part has IP address → Phishing
                                          Otherwise → Legitimate
Long URL                                  IF URL length < 54 → Legitimate
                                          Else if URL length → Suspicious
                                          Otherwise → Phishing
Tiny URL                                  IF tiny URL → Phishing
                                          Otherwise → Legitimate
URL having "@" symbol                     IF URL has "@" symbol → Phishing
                                          Otherwise → Legitimate
Redirecting using "//"                    IF position of the last occurrence of "//" in the URL > 7 → Phishing
                                          Otherwise → Legitimate
Adding prefix or suffix separated         IF domain part includes (-) symbol → Phishing
by (-) to the domain                      Otherwise → Legitimate
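The conditions in the table above translate almost directly into code. The sketch below is one possible reading, not the project's implementation. Several details are assumptions: the string labels, the list of shortening services in `SHORTENER_HOSTS`, and the upper limit of the "suspicious" band for long URLs, which the table does not state and which is therefore exposed as the parameter `suspicious_max` with a placeholder default of 75. IP addresses written in hexadecimal are not handled here.

```python
import ipaddress
from urllib.parse import urlparse

LEGITIMATE, SUSPICIOUS, PHISHING = "legitimate", "suspicious", "phishing"

# Hypothetical, non-exhaustive list of URL shortening services.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}


def uses_ip_address(url):
    """Phishing if the host part is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return PHISHING
    except ValueError:
        return LEGITIMATE


def long_url(url, suspicious_max=75):
    """Length rule; suspicious_max is an assumed placeholder value."""
    if len(url) < 54:
        return LEGITIMATE
    if len(url) <= suspicious_max:
        return SUSPICIOUS
    return PHISHING


def tiny_url(url):
    """Phishing if the host is a known shortening service."""
    host = (urlparse(url).hostname or "").lower()
    return PHISHING if host in SHORTENER_HOSTS else LEGITIMATE


def has_at_symbol(url):
    """Phishing if the URL contains the "@" symbol."""
    return PHISHING if "@" in url else LEGITIMATE


def double_slash_redirect(url):
    """Phishing if the last "//" starts beyond index 7, i.e. after the scheme."""
    return PHISHING if url.rfind("//") > 7 else LEGITIMATE


def prefix_suffix(url):
    """Phishing if the domain part contains a hyphen."""
    host = urlparse(url).hostname or ""
    return PHISHING if "-" in host else LEGITIMATE
```

Each function returns one feature value, so a feature vector for a URL is simply the list of these results in a fixed order.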
4.4 Abnormal-based features
Unusual features such as the request URL, links, and submitting
information to email all come under the abnormal-based
features.
FEATURE                                        CONDITION
Request URL                                    IF % of request URL < 22% → Legitimate
                                               % of request URL between 22% and 61% → Suspicious
                                               Otherwise → Phishing
URL of anchor                                  IF % of URL of anchor < 31% → Legitimate
                                               % of URL of anchor between 31% and 67% → Suspicious
                                               Otherwise → Phishing
Links in <meta>, <script> and <link> tags      IF % of links < 17% → Legitimate
                                               % of links between 17% and 18% → Suspicious
                                               Otherwise → Phishing
Server Form Handler (SFH)                      IF SFH is "about:blank" or is empty → Phishing
                                               Otherwise → Legitimate
Submitting information to email                IF "mail()" or "mailto:" is used to submit user
                                               information → Phishing
                                               Otherwise → Legitimate
Abnormal URL                                   IF the hostname is not included in the URL → Phishing
                                               Otherwise → Legitimate
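As an illustration of how a percentage-based rule from the table above might be computed, the sketch below evaluates the Request URL feature on a page's HTML. It is a sketch under assumptions: the set of tags counted as embedded objects, the treatment of values lying exactly on the 22% and 61% boundaries, the choice to return "legitimate" when a page embeds no objects, and the names `request_url_feature` and `_ResourceCollector` are all choices made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _ResourceCollector(HTMLParser):
    """Collect the src attribute of embedded objects (img, script, ...)."""

    TAGS = {"img", "script", "video", "audio", "embed", "iframe", "source"}

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def request_url_feature(page_url, html):
    """Classify a page by the share of objects loaded from another host."""
    collector = _ResourceCollector()
    collector.feed(html)
    if not collector.sources:
        return "legitimate"  # assumption: no embedded objects, nothing external
    page_host = urlparse(page_url).hostname
    external = sum(
        1
        for src in collector.sources
        # Resolve relative paths against the page URL before comparing hosts.
        if urlparse(urljoin(page_url, src)).hostname != page_host
    )
    percent = 100.0 * external / len(collector.sources)
    if percent < 22:
        return "legitimate"
    if percent < 61:
        return "suspicious"
    return "phishing"
```

The URL of anchor and link-tag rules follow the same pattern with different tags and thresholds.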
Page-based features use information about pages that is calculated by reputation
ranking services. Some of these features tell us how reliable a
website is, for example page rank and global rank.
The table below lists the domain-based features and their conditions. To retrieve these for any
website, the application should be online.
FEATURE             CONDITION
Age of domain       IF age of domain > 6 → Legitimate
                    Otherwise → Phishing
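The age-of-domain rule can be sketched as below. The table gives the threshold only as "6" without a unit; this example assumes months, and assumes the domain's creation date has already been obtained (for instance from a WHOIS lookup, which is not shown). The function name `age_of_domain_feature` is hypothetical.

```python
from datetime import date


def age_of_domain_feature(created_on, today, min_age_months=6):
    """Legitimate if the domain is older than min_age_months (assumed unit)."""
    months = (today.year - created_on.year) * 12 + (today.month - created_on.month)
    if today.day < created_on.day:
        months -= 1  # the current month is not yet complete
    return "legitimate" if months > min_age_months else "phishing"
```

Passing `today` explicitly, rather than reading the system clock inside the function, keeps the rule easy to test.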
Input Design:
In an information system, input is the raw data that is processed to produce output. During
the input design, the developers must consider the input devices such as PC, MICR, OMR,
etc.
Therefore, the quality of system input determines the quality of system output. Well-designed
input forms and screens have the following properties:
• They serve a specific purpose effectively, such as storing, recording, and retrieving
the information.
• They ensure proper completion with accuracy.
All these objectives are met using the knowledge of basic design principles
regarding:
• Designing source documents for data capture or devising other data capture methods
• Designing input data records, data entry screens, user interface screens, etc.
Output Design:
The design of output is the most important task of any system. During output design,
developers identify the type of outputs needed, and consider the necessary output controls
and prototype report layouts. The objectives of output design are:
• To develop an output design that serves the intended purpose and eliminates the
production of unwanted output.
• To develop an output design that meets the end user's requirements.
• To form the output in an appropriate format and direct it to the right person.
MODULES:
1. User:
1.1 View Home page:
Here the user views the home page of the phishing website prediction web application.
1.2 View Upload page:
In the about page, users can learn more about the phishing prediction.
1.3 Input Model:
The user must provide input values for certain fields in order to get results.
1.4 View Results:
The user views the results generated by the model.
1.5 View score:
Here the user can view the score in %.
2. System
2.1 Working on dataset:
The system checks whether the data is available and loads the data held in CSV files.
2.2 Pre-processing:
The data needs to be pre-processed according to the models; this helps to increase the accuracy of
the model and gives better information about the data.
2.3 Training the data:
After pre-processing, the data is split into two parts, train and test data, before
training with the given algorithms.
2.4 Model Building
This module helps the user create a model that predicts phishing websites with better
accuracy.
2.5 Generated Score:
Here the user views the score in %.
2.6 Generate Results:
We train the machine learning algorithm and calculate the phishing prediction.
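The system modules above (splitting the data, building a model, and reporting a score in %) can be illustrated end to end with a small standard-library sketch. The classifier here is a deliberately simple 1-nearest-neighbour baseline using Hamming distance; it is swapped in for illustration and is not one of the algorithms evaluated in this project. The assumed row format is a tuple of feature values (for example -1, 0, 1) paired with a label, and all function names are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle (features, label) rows and split them into train and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train_rows, features):
    """Return the label of the training row closest in Hamming distance."""
    def distance(row):
        return sum(a != b for a, b in zip(row[0], features))
    return min(train_rows, key=distance)[1]


def accuracy_percent(train_rows, test_rows):
    """Score in %: share of test rows whose label is predicted correctly."""
    correct = sum(
        predict_1nn(train_rows, features) == label
        for features, label in test_rows
    )
    return 100.0 * correct / len(test_rows)
```

Fixing the shuffle seed makes the split reproducible, so scores from different algorithms can be compared on the same test rows.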
4.1 USE CASE DIAGRAM
4.3 SEQUENCE DIAGRAM
4.4 COLLABORATION DIAGRAM:
CHAPTER-5
Here the user views the home page of the phishing website prediction web application.
5.2 Load:
5.3 View:
5.4 Model:
5.5 GRAPHS:
5.6 Prediction:
This interface shows the detection result, that is, whether the website is a phishing website or
legitimate.
CHAPTER-6
CHAPTER-7
REFERENCES
3. T. Peng, I. Harris, and Y. Sawa, “Detecting Phishing Attacks Using Natural Language
Processing and Machine Learning,” Proc. - 12th IEEE Int. Conf. Semant. Comput.
ICSC 2018, vol. 2018–Janua, pp. 300–301, 2018.
6. K. Shima et al., “Classification of URL bitstreams using bag of bytes,” in 2018 21st
Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN),
2018, vol. 91, pp. 1–5.
9. X. Zhang, Y. Zeng, X. Jin, Z. Yan, and G. Geng, “Boosting the Phishing Detection
Performance by Semantic Analysis,” 2017.
10. L. MacHado and J. Gadge, “Phishing Sites Detection Based on C4.5 Decision Tree
Algorithm,” in 2017 International Conference on Computing, Communication,
Control and Automation, ICCUBEA 2017, 2018, pp. 1–5.
CHAPTER-7
APPENDIX
TITLE
https://drive.google.com/file/d/1wmrxTIRsBvYr9-EdsSYQm5m5_BcxEAN6/view?usp=share_link
7.1 Similarity Check Report