Performance analysis of Data mining, Machine
learning and Fuzzy logic algorithms for detecting
phishing URLs
Aryan Dosajh (2K18/MC/023), Mathematics and Computing, Delhi Technological University, Delhi, India, aryandosajh2 k18mc23@dtu.ac.in
Ashmeet (2K18/MC/025), Mathematics and Computing, Delhi Technological University, Delhi, India, ashmeet2 k18mc025@dtu.ac.in
Anirudh Awasthi (2K18/MC/016), Mathematics and Computing, Delhi Technological University, Delhi, India, anirudhawasthi2 k18mc016@dtu.ac.in
Abstract—Phishing is now one of the leading cyber threats: a victim’s sensitive information, such as usernames, passwords, and payment card details, is obtained by an illegitimate website whose link is generally shared via e-mail. Such sites are usually fakes built by the attacker to resemble a trustworthy site. They look like the official page of a company, such as an e-commerce website, a bank application, or a college portal, but have a slight variation in the URL that users generally miss. They can also be original websites that are not clones of any other site but instead promise the user some kind of reward in exchange for personal details. Our aim in this project is to detect such websites so that we can caution users to proceed to them only if necessary. We have implemented two different approaches to this problem. The first is a machine learning classification approach that includes fuzzy pattern classifiers, using both the top-down and bottom-up algorithms; the second uses data mining, since data mining techniques can be an effective tool for detecting phishing websites.
Index Terms—Fuzzy logic, phishing, data mining, malicious URLs

I. INTRODUCTION
Every day, everyone who uses the internet uses a web application. These applications may be accessed through smartphone apps, on our desktops or laptops, or through a web browser. Consider the following scenario: we are using a web browser to purchase something online. When we search for an item, we come across several websites and like the goods on two of them. One of these websites is a well-known, secure, and reliable e-commerce site, while the other is less secure and less well known yet offers a considerably higher discount on the product. Having decided to pay for the merchandise online, we must enter our credit/debit card information. The next question is: which website would you trust with your personal or sensitive information, such as your credit/debit card information? Obviously, you will be less hesitant when entering your credentials into a trusted e-commerce website, and you will be concerned when submitting your data to a website whose name you are hearing for the first time. This is because you have never used that website before, and you are more likely to enter critical information on a website that you have used previously and trust highly.
Hackers take advantage of this psychology, or to put it another way, this trust factor, and pose as a trustworthy company in order to steal your sensitive data or personal information. This is referred to as phishing. Phishing is an attack on a target’s personal or sensitive information, such as passwords, bank account numbers, and email addresses, carried out by impersonating a credible company. In fishing, as the name implies, a worm or other bug is used as bait to catch fish. In a phishing attack, the bait is the website, which is generally a cloned version that appears to be genuine. We then submit our valuable information and fall prey to the hacker’s deception: we are tricked into entering personal information on that website.

II. TYPES OF PHISHING TECHNIQUES
A. Email Phishing
The majority of phishing attacks are carried out via email. In email phishing, the target is generally unknown. The attacker sends an automated email that looks as if it came from one of our social networking sites. The attacker then gains access to your email account by tricking you into entering your sensitive credentials or data in response. Because your contacts are linked to your email account, the attacker gains access to all of them and sends them emails bearing your ID, which are used to deceive them in turn. This is how the attack propagates.
B. Spear Phishing
It is a more personalized and more targeted form of phishing. Here the target is well known to the attacker: in a typical scenario the hacker knows everything about the target, such as their name, occupation, address, family members, and even their hobbies. The attacker then sends the target a professionally created or cloned email that appears to come from a believable source. The email is tailored to each recipient.

C. Angler Phishing
It is also called social media phishing. In this form of phishing, a hacker sends an email or posts a message on your social media that contains a link, or pretends to be a customer service agent, and lures victims into handing over confidential information.

D. Social Engineering
This sort of phishing is carried out through social contact. It employs psychological techniques to fool users into disclosing security information. The attack proceeds in a series of phases. First, the scammer researches the target’s probable weak points that could be used in the attack. The scammer then attempts to gain the victim’s trust before presenting a situation in which the target provides sensitive information. Baiting, scareware, pretexting, and spear phishing are some social engineering phishing techniques.

E. Link Manipulation
Links are the major focus of phishing. There are various ingenious techniques for modifying a URL so that it appears to be genuine. One approach is to display harmful URLs as named hyperlinks on websites, so that the visible text hides the real destination. Another is to use misspelled URLs that look like valid ones, such as ghoogle.com. IDN spoofing is a type of typosquatting that is much more difficult to detect than the link manipulation methods mentioned above, because the attackers use a character from a non-English script that looks exactly like an English character, such as a Cyrillic “c” or “a” in place of its English counterpart.

III. RELATED WORK
Anti-phishing technology is meant to keep users safe from phishing scams. Many anti-phishing approaches, including software-based strategies, have been presented. Some of these methods are: a) detection by blacklist, b) detection by visual similarity, and c) detection by a heuristic-based approach. For an effective machine-learning-based phishing detection approach, the overall test results in [11] show that, when combined with an SVM classifier, the suggested approach performs best, successfully discriminating 95.66 percent of phishing and legitimate websites while using just 22.5 percent of the original features.
Many machine learning models have been used to detect phishing websites. We have compared the following studies and the machine learning methods they use, in order to obtain a comparative view of which method is more effective and which makes more sense to use for distinguishing phishing websites from original, legitimate ones [?], [6]. Since it is feasible to rely on an accurate clustering algorithm that can handle a large number of samples in a reasonable length of time, a novel approach based on fuzzy rough set theory, fuzzy rough set–Web robot detection (FRS-WRD), is suggested in [28] to better classify and cluster the Web visitors of three real-world Web sites. In [21], the author used the URL to detect phishing sites automatically by extracting different terms of the URL and verifying them through a search engine.
In [12], a real-time anti-phishing system based on seven distinct classification algorithms and natural language processing (NLP) features is proposed. The method differs from prior research in the following ways: language independence, use of a large amount of phishing and legitimate data, real-time execution, detection of new websites, independence from third-party services, and use of feature-rich classifiers. A new dataset was created to measure the system’s performance, and the experiments were run on it. According to the experimental and comparative results for the implemented classification algorithms, the Random Forest method with only NLP-based features performs best, with an accuracy of more than 95 percent in detecting phishing URLs.
Fuzzy pattern tree induction was recently introduced as a novel machine learning approach to classification. A pattern tree is a hierarchical, tree-like structure whose inner nodes are labelled with generalised (fuzzy) logical operators and whose leaf nodes are labelled with fuzzy predicates on input attributes. A pattern-tree classifier is made up of a collection of these pattern trees, one for each class label. This sort of classifier is interesting for a number of reasons. The learning method in [1] switches the direction of pattern tree construction from bottom-up to top-down. In addition, it presents a new termination criterion that is better suited to the learning problem at hand [1].

IV. PROPOSED METHODOLOGY
We have discussed what phishing is, the types of phishing, and the related anti-phishing approaches that are generally used in system software and antivirus products. There is one more technique, data mining, which not only checks the full set of URL features defined by the heuristic design but also works on very large datasets, handling the classes on which heuristics fail and correcting the verdict. Please refer to Fig. 1 to understand the logical flow. The techniques we have used are as follows.

A. Dataset Description
One of the most difficult aspects of our research was the paucity of phishing datasets. Although several scientific publications on phishing detection have been published, most of them have not released the dataset used in their research. We have used two different datasets to implement our code and algorithms. The first dataset has been sourced from the UCI repository. It contains 25 features which can be classified into 4 types, namely,
• Status bar Customization
• Disabling right click
• Using pop-up window
• IFrame Redirection
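The four items listed above are page-level signals that can be read from a page's HTML and JavaScript source. As an illustration only, the following minimal Python sketch shows how such binary features might be derived from raw page source. The function name `extract_page_features`, the feature keys, and every regular expression are hypothetical assumptions made for this example; they are not the rules used to build the UCI dataset, nor the implementation used in this work.

```python
import re

# Hypothetical regex heuristics: all patterns below are illustrative
# assumptions, not the rules behind the UCI dataset or the authors' code.
PATTERNS = {
    # onmouseover handler that rewrites the browser status bar text
    "status_bar_customization": re.compile(
        r"""onmouseover\s*=\s*["'][^"']*window\.status""", re.I),
    # script checks for the right mouse button, or the context menu is blocked
    "right_click_disabled": re.compile(
        r"""event\.button\s*==\s*2|oncontextmenu\s*=\s*["']\s*return\s+false""", re.I),
    # page opens a new window or asks for input through a dialog
    "popup_window": re.compile(r"window\.open\s*\(|\bprompt\s*\(", re.I),
    # page embeds another document through an iframe
    "iframe_redirection": re.compile(r"<iframe\b", re.I),
}


def extract_page_features(html: str) -> dict:
    """Return 1 for each suspicious pattern found in the page source, else 0."""
    return {name: int(bool(pattern.search(html)))
            for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    sample = """<body oncontextmenu="return false">
<a href="http://example.test" onmouseover="window.status='https://bank.example'">Login</a>
<iframe src="http://example.test/x" frameborder="0"></iframe>
</body>"""
    print(extract_page_features(sample))
```

Regex matching on raw source is a deliberately simple choice for a sketch; obfuscated or dynamically generated scripts would evade these patterns, so a real feature extractor would more plausibly parse the DOM or render the page.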