ON
BY
Under Guidance of
YEAR 2021-22
CERTIFICATE
This is to certify that Seminar Based learning report entitled
BY
CERTIFICATE
This is to certify that
Students of TE Information Technology were examined in Seminar Based Learning report entitled
…/…/2021
At
-------------------- ----------------------
(Internal Examiner) (External Examiner)
Has completed the Seminar Based Learning work under my guidance and that, I have verified the
work for its originality in documentation, problem statement, literature survey and conclusion
presented in seminar work
I would like to express my profound gratitude to Dr. Gunjal B.L. (HOD IT) for providing an
opportunity to complete my academics and present this technical seminar, and for providing me
invaluable guidance for the technical seminar. I would like to show my greatest appreciation to
Prof. Muneshwar R.N. and Dr. Chaudhari M.A. (Seminar Coordinator). I can't thank them
enough for their tremendous support and help; their contributions and technical support in
preparing this report are gratefully acknowledged. The guidance and support received from all the
members who contributed and who are contributing to this report were vital for the success of the
project. I am grateful for their constant support and help. The project on “Phishing Attack
Detection Using Artificial Intelligence” was very helpful to us in giving the necessary
background information and inspiration in choosing this topic for the project. Last but not
least, we wish to thank our parents for financing our studies in this college as well as for
constantly encouraging us to learn engineering. Their personal sacrifice in providing this
opportunity to learn engineering is gratefully acknowledged.
TITLE
Certificate
i) Acknowledgement
ii) Abstract
v) List of Abbreviations
1. Introduction
2. Literature survey
3. Proposed Work
6. Future Scope
7. Conclusion
References
Phishing attack
Example of KNN
ML Machine Learning
DL Deep Learning
NN Neural Networks
AI Artificial Intelligence
HT Hybrid Technique
• Phishing has become one of the most prominent attacks faced by internet users,
governments, and service-providing organizations.
• Phishing is one of the most common and effective attacks for breaking into email accounts
and web documents.
• A phishing attack uses fake websites to steal sensitive client data, for example account login
credentials and credit card numbers. Cybercriminals use this attack to break into the bank,
Facebook, and email accounts of innocent people.[4]
• Every year, many of the biggest cybercrime cases involve this attack, so we must know what
phishing is and how to protect our accounts from it.
• To detect phishing attacks, we use artificial intelligence, which can detect spam phishing,
spear phishing, and other kinds of attack.[3]
• Phishing is a type of social engineering attack often used to steal user data, including login
credentials and credit card numbers. It occurs when an attacker, masquerading as a trusted
entity, dupes a victim into opening an email, instant message, or text message.
Phishing is a form of cybercrime in which an attacker impersonates a real person or institution,
presenting themselves as an official person or entity through e-mail or other communication media.
In this type of cyber-attack, the attacker sends malicious links or attachments through phishing
e-mails that can perform various functions, including capturing the login credentials or account
information of the victim. These e-mails harm victims through financial loss and identity theft. In
this study, software called "Anti Phishing Simulator" was developed, giving information about the
phishing detection problem and how to detect phishing e-mails. With this software, phishing and
spam e-mails are detected by examining mail contents. Spam words added to the database are
classified using a Bayesian algorithm.
Paper 1
Abdul Basit, Mahram Zafar, Xuan Liu, Abdul Rahman Javed, Zunera Jalil, Kashif Kifayat -
A comprehensive survey of AI-enabled phishing attacks detection techniques (IEEE,
October 2020)
The paper presents a detailed comparative study of previous works that use different approaches.
Machine learning based approaches, deep learning based approaches, scenario-based approaches,
and hybrid techniques have been deployed in the past to tackle this problem. The comparative
analysis revealed that machine learning methods are the most frequently used and most effective
methods to detect a phishing attack. Different classification methods such as SVM, RF, ANN,
C4.5, k-NN, and DT have been used. Techniques with feature reduction give better performance.
Classification with ELM, SVM, LR, C4.5, LC-ELM, kNN, and XGB, combined with ANOVA-based
feature selection, detected phishing attacks with 99.2% accuracy, the highest among the methods
surveyed, but with trade-offs in terms of computational cost.
Paper 2
Muhammet Baykara, Zahit Ziya Gürel - Detection of phishing attacks - March 2018
In this study, software called "Anti Phishing Simulator" was developed, giving information about
the phishing detection problem and how to detect phishing e-mails. With this software, phishing
and spam e-mails are detected by examining mail contents. Spam words added to the database are
classified using a Bayesian algorithm.
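The idea of classifying messages from their spam words with a Bayesian algorithm can be illustrated with a minimal naive Bayes sketch. This is not the code of the Anti Phishing Simulator: the function names, the Laplace (add-one) smoothing, and the toy training messages are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(messages):
    """messages: list of (list_of_words, label) pairs, label is 'spam' or 'ham'.

    Assumes at least one message of each class is present.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for words, label in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary

def classify(words, model):
    """Return the label with the highest log-posterior score."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior: fraction of training messages with this label.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only.
training = [
    (["verify", "your", "account", "now"], "spam"),
    (["urgent", "account", "suspended", "click"], "spam"),
    (["meeting", "agenda", "for", "monday"], "ham"),
    (["lunch", "on", "monday"], "ham"),
]
model = train_naive_bayes(training)
print(classify(["click", "to", "verify", "account"], model))
print(classify(["agenda", "monday"], model))
```

In a real system the word counts would come from the stored spam-word database rather than from a hard-coded list.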
Paper 3
Ivan Ortiz-Garcés, Roberto O. Andrade, and Maria Cazares - Detection of Phishing
Attacks with Machine Learning Techniques in Cognitive Security Architecture - March 2019
The number of phishing attacks has increased in Latin America, exceeding the operational capacity
of cybersecurity analysts. The cognitive security application proposes the use of big data, machine
learning, and data analytics to improve response times in attack detection. This paper presents an
investigation into the analysis of anomalous behavior related to phishing web attacks and how
machine learning techniques can be an option to face the problem. The analysis uses contaminated
data sets and Python tools to develop machine learning models that detect phishing attacks by
analysing URLs and determining, from specific characteristics of each URL, whether it is good or
bad. The goal is to provide real-time information for taking proactive decisions that minimize the
impact of an attack. AI is one of the possible solutions: it can help to detect anomalous behavior
quickly and can offer new possibilities for protecting sensitive information, which is why it is so
important in new cybersecurity approaches.
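Judging a URL from its characteristics can be sketched as a small feature extractor. The paper's actual feature set is not listed here, so the features below (URL length, dots and hyphens in the host, an "@" symbol, a raw IP address as host, HTTPS use) are assumptions chosen as commonly used lexical indicators; the function name `url_features` and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """Extract simple lexical features from a URL (illustrative feature set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        # "user@host" URLs can hide the real destination behind a trusted name.
        "has_at_symbol": "@" in url,
        # A raw IPv4 address instead of a domain name.
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(?:\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
    }

# Hypothetical URLs, for illustration only.
suspicious = url_features("http://paypal.com@evil-login.example/verify")
ordinary = url_features("https://www.example.com/")
print(suspicious)
print(ordinary)
```

A feature dictionary like this would then be passed to a classifier; no single feature decides on its own whether a URL is good or bad.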
Paper 4
Phishing attacks remain one of the major threats to individuals and organizations to date. As
highlighted in the article, this is mainly driven by human involvement in the phishing cycle. Often
phishers exploit human vulnerabilities in addition to favouring technological conditions (i.e.,
technical vulnerabilities). It has been identified that age, gender, internet addiction, user stress, and
many other attributes affect the susceptibility to phishing between people. In addition to traditional
phishing channels (e.g., email and web), new types of phishing mediums such as voice and SMS
phishing are on the increase. Furthermore, social media-based phishing has increased in parallel
with the growth of social media. Concomitantly, phishing has developed beyond
obtaining sensitive information and financial crimes to cyber terrorism, hacktivism, damaging
reputations, espionage, and nation-state attacks. Research has been conducted to identify the
motivations, techniques, and countermeasures for these new crimes; however, there is no single
solution for the phishing problem due to the heterogeneous nature of the attack vector. This article
has investigated problems presented by phishing and proposed a new anatomy, which describes
the complete life cycle of phishing attacks.
PROJECT TITLE
BACKGROUND
NEED OF STUDY
With the significant growth of internet usage, people increasingly share their personal information
online. As a result, an enormous amount of personal information and financial transactions becomes
vulnerable to cybercriminals. Phishing is an example of a highly effective form of cybercrime that
enables criminals to deceive users and steal important data. This article aims to evaluate these
attacks by identifying the current state of phishing and reviewing existing phishing techniques. [1]
Personal computer users fall victim to phishing attacks for a few primary reasons:
(1) Users do not have sufficient knowledge of Uniform Resource Locators (URLs),
(2) Users do not know exactly which pages can be trusted,
(3) Users cannot see the entire location of the page because of redirection or hidden URLs,
(4) The URL has many possible variants, or some pages are entered accidentally,
(5) Users cannot differentiate a phishing website page from the legitimate ones.
OBJECTIVES OF STUDY
To determine whether the user must be warned that the website is a phishing site or informed that
the website is safe.
PREVENTION
DURATION OF PROJECT
Starting
Ending
Step 1: Dataset:
The first step in building the proposed phishing email classifier is choosing a suitable training
data set: a real sample of existing emails that consists of both phishing and legitimate
emails (also known as spam and ham emails). The training data set will be used to discover
potentially predictive relationships that will serve as building blocks in the classifier. Our training
data set consists of 10538 emails: 5940 ham emails from the SpamAssassin project [5] and
4598 spam emails from the Nazario phishing corpus [5].
Step 2: Pre-processing:
This is the first stage that is executed whenever an incoming mail is received. This step consists of
tokenization. Tokenization: This is a process that extracts the words from the body of an email and
transforms the message into its meaningful parts. It takes the email and divides it into a sequence
of representative symbols called tokens.
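A minimal sketch of the tokenization step is given below. The token pattern (lower-cased runs of letters, digits, and apostrophes) and the function names are assumptions; the report does not specify the exact rules used.

```python
import re
from collections import Counter

def tokenize(body):
    """Lower-case the email body and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", body.lower())

def word_frequencies(body):
    """Count how often each token occurs in the email body."""
    return Counter(tokenize(body))

tokens = tokenize("Dear user, verify your ACCOUNT now!")
print(tokens)  # ['dear', 'user', 'verify', 'your', 'account', 'now']
print(word_frequencies("Verify now, verify today"))
```

The resulting word frequencies are the kind of feature that a classifier can compare between emails.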
The datasets used here are Spambase, available at https://archive.ics.uci.edu/ml/datasets/Spambase,
and personal mail data. The datasets other than the personal mails are already feature extracted
and need not be reprocessed. The personal mails are available in raw format and hence need
header feature extraction. The personal mails at https://www.cs.cmu.edu/~./Enron/ [21], which are
large in number (0.5M), are feature extracted first, then normalized, and then fed to a Weka server
for classification. The subject words in the email header can be analysed to see if all letters are
capital; if so, the email is likely to be spam, as spammers try to attract attention by
putting every letter in capitals. Also, cleverly written words, such as Money written as M0ney, money,
m o n e y, mooney, M O N E Y etc., are some of the tricks used by spammers and are taken care of
during pre-processing. During the pre-processing stage, a Python script is used to segregate such
emails as spam.
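The two subject-line checks described above can be sketched as follows. This is not the script used in this work: the character substitution table, the watch-word list, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical substitution table for look-alike characters.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "$": "s", "@": "a"}
)
# Hypothetical list of words that spammers tend to disguise.
WATCH_WORDS = {"money"}

def is_all_caps(subject):
    """True if the subject contains letters and every letter is upper-case."""
    letters = [ch for ch in subject if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)

def normalize(text):
    """Undo common disguises: look-alike characters, spacing, repeated letters."""
    text = text.lower().translate(LEET_MAP)
    # Join single letters separated by spaces: "m o n e y" -> "money".
    text = re.sub(
        r"\b(?:[a-z] )+[a-z]\b", lambda m: m.group(0).replace(" ", ""), text
    )
    # Collapse repeated letters: "mooney" -> "money".
    return re.sub(r"([a-z])\1+", r"\1", text)

def has_obfuscated_watch_word(subject):
    """True if a watch word appears only after normalization."""
    plain_tokens = set(re.findall(r"[a-z]+", subject.lower()))
    normalized_tokens = set(re.findall(r"[a-z]+", normalize(subject)))
    targets = {normalize(word) for word in WATCH_WORDS}
    return bool(targets & normalized_tokens) and not (WATCH_WORDS & plain_tokens)

for subject in ["WIN A PRIZE NOW", "Get M0ney fast", "Money transfer receipt"]:
    print(subject, is_all_caps(subject), has_obfuscated_watch_word(subject))
```

Collapsing repeated letters also shortens legitimate words, so the sketch compares normalized text only against normalized watch words instead of using the normalized text as-is.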
KNN (K-Nearest Neighbours) is one of the most straightforward supervised learning algorithms.
Unlike traditional supervised learning algorithms such as Multinomial Naive Bayes, KNN does not
have a separate training stage followed by a stage in which the labels of the test data are
predicted from the trained model. Instead, the features of every test data item are compared with
the features of every training data item at prediction time, the K nearest training data items are
selected, and the most frequent class among them is assigned to the test data
item.
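In the context of email classification (spam or ham), the features to be compared are the frequencies
of the words in each email. The Euclidean distance is used to determine the similarity between two
emails: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x_i$ and $y_i$ are the frequencies
of word $i$ in the two emails. The smaller the distance, the more similar the emails.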
Once the Euclidean Distance between a test email and each training email is calculated, the
distances are sorted in ascending order (nearest to farthest), and the K-nearest neighbouring emails
are selected. If the majority is spam, then the test email is labelled as spam, else, it is labelled as
ham.
In the example, K = 5; we are comparing the email we want to classify to the nearest
5 neighbours. In this case, 3 out of 5 emails are classified as ham (non-spam), and 2 are classified
as spam. Since the majority is ham, the email is labelled as ham.
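The KNN procedure described above (Euclidean distance on word frequencies, sort nearest to farthest, majority vote among the K nearest) can be sketched as follows. The function names and the toy training emails are assumptions made for illustration; real training data would come from the corpus described in Step 1.

```python
import math
from collections import Counter

def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two word-frequency dictionaries."""
    words = set(freq_a) | set(freq_b)
    return math.sqrt(
        sum((freq_a.get(w, 0) - freq_b.get(w, 0)) ** 2 for w in words)
    )

def knn_classify(test_freq, training, k=5):
    """training: list of (word_frequency_dict, label) pairs."""
    # Sort training emails from nearest to farthest.
    distances = sorted(
        ((euclidean_distance(test_freq, freq), label) for freq, label in training),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in distances[:k]]
    spam_votes = nearest_labels.count("spam")
    # Spam only if it holds the majority among the K nearest, else ham.
    return "spam" if spam_votes > len(nearest_labels) / 2 else "ham"

# Hypothetical toy data, for illustration only.
training = [
    (Counter("verify your account now".split()), "spam"),
    (Counter("urgent verify your password".split()), "spam"),
    (Counter("click to claim your prize".split()), "spam"),
    (Counter("meeting agenda for monday".split()), "ham"),
    (Counter("lunch on monday".split()), "ham"),
    (Counter("project report attached".split()), "ham"),
]
print(knn_classify(Counter("please verify your account".split()), training, k=3))
print(knn_classify(Counter("agenda for monday meeting".split()), training, k=3))
```

Because every test email is compared with every training email, the cost of one prediction grows with the size of the training set. An odd K avoids tied votes between the two classes.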
Application:
1. Internet fraud
2. Identity theft
Advantages:
1. Builds a secure connection between the mail transfer agent and the mail user agent.
2. Reduces the cyber threat risk level.
3. Protects valuable corporate and personal data.
Disadvantages:
1. Needs a large mail server and has high memory requirements.
Cloud service providers have already implemented a number of security features to proactively
identify phishing attacks. Machine learning, improved email filtering, and malicious URL
detection are just a handful of capabilities that keep users safe on the web. Some providers even
warn users when replying to emails outside of their corporate domains, particularly important in
an enterprise setting. [4]
While cloud providers are often quick to recognize large-scale attacks and inform the public about
the right precautions to take when opening shared files, many individuals and organizations are
still subject to costly breaches. Educating users on best practices and making them aware of what
to look for can go a long way in protecting data; organizations must also take a proactive approach
to detecting these threats as they evolve. [4]
The future scope is to integrate the system with email service providers to develop a foolproof
system in which a phishing email is stopped before it reaches the user, thereby stopping the
phisher beforehand.
We designed and provide a proactive solution that identifies whether a given URL is malicious. From
this study, I conclude that there are several hooks, or signs of a phishing email, that can
indicate an email is not as genuine as it appears: an unfamiliar tone or greeting; grammar and
spelling errors; inconsistencies in email addresses, links, and domain names; threats or a sense of
urgency; and suspicious attachments. To prevent phishing, it is important to learn about the
tactics of phishers. Employees should be trained in security awareness as part of their orientation
and told to be wary of e-mails with attachments from people they don't know. [4]
1) https://www.frontiersin.org/articles/10.3389/fcomp.2021.563060/full
2) https://www.researchgate.net/figure/Stages-in-a-Phishing-attack_fig1_235947501
3) https://www.google.com/search?q=detection+of+phishing+attack+using+artificial+intelligence&oq=detection+of+phishing+attack+using+artificial+intelligence+&aqs=chrome..69i57j69i59j35i39.6311j0j7&sourceid=chrome&ie=UTF-8
4) https://www.google.com/
5) https://easychair.org/publications/preprint_open/Kpsq
6) https://towardsdatascience.com/spam-email-classifier-with-knn-from-scratch-python-6e68eeb50a9e
7) https://github.com/diegoocampoh/MachineLearningPhishing
8) https://www.enjoyalgorithms.com/blog/email-spam-and-non-spam-filtering-using-machine-learning/