CH 2. Literature Survey

Chapter-2 Literature Survey for Problem Identification and Specification
Phishing detection using machine learning is a prominent challenge presented in the
Smart India Hackathon 2023 [1], with a strong emphasis on enhancing cybersecurity.
Despite the existence of previous projects addressing this issue, it remains a pertinent
problem statement due to the ever-evolving nature of phishing attacks. In a phishing
attack, malicious actors impersonate trusted entities, deceiving individuals into revealing
sensitive information. These attackers often employ tactics like replicating legitimate
websites and disseminating malicious URLs through various mediums such as emails
and SMS. The expansion of online activities, notably during the COVID-19 pandemic,
has led to an increase in phishing attacks, making it necessary to develop robust detection
methods.
Detecting whether a website is a phishing attempt or legitimate has grown increasingly
intricate as cybercriminals continually refine their techniques and create novel deceptive
URLs. Traditional algorithms outlined in research papers often struggle to identify these
modified attacker strategies effectively. Consequently, developers face a difficult
challenge in distinguishing between phishing and legitimate websites. This issue
underscores the significance of addressing the problem in the Smart India Hackathon
2023. The need for innovative, resilient solutions is crucial to safeguard individuals from
falling prey to evolving cyber threats.
The research paper titled "Phishing Websites Detection using Machine Learning with
URL Analysis" [2], presented at the 2022 IEEE World Conference on Applied
Intelligence and Computing (AIC) by Areti Nagendra Soma Charan, Yu-Hung Chen,
and Jiann-Liang Chen, delves into the realm of identifying phishing websites through a
machine learning approach, primarily focusing on URL analysis. The key contribution
of this study lies in developing a system that can differentiate between phishing and
legitimate websites by scrutinizing the website's URLs and extracting significant
features.
The methodology employed in the research paper consists of several stages. Firstly, a
dataset of URLs is required to train the model. Following that, data preprocessing is
performed to clean and prepare the dataset. Features are then extracted from these
URLs, and machine learning models are trained and tested using this data. Seven
algorithms were implemented in this study: Decision Tree, Random Forest,
Logistic Regression, XGBoost, Support Vector Machine, K-Nearest Neighbor, and
AdaBoost.
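The feature extraction stage described above can be illustrated with a minimal sketch. The feature names and the function `extract_url_features` are assumptions chosen for illustration; they are not the feature set used in [2], which is not listed in this survey.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a few illustrative lexical features of a URL.

    The feature set is hypothetical and is not taken from any surveyed paper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),                      # long URLs are often suspicious
        "num_dots": host.count("."),                 # many subdomains can hide the real host
        "num_hyphens": host.count("-"),              # hyphens are common in look-alike hosts
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,                 # "@" can obscure the destination
        "uses_https": parsed.scheme == "https",
        # Host written as a dotted IPv4 address instead of a domain name
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
    }


if __name__ == "__main__":
    for name, value in extract_url_features("http://192.168.0.1/login@secure").items():
        print(name, value)
```

Each URL in the dataset would be converted into such a feature record before any of the classifiers is trained on it.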
The primary aim of evaluating these different algorithms is to determine the most
effective method for the extracted set of features. As per the research paper, Random
Forest and XGBoost displayed the highest accuracy, reaching approximately 85%, with
Random Forest proving to be the most accurate choice. In summary, the study shows that
URL-based feature extraction combined with ensemble algorithms such as Random Forest
and XGBoost can be particularly effective for detecting phishing websites.
The research paper titled "Variants of Phishing Attacks and Their Detection
Techniques" [3], presented at the Third International Conference on Trends in
Electronics and Informatics (ICOEI 2019) by G. Jaspher Willsie Kathrine,
Paradise Mercy Praise, Amrutha Rose, and Eligious Kalaivani C, discusses
different types of phishing attacks and methods to prevent them. The paper focuses on
using supervised machine learning techniques to detect these attacks.
The paper mentions some traditional methods for detecting phishing attacks. The first
one is the heuristic approach, which relies on predefined URLs. It works well for known
attacks but struggles with new ones because attackers keep changing their tactics. The
second method is the blacklist approach, where URLs are compared to a list of known
phishing sites. This is helpful, but it cannot catch new phishing sites that attackers keep
creating. The third is the whitelist approach, where URLs are checked against a list of
legitimate ones. Like the blacklist approach, it cannot reliably handle new, modified
phishing sites.
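The limitation of list-based detection can be shown with a small sketch. The host names and both lists are hypothetical placeholders; a real system would load them from maintained feeds.

```python
from urllib.parse import urlparse

# Hypothetical lists used only for illustration.
BLACKLIST = {"phish-login.example"}
WHITELIST = {"bank.example"}


def host_of(url: str) -> str:
    """Extract the lower-cased host name from a URL."""
    return (urlparse(url).hostname or "").lower()


def blacklist_verdict(url: str) -> str:
    """Flag a URL only if its host is already known to be malicious."""
    return "phishing" if host_of(url) in BLACKLIST else "unknown"


def whitelist_verdict(url: str) -> str:
    """Trust a URL only if its host is already known to be legitimate."""
    return "legitimate" if host_of(url) in WHITELIST else "unknown"


if __name__ == "__main__":
    # A freshly registered look-alike host appears on neither list.
    new_url = "http://phish-l0gin.example/verify"
    print(blacklist_verdict(new_url), whitelist_verdict(new_url))
```

Both lookups return "unknown" for a host that has never been seen before, which is the gap that a learned classifier is meant to close.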
The paper suggests that using machine learning is a better way to identify phishing
websites. It introduces various machine learning algorithms for this purpose. In
conclusion, the research paper finds that machine learning is the most effective approach
to detect phishing attacks. It can adapt to new and evolving techniques used by attackers,
making it a more reliable way to protect users from these threats.
The research paper titled "Detecting Phishing Websites using Machine Learning" [4],
written by Amani Alswailem, Bashayr Alabdullah, Norah Alrumayh, and Dr. Aram
Alserani, introduces a system that employs machine learning to detect phishing
websites. In this study, a supervised machine learning algorithm, the Random
Forest algorithm, is used because it performs well in classification.
The Random Forest algorithm is effective because it uses an ensemble technique called
bagging. This means it trains many decision tree models on different random samples of
the training data and takes a majority vote to decide if a website is phishing or
legitimate. Using multiple decision trees helps overcome a common weakness of single
decision trees, which tend to overfit the training data (high variance). By combining
these decision trees, Random Forest gives more stable results with lower variance.
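The bagging idea can be sketched in a few lines. This is a deliberately simplified illustration and not the implementation used in [4]: each "tree" here is a one-split stump on a randomly chosen feature, and all function names are assumptions.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, rng):
    """Fit a one-split tree: pick a random feature and split at its mean value."""
    feature = rng.randrange(len(rows[0]))
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_bagged(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], rng))
    return forest


def predict_bagged(forest, row):
    """Combine the individual predictions by majority vote."""
    return majority([predict_stump(stump, row) for stump in forest])


if __name__ == "__main__":
    # Toy data: two numeric features per URL, label 1 = phishing, 0 = legitimate.
    rows = [[90, 85], [80, 90], [85, 80], [20, 30], [25, 20], [30, 25]]
    labels = [1, 1, 1, 0, 0, 0]
    forest = train_bagged(rows, labels)
    print(predict_bagged(forest, [88, 88]), predict_bagged(forest, [22, 22]))
```

An individual stump trained on an unlucky bootstrap sample may predict poorly, but the vote across many stumps smooths out such errors, which is the variance reduction that bagging provides.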
In conclusion, the research paper finds that the Random Forest algorithm with its
ensemble technique is a strong choice for classifying websites as phishing or legitimate.
This method is effective in enhancing the accuracy of detection, making it a valuable
tool in the fight against phishing attacks.
Although numerous research papers have implemented many supervised algorithms to
detect phishing attacks, detection remains a challenging task for developers. The reason is that
while we improve our methods, attackers keep finding new ways to trick users with
phishing websites. They send deceitful links through various means. To tackle this, we
need fresh and unique techniques for detection. It's like a constant cat-and-mouse game,
where we must come up with clever ways to stay ahead of the attackers and protect users
from falling into phishing traps.
The research paper "Phishing Website Detection Based on Effective Machine
Learning Approach" [5] by Gururaj Harinahalli Lokesh and Goutham Bore Gowda
explores how machine learning can help identify phishing attacks and discusses the
pros and cons of each technique. They compared various supervised machine learning
techniques such as Random Forest, K-Nearest Neighbors, Decision Tree, Linear SVM
classifier, and One-Class SVM classifier. These methods use website details to decide
if a website is legitimate or phishing.
They tested these algorithms to see how well they work, using metrics such as
accuracy, recall, and precision. One-Class SVM had an accuracy of 48.56%, Linear
SVM had 92.69%, K-Nearest Neighbors achieved 93.53%, Decision Tree reached
96.05%, and Random Forest did the best with 96.87% accuracy. According to the paper,
Random Forest outperformed the other algorithms because it avoids overfitting the data,
which is one of its important features. Hence the Random Forest classifier is best suited
for detecting accurately whether a website is phishing or not.
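The three evaluation metrics mentioned above follow directly from the counts of correct and incorrect predictions. The sketch below uses made-up labels purely for illustration; the function name `classification_metrics` is an assumption and the numbers are not results from [5].

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary classifier.

    `positive` is the label treated as "phishing".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # phishing caught
    tn = sum(t != positive and p != positive for t, p in pairs)  # legitimate passed
    fp = sum(t != positive and p == positive for t, p in pairs)  # legitimate flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # phishing missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    # Illustrative labels only: 1 = phishing, 0 = legitimate.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(y_true, y_pred))
```

Accuracy alone can hide the cost of missed phishing sites, so recall (the share of phishing sites that were caught) is worth reporting alongside it.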
The research paper titled “A Comparative Analysis of Machine Learning-Based Website
Phishing Detection Using URL Information” [6], presented at the 2022 5th
International Conference on Pattern Recognition and Artificial Intelligence by Md. Milon
Uddin, Kazi Arfatul Islam, Muntasir Mamun, Vivek Kumar Tiwari, and Jounsup Park,
explores how machine learning can help detect phishing websites, which are sites that
try to steal your personal and financial information. As more of our daily activities move
online, the risk of falling victim to these scams increases. The paper focuses on using
machine learning techniques to differentiate between legitimate and phishing websites,
particularly by analyzing the URLs.
The researchers compared five machine learning algorithms: Decision Tree, Random
Forest, K-Nearest Neighbors, Gaussian Naive Bayes, and XGBoost. They conducted
experiments using a dataset that contained an equal number of legitimate and phishing
URLs, with a total of 11,430 URLs and 87 features.
The results showed that the Random Forest algorithm performed the best, with an
accuracy of 97.0%. This means it was very effective at identifying phishing websites.
The accuracy of XGBoost was also high, at 94.79%. The other methods had lower
accuracy levels.
The paper emphasizes the importance of feature extraction, which means selecting the
most relevant information from the dataset. This step significantly influences the
accuracy of phishing detection. They found that certain features, like the URL itself and
some related attributes, were crucial in making accurate predictions.
Phishing attacks often use emails and other communication channels to deceive users
into clicking on malicious links. Being able to identify phishing websites is essential to
protect users from these scams. Machine learning techniques, like those evaluated in this
paper, can play a vital role in achieving this goal.
In conclusion, the research demonstrates that using machine learning, particularly the
Random Forest algorithm, can be a powerful tool to detect phishing websites accurately.
This is essential for safeguarding users' online privacy and security. The paper also
suggests that further improvements can be made in the future, potentially increasing the
accuracy of phishing detection.
The research paper titled “An Efficient Phishing Attack Detection using Machine
Learning Algorithms” [7], presented at the 2022 International Conference on
Advancements in Smart, Secure and Intelligent Computing (ASSIC) by P.
Chinnasamy, R. Selvaraj, K. Ramprathap, N. Kumaresan, S. Dhanasekaran, and Sruthi Boddu,
discusses the importance of detecting phishing attacks using machine learning
algorithms. Phishing is a method used by cybercriminals to steal personal information,
and it poses a significant risk to individuals, organizations, and even government
agencies. While there are various anti-phishing approaches available, many users
avoid them due to their cost and complexity. To address this, the paper proposes a
software-based solution that uses heuristic methods for detecting phishing attacks.
The proposed methodology classifies whether a given link is phishing or not based on
features like web traffic and uniform resource locator (URL). The research implemented
machine learning algorithms, including Random Forest and Support Vector Machine (SVM),
to achieve this classification.
The paper presents a detailed literature review, exploring different techniques and
methods used for intrusion detection and phishing attack detection. It also provides an
overview of the proposed methodology, outlining the data collection, data description,
feature selection, and the use of different machine learning algorithms.
The performance evaluation section demonstrates that the Random Forest model
achieves the highest accuracy, with a rate of 94.73%. The paper includes figures and
graphs illustrating the results and provides a brief conclusion, highlighting the need for
more extensive network protection and suggesting future research directions.
In conclusion, the paper emphasizes the importance of detecting and
preventing phishing attacks and proposes an effective method that uses machine learning
algorithms to classify potentially malicious links, thus enhancing online security.
After conducting a literature review for our project on "Phishing Detection using
Machine Learning", we have reached a significant conclusion. Detecting whether a
website is legitimate or potentially a phishing site is a complex and critical task, and
various research papers confirm this. To address this challenge, we decided to explore
multiple supervised machine learning algorithms to find the one that offers the highest
accuracy.
Our research suggests that the Random Forest Algorithm has consistently
demonstrated better accuracy in phishing detection, as supported by several studies.
Given the importance of accurate detection to enhance online security, we intend to
implement various supervised machine learning algorithms to compare their
performance. Ultimately, we will select the algorithm that provides the highest accuracy
and reliability to proceed with our project.
In summary, the complex nature of phishing detection calls for a thorough examination
of different algorithms. We will make an informed choice based on their performance to
ensure the effectiveness of our project in identifying and preventing phishing attacks.
References:
1. https://www.sih.gov.in/
2. https://ieeexplore.ieee.org/document/9848895
