
NOVEMBER 2, 2022

MALICIOUS URL DETECTION


INFORMATION SECURITY ANALYSIS AND AUDIT
Harsh Avinash 20BDS0190
Sankalp Mukim 20BDS0128
Shriram A 20BCI0250

ABSTRACT
In this hyperconnected age of the Internet, URLs are constantly circulated across a variety
of social media platforms. With this overabundance of URLs, the exploitation of
vulnerabilities through malicious URLs has become commonplace. From malware to phishing,
there are countless ways in which attackers use malicious URLs to gain access to your data,
accounts and devices. Currently, software such as website blockers and anti-virus programs
offers a certain amount of protection from malicious URLs, but it comes with pitfalls.

Anti-virus programs are heavy applications and come at a premium cost. Website blockers
often fail to block malicious websites and are far from efficient. In response, in this paper
we propose and implement a lightweight browser extension that offers swift, round-the-clock
protection. It requires no setup, is easy to customise and scale as required, and is free of
cost. Some issues that may arise during implementation relate to Microsoft Azure's App
Service: after long periods without use, the App Service takes a long time to boot, and to
keep costs low we use the free tier of the App Service, which comes with limitations.

Our research and experimentation revealed that malicious URLs exhibit discernible patterns
that allow them to be identified. With these patterns in hand, we built an application to
prevent users from accessing malicious URLs.

INTRODUCTION
According to statistics, in January 2021, there were 4.66 billion active internet users
worldwide, more than half the global population. Of this total, 92.6 percent (4.32 billion) used
mobile devices to access the internet.

With these staggering numbers of Internet-connected devices and the incessant circulation of
media and websites in the form of Uniform Resource Locators (URLs), preying on the
vulnerable by means of these URLs has become commonplace.
Malicious URLs are those that redirect users to malicious or fraudulent websites and pages.
Clicking on these links may take you to a phishing website, or trigger the download of
malware intended to steal your data or corrupt your devices. These attacks have
compromised a significant proportion of the population’s data and devices. In the year 2021,
Kaspersky mobile products and technologies detected 3,464,756 malicious installation
packages, 97,661 new mobile banking Trojans, and 17,372 new mobile ransomware Trojans
(Shishkova, 2022).

As per Cisco’s 2021 Cybersecurity Threat Trends report, around 90% of data breaches occur
due to phishing attacks, costing companies an average of $4.56 million.
Identifying these malicious URLs and preventing access to them is a topic that has garnered
attention from researchers across the world, who have used methods such as blacklisting as
well as Machine Learning algorithms and models. The blacklist-whitelist model, while simple,
efficient and in common use, requires rigorous upkeep, leading to high costs and the
inevitable omission of some malicious URLs.

The Machine Learning approach has been extensively researched in recent years due to
the significant promise it shows. In this paper, we propose a solution by means of predictive
analysis, identifying trends in malicious URLs, and using those trends to develop an
application, in our case a browser extension, to keep individuals from accessing these sites.

PROBLEM STATEMENT
Use of predictive analysis to protect from malicious URLs

OBJECTIVES
I. To implement a browser extension that prevents users from accessing malicious URLs
by means of predictive analysis
II. To study the trends in a URL that may show signs of maliciousness.
III. To implement and test supervised machine learning algorithms like XGBoost and
Random Forest on the URL dataset
IV. To orchestrate the workings of the browser extension and the backend application
through the use of API calls and Azure App Service
V. To validate the output of the Machine Learning model by using it in a real-world
application.

LITERATURE REVIEW
DETECTION OF MALICIOUS URLS USING PARALLEL NEURAL JOINT MODEL

Yuan et al. (2021) propose a parallel neural joint model algorithm for analysis and detection,
wherein a visualisation algorithm is used to map the URLs to a grey image with texture
characteristics. The features of the URL are extracted and further processed through word
vector technology to then be transformed into vectors. A parallel joint neural network with
multiple networks is used to capture multi-modal vectors of visual and semantic information
synchronously. The last layer further filters the deep features extracted from the overall
network while concentrating on effective features to improve the classification accuracy.

USING MACHINE LEARNING WITH SVM AND TREE CLASSIFIERS TO CLASSIFY URLS

Xuan et al. (2020) propose using machine learning techniques based on URL behaviours
and attributes. Big Data technology is also exploited to improve the capability of detecting
malicious URLs based on abnormal behaviours. The experimental results show that the
proposed URL attributes and behaviours can significantly improve the ability to detect
malicious URLs.
Patil and Patil (2018) propose and apply a combination of static and dynamic approaches to
detecting malicious URLs. Features are extracted using static and dynamic analysis of the
URLs, making use of measures like the Shannon entropy of the URLs, suspicious words, and
length of the domain name. The methodology is evaluated using 6 Decision Tree algorithms:
J48 Decision Tree, Simple CART, Random Forest, Random Tree, ADTree and REPTree. WEKA's
majority voting algorithm makes the final decision as to whether a URL is malicious or benign.
Upon introducing the new features, the accuracy of each Decision Tree classifier as well as
of Majority Voting increased, with Majority Voting seeing a 0.61% increase. A similar increase
is seen for Precision, Recall and F-measure. The overall detection accuracy using the majority
voting classifier combined with the new features was found to be 99.29%, significantly higher
than that of 18 well-known antivirus products at the time.
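
As an illustration of the static measures mentioned above, the following minimal Python
sketch computes the character-level Shannon entropy of a URL and the length of its host
name. The function names and example URLs are assumptions made for illustration; they are
not taken from Patil and Patil (2018).

```python
import math
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_length(url: str) -> int:
    """Length of the host name part of `url` (0 if none can be parsed)."""
    return len(urlparse(url).hostname or "")


# Hypothetical example URLs, for illustration only.
for url in ("https://example.com/login", "http://x9k2-qz7.example/a%3Fb=1&c=2"):
    print(url, round(shannon_entropy(url), 3), domain_length(url))
```

A string made of one repeated character has an entropy of 0 bits, while a string with many
different, evenly used characters scores higher, which is why the measure can help flag
random-looking URLs.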

USING NEURAL NETWORKS TO PREDICT MALICIOUS URLS

Bo et al. (2021) introduce a P system to optimise the performance of a BP neural
network meant to detect malicious URLs. Membrane computing (also known as P systems) is
a new distributed parallel technology for information processing and for system modelling
and simulation. Membrane computing is used to design the optimization algorithm to improve
performance. PSO (Particle Swarm Optimization) is influenced largely by its inertia weight
(IW). Larger IW implies better search precision, and more particles swarm toward the optimal
point just found, creating problems if the point is a local optimum. Small IW implies faster
convergence speed and a large probability of jumping off the local optimum, creating
oscillation and precision problems. Membrane computing introduces a P system, with a
framework that outputs the ultimate optimum in an optimised manner. Upon
experimentation, the authors found that in all three measures of Accuracy, Precision, and
Recall, the proposed algorithm had better scores than the BP algorithm, implying better
classification performance.
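
To show where the inertia weight enters PSO, the sketch below applies one step of the
standard, textbook PSO update to a swarm. It is not the membrane-computing variant of
Bo et al. (2021); the function name and the default values of `inertia`, `c1` and `c2`
are assumptions chosen only for illustration.

```python
import random


def pso_step(positions, velocities, personal_best, global_best,
             inertia=0.7, c1=1.5, c2=1.5, rng=random):
    """Apply one standard PSO update to every particle, in place.

    v <- inertia * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x <- x + v
    """
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            # Fresh random factors for every particle and dimension.
            r1, r2 = rng.random(), rng.random()
            v[d] = (inertia * v[d]
                    + c1 * r1 * (personal_best[i][d] - x[d])
                    + c2 * r2 * (global_best[d] - x[d]))
            x[d] += v[d]
```

The inertia weight scales how much of the previous velocity a particle keeps, while the
two remaining terms pull it toward its own best position and the swarm's best position.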

URLNET

Le et al. (2018) propose URLNet, an end-to-end deep learning framework to learn nonlinear
URL embeddings for malicious URL detection directly from the URL. Classification is treated
as a binary classification task: given a set of URLs, each URL is represented by an
n-dimensional feature vector. To solve the issue of many rare words possibly
existing in the URL, the authors make use of advanced word embeddings. URLNet receives a
string as an input (the URL string), and applies CNNs to both the characters and words present
in the string. At the character level they identify unique characters in the corpus and
represent each as a vector, which enables the representation of the URL as a matrix. At the
word level, they identify unique words in the corpus which are delimited by special characters
and then represent them using a word embedding matrix. After both these layers of processing are
done, they are followed by a fully connected layer which is regularised by dropout, and then
finally, 4 fully connected layers that lead to the output classifier.
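
The character-level and word-level views of a URL described above can be sketched as
follows. This is a simplified illustration and not the URLNet implementation: the helper
names, the reserved `<UNK>` id and the choice to drop special characters are assumptions,
and the embedding and CNN layers themselves are omitted.

```python
import re


def build_vocab(tokens):
    """Map each unique token to an integer id; id 0 is reserved for unknowns."""
    vocab = {"<UNK>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def char_tokens(url):
    """Character-level view: every character is a token."""
    return list(url)


def word_tokens(url):
    """Word-level view: split on non-alphanumeric characters, drop empty pieces."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def encode(tokens, vocab):
    """Turn tokens into ids; tokens unseen when the vocabulary was built map to 0."""
    return [vocab.get(tok, 0) for tok in tokens]


# Hypothetical two-URL corpus, for illustration only.
corpus = ["http://example.com/login", "http://example.org/index.html"]
char_vocab = build_vocab(c for u in corpus for c in char_tokens(u))
word_vocab = build_vocab(w for u in corpus for w in word_tokens(u))

print(encode(word_tokens("http://example.com/signin"), word_vocab))
```

Each id sequence would then index into an embedding matrix, giving the matrix
representation of the URL that the convolutional layers operate on.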

GAPS IDENTIFIED IN EXISTING LITERATURE

The proposed model requires substantial feature engineering. It is also unable to capture
unseen and undiscovered features (Yuan et al., 2021). Failure rate is still relatively high
(around 10%) on testing data (Xuan et al., 2020). Features from social networks to
characterise malicious URLs are yet to be investigated. Features of short URLs are yet to be
investigated with regards to effective detection.

The methodology lacks the analysis and detection of obscure JavaScript code in web pages
that is responsible for attacks like XSS and drive-by downloads (Patil and Patil, 2018).
Neural networks make for robust but heavy models that require more computing power than
is available on the free tier of the Azure App Service, thus making them expensive to deploy
(Bo et al., 2021). The model cannot obtain embeddings for new words at test time. When too
many new words are obtained, a memory constraint may occur. When the false positive rate
is 0.0001, it performs worse than other simpler models do on prediction (Le et al., 2018).

DESIGN
The architecture of our application is as given here. The application at its core, is a browser
extension. The browser extension can be installed by the user on their preferred browser.
Once installed, the extension starts to listen to new URLs that are being requested. For every
requested URL, the extension then makes an API call to the backend server. This backend
server is running on Azure App Service.

The backend server has been set up on a Python 3.9.6 environment. Gunicorn, a WSGI HTTP
server for Python that can also serve ASGI applications through an ASGI worker class, is
running in this environment. It runs 4 worker processes, so 4 instances of the API run
simultaneously. This was implemented to handle large numbers of requests. The workers are
separate processes managed by a Gunicorn master process and do not share application
memory; incoming connections are distributed among them by the operating system, so a
request tends to be picked up by a worker that is free. Once the backend is done computing
the request, it sends the data back to the client’s browser extension, which then decides
what to do with the result.
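
A Gunicorn setup with 4 workers can be expressed in a Python configuration file. The
sketch below is an assumed example and not the project's actual configuration: the file
name, port, timeout and the use of Uvicorn's worker class for the ASGI application are
illustrative choices.

```python
# gunicorn.conf.py (hypothetical file; all values are assumptions for illustration)

# Address and port the server listens on.
bind = "0.0.0.0:8000"

# Number of worker processes, matching the 4 workers described in the text.
workers = 4

# ASGI-capable worker class; requires the uvicorn package to be installed.
worker_class = "uvicorn.workers.UvicornWorker"

# Seconds after which an unresponsive worker is killed and restarted.
timeout = 30
```

With such a file in place, the server could be started with a command of the form
`gunicorn -c gunicorn.conf.py main:app`, where `main:app` is a placeholder for the module
and application object.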

For the backend deployment on Azure App Service, continuous integration and continuous
deployment have been implemented. The codebase of the backend is hosted on GitHub. When
a change is pushed to the codebase, GitHub detects it and runs a GitHub Actions workflow.
This workflow is in charge of building and deploying the application to the Azure App
Service. Once GitHub Actions has completed execution, the new version can be seen on
the production deployment in just a few minutes. This way, deployment becomes automatic
and there is a reduced chance of errors.

IMPLEMENTATION
The implementation revolves around 2 main components: the browser extension and the
backend. The detection and prediction of malicious URLs is abstracted away by the backend,
as described below. The user can install the browser extension on the Mozilla Firefox browser
and then continue browsing normally. In the background, as soon as the extension is
installed, a main background script starts and registers an event listener for all web
requests the browser makes. This event listener listens for the onBeforeRequest event so that
it can act before any potential damage is caused. At this point, it sends the URL to the
FastAPI backend, and based on its response, it can act in 2 ways.

I. Let the request pass as it is, if the server says that it is benign.

II. Otherwise, if it is detected as malicious, we cancel the request and show the user
corresponding feedback. For this, the background script has to talk to the content
script and manipulate the DOM that is visible to the user.

The backend server receives the URL and classifies it as malicious or benign. It
makes use of a pre-built model that is saved on the server to perform said classification. The
pre-built model is an instance of the XGBoost classifier, trained on a provided dataset.

The dataset consists of approximately 450,000 unique URLs, each labelled as malicious or
benign. It is pre-processed to include the following features:
• Length of hostname
• Length of path
• Length of first directory
• Length of total directory
• Counts of “-”, “@”, “?”, “%”, “.”, “=”, and “http”
• Counts of digits, letters, and directories
• Whether the host is an IP address or a domain name
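
Most of the features listed above can be derived from the URL string alone. The following
is a minimal sketch using only the Python standard library; the function and feature names
are assumptions rather than the names used in the project, and the "length of total
directory" feature is left out because its exact definition is not given here.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(hostname: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False


def extract_features(url: str) -> dict:
    """Compute lexical features of a URL (feature names are illustrative)."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    path = parsed.path
    directories = [part for part in path.split("/") if part]
    features = {
        "hostname_length": len(hostname),
        "path_length": len(path),
        "first_directory_length": len(directories[0]) if directories else 0,
        "count_directories": len(directories),
        "count_digits": sum(ch.isdigit() for ch in url),
        "count_letters": sum(ch.isalpha() for ch in url),
        "count_http": url.count("http"),
        "host_is_ip": int(host_is_ip(hostname)),
    }
    # One count per special character named in the feature list.
    for symbol in "-@?%.=":
        features[f"count_{symbol}"] = url.count(symbol)
    return features


# Hypothetical example URL, for illustration only.
print(extract_features("http://192.168.0.1/a/bc?x=1"))
```

The resulting dictionary can be turned into one row of a feature table, in a fixed column
order, before it is passed to a classifier.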

XGBoost is an implementation of gradient boosted decision trees designed for speed and
performance. It supports parallelization and distributed computing for faster training
of the model. It is sparse-aware, and can handle missing values efficiently. It also supports
continued training: an already trained model can be boosted further on new data.

The API also has an endpoint that can rebuild the model in case anything goes wrong. The
backend then replies to the client’s request with the requested classification. As the server
is running on Azure App Service, it is secure by default: it is HTTPS-enabled, and it can be
scaled up or down depending on the requirements. It is also compatible with continuous
integration and continuous deployment.

RESULTS

Image 1: GitHub Repository for Backend

Image 2: GitHub Repository for Extension

Image 3: Pre-processed Dataset

Image 4: Correlation Heatmap

Image 5: Azure App Service

Image 6: FastAPI backend

Image 7: Extension added to Mozilla Firefox

Image 8: Extension successfully blocks malicious-looking URL

Image 9: Extension allows normal website to render

MODEL CREATION:
I. The model takes approximately 21 seconds to train
II. It provides an accuracy of 99.72%
III. Recall Score = 0.9917, Precision Score = 0.9961
IV. A Jupyter notebook of the tests is available in the GitHub repository for the backend
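
For reference, accuracy, precision and recall all follow from the four counts of a
confusion matrix. The sketch below shows the arithmetic; the function name and the counts
in the example are hypothetical and are not the counts from the experiment above.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision and recall from confusion-matrix counts.

    Assumes at least one predicted positive and at least one actual positive,
    so that none of the denominators is zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all URLs classified correctly
        "precision": tp / (tp + fp),     # share of flagged URLs that are malicious
        "recall": tp / (tp + fn),        # share of malicious URLs that are flagged
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(metrics)
```

For a URL blocker, a low precision means benign sites get blocked, while a low recall
means malicious sites slip through, so both are worth reporting alongside accuracy.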

BROWSER EXTENSION:

I. Registered 2 scripts in the manifest of the browser extension: background.js and
content_script.js
II. The background script intercepts all web requests and, with the help of the backend,
decides whether to allow each request or not.
III. The background script uses the content script to give the user feedback on blocking a
malicious URL.

APPLICABILITY CATEGORY

Preventing malware from invading systems and protecting individuals from phishing and
other cyber-attacks is relevant to nearly every section of society and every kind of
organisation.

In society, unsuspecting individuals, especially elderly people, frequently fall prey to
phishing attacks and other schemes. Preventing their mobile phones and computers from
accessing websites that perpetrate these attacks, and thereby preventing the attackers from
exploiting them, would benefit society greatly.

In health care, hospitals and health care companies often hold a large database of personal
information pertaining to their patients, in addition to the information required to run a
business. Protecting this data is vital to the operation of hospitals and to the protection
of the general population’s privacy.

In social networking, there are two aspects to consider, the first being the circulation of
malicious URLs and the second being unauthorised access to social media accounts. On
platforms like WhatsApp and Telegram, as well as in group chats on platforms like Facebook
and Instagram, links to videos, articles and posts are often shared and forwarded multiple
times, with no guarantee that these links are harmless. Restricting access to the websites
behind these links is key to staying safe from cyber-attacks. In addition, there are
innumerable clones of Facebook, LinkedIn and a multitude of other social networking sites
that pose as the original, trick users into signing in with their credentials, and use those
credentials to access their accounts.

With respect to interdisciplinary areas and team involvement, any domain that involves the
use of software, be it to store data, run a business or simply browse the web, would
benefit greatly from having an application that keeps its systems safe without users having
to meticulously check each website they visit. As for hardware, websites that overburden or
overwhelm a computer or mobile system’s RAM or processing unit can cause long-lasting
damage to the system. Protection from these websites is essential to any and all systems with
an active Internet connection.

CONCLUSION
I. Experimentally determined that there are certain patterns in a URL that can
help determine if it is malicious or not
II. Created a working application that can safeguard users from accessing
malicious and harmful websites

FUTURE ENHANCEMENTS
I. DNS servers of a URL can also be taken into consideration to perform predictive
analysis
II. The application’s backend can be hosted on a VPS instead of the free App Service tier
to avoid slow start-up after idle periods and to improve uptime and scalability
III. If permissible, content of the URL can also be verified to not contain malware before
user is permitted to visit the website

REFERENCES

I. Shishkova, T. (2022, February 21). Mobile malware evolution 2021. Securelist.
https://securelist.com/mobile-malware-evolution-2021/105876/
II. Spanning Cloud Apps. (2022, January 18). Cyberattacks 2021: Statistics From the Last
Year. Spanning.
https://spanning.com/blog/cyberattacks-2021-phishing-ransomware-data-breach-statistics/
III. Yuan, J., Chen, G., Tian, S., & Pei, X. (2021). Malicious URL detection based on
a parallel neural joint model. IEEE Access, 9, 9464-9472.

IV. Do Xuan, C., Nguyen, H. D., & Nikolaevich, T. V. (2020). Malicious URL
detection based on machine learning. International Journal of Advanced
Computer Science and Applications, 11(1).
V. Patil, D. R., & Patil, J. B. (2018). Malicious URLs detection using decision tree
classifiers and majority voting technique. Cybernetics and Information
Technologies, 18(1), 11-29.
VI. Bo, W., Fang, Z. B., Wei, L. X., Cheng, Z. F., & Hua, Z. X. (2021). Malicious URLs
detection based on a novel optimization algorithm. IEICE TRANSACTIONS on
Information and Systems, 104(4), 513-516.
