International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 7, July - 2015. ISSN 2348 4853

Application of Malicious URL Detection In Data Mining


Mr. Jadhav Bharat S.*1, Dr. Gumaste S.V. *2.
*1 M.E. student Department of Computer Engineering, Sharadchandra Pawar College of Engineering,
Dumbarwadi, Otur, Pune, Maharashtra, India
*2 Associate Professor And Head Department of Computer Engineering, Sharadchandra Pawar College of
Engineering, Dumbarwadi , Otur, Pune, Maharashtra, India
bharatjadhav754@gmail.com*1 , svgumaste@gmail.com*2

ABSTRACT
Many recent computer attacks are launched when a user visits a malicious webpage. A malicious URL is a
link pointing to a malware or phishing site, and it may then propagate through the victim's contact list.
A user can be tricked into voluntarily giving away confidential information on a phishing page, or fall
victim to a drive-by download that results in a malware infection. Moreover, attackers sometimes use
social engineering tricks that make malicious URLs hard to identify. In this paper we construct a
real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing,
exploits, and so on). The techniques we examine classify URLs based on their lexical and host-based
features, and use online learning to process large numbers of examples and adapt quickly as URLs evolve
over time. However, in a real-world malicious URL detection task, the ratio between the number of
malicious URLs and legitimate URLs is highly imbalanced, which makes simply optimizing prediction
accuracy inappropriate. Another limitation of previous work is the assumption that a large amount of
training data is available, which is impractical because human labeling is expensive. To address these
issues, we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL).
Index Terms : Malicious URL Detection, Cost-Sensitive Learning, Online Learning, Active Learning.

I. INTRODUCTION

The World Wide Web gives people access to a vast amount of information, but it also carries fraudulent
content such as counterfeit drugs and malware. It supports criminal enterprises such as spam-advertised
commerce (e.g., counterfeit watches or pharmaceuticals) and financial fraud (e.g., via phishing), and it
serves as a vector for propagating malware (e.g., so-called drive-by downloads) [1][2]. A user reaches all
kinds of information (trusted or suspicious) on the Web by clicking on a URL (Uniform Resource Locator)
that links to a particular website. It is thus important for Internet users to assess the risk of clicking a
URL so that they can avoid accessing malicious websites.
Although the exact adversary mechanisms behind web criminal activities may vary, they all try to lure users
into visiting malicious websites by clicking a corresponding URL [3]. The motivations behind these schemes
may differ, but what they have in common is the requirement that unsuspecting users visit their sites.
12 | 2015, IJAFRC All Rights Reserved

www.ijafrc.org

Visits to these sites can be driven by email, Web search results, or links from other Web pages, but all
require the user to take some action, such as clicking, that specifies the desired URL. A URL is called
malicious (also known as black) if it is created for a malicious purpose and leads a user to a specific
threat that may become an attack, such as spyware, malware, or phishing. Malicious URLs are a major risk
on the Web, so detecting them is an essential task in network security intelligence. If users could be
informed before visiting that a particular URL is dangerous, the problem could be avoided.
The security community has responded by developing blacklisting services, which tell the user whether a
particular site is malicious. These blacklists are constructed by extracting features from the URL;
depending on the features, a classifier divides URLs into a white list and a black list. However, many
suspicious sites are not blacklisted, either because they were launched recently, were never visited by a
user, or were evaluated incorrectly (e.g., due to cloaking) [4][5][6][7]. To address this problem, some
client-side systems analyze the content or behavior of a Web site as it is visited. But, in addition to
run-time overhead, these approaches can expose the user to the very browser-based vulnerabilities that we
seek to avoid.
In this paper, we focus on a complementary part of the design space: URL classification, that is,
classifying the reputation of a Web site based entirely on its URL. The motivation is to provide inherently
better coverage than blacklisting-based approaches (e.g., correctly predicting the status of new sites)
while avoiding the client-side overhead and risk of approaches that analyze Web content on demand
[8][9][10][11]. In particular, we explore the use of statistical methods from machine learning for
classifying site reputation based on the relationship between URLs and the lexical and host-based features
that characterize them.
II. PROBLEM STATEMENT
We treat URL reputation as a binary classification problem in which positive examples are malicious URLs
and negative examples are benign URLs. This approach can succeed if the distribution of extracted feature
values for malicious examples differs from that for benign examples, and if the training set shares the
same feature distribution as the testing set.
We classify URLs based only on the relationship between URLs and the lexical and host-based features that
characterize them. We do not consider two other potentially useful sources of information for features:
the content of the page that the URL points to, and the context of the URL [12][13][14][15] (e.g., the page
or email in which the URL was embedded).
Although this information could improve classification accuracy, we exclude it for the following reasons.
1. Avoiding the download of page content is safer for users.
2. Classifying a URL with a trained model is a lightweight operation compared to first downloading the
page and then using its contents for classification.
3. Concentrating on URL features makes the classifier applicable to any context in which URLs are found
(Web pages, email, chat, calendars, games, etc.), rather than dependent on a particular application
setting.
4. Reliably obtaining the malicious version of a page for both training and testing can become a difficult
practical issue. Suspicious sites have shown the ability to cloak the content of their Web pages, that is,
to show different content to different clients. For example, a suspicious server may send benign versions
of a page to honeypot IP addresses that belong to security practitioners, but send suspicious versions to
other clients [10][15].
III. URL RESOLUTION
URLs are human-readable text strings. Through a multistep resolution process, browsers translate each URL
into instructions that locate the server hosting the site and specify where the site or resource is placed
on that host [6][8][10][14]. The standard syntax of a URL is
[protocol]://[hostname][path]

Fig. 1. Example of a Uniform Resource Locator (URL) and its components.


Components of a URL
1) Protocol: This portion of the URL indicates which network protocol should be used to fetch
the requested resource. The most commonly used protocol is the Hypertext Transfer Protocol, or HTTP
(http). In Figure 1, the example URL specifies the HTTP protocol [10].
2) Path: This portion of a URL is analogous to the path name of a file on a local computer. In Figure 1,
the path in the example is /~jtma/url/example.html. The path tokens, delimited by various
punctuation such as slashes, dots, and dashes, show how the site is organized [11].
3) Hostname: This portion of the URL is the identifier for the Web server. For the machine it is
ultimately an Internet Protocol (IP) address, but from the user's perspective it is a human-readable
domain name. IPv4 addresses are 32-bit integers that are usually written as a dotted quad: the 32-bit
address is divided into four 8-bit bytes, each written in decimal and separated by dots [15].
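The decomposition described above can be sketched with Python's standard library. This is only an illustration: the hostname www.example.edu is a placeholder (only the path appears in the text), and the function names are hypothetical.

```python
import ipaddress
import re
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the protocol, hostname, and path components."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "path": parts.path,
    }


def path_tokens(path):
    """Tokenize a path on slashes, dots, and dashes, dropping empty tokens."""
    return [tok for tok in re.split(r"[/.\-]+", path) if tok]


def dotted_quad_to_int(address):
    """Convert an IPv4 dotted quad (four 8-bit bytes) to its 32-bit integer."""
    return int(ipaddress.IPv4Address(address))


# Placeholder hostname; the path is the one used in the Figure 1 example.
example = split_url("http://www.example.edu/~jtma/url/example.html")
tokens = path_tokens(example["path"])
```

The path tokens produced this way are the kind of lexical units a classifier can later use as features, without ever fetching the page.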
IV. FEATURES OF URL
In this paper we propose a system for detecting malicious URLs. To implement and evaluate the proposed
system we collected a total of 700 website URLs, of which 500 are legitimate (containing no malicious data)
and the remaining 200 are malicious. The malicious URLs are placed at random positions in the dataset. To
determine which URLs are malicious, we extract the following features: length, number of dots, TTL, Get
Info, date, and Who Is Connect.
1) Length: The length of the website URL is one basis on which we can detect whether the URL is
legitimate or malicious.
2) Number of dots (full stops): We can also detect whether the URL is malicious on the basis of the number
of dots present in the whole URL.
3) TTL: Time to live (TTL) is a value in an Internet Protocol (IP) packet that tells a network router
whether the packet has been in the network too long and should be discarded. For a number of reasons,
packets may not be delivered to their destination in a reasonable length of time.
4) Get Info: This feature holds the information about the registrar, i.e., in whose name the website is
registered, together with the complete details of that registrar. It is another feature on the basis of
which we can detect whether the URL is malicious.
5) Date: The date on which the website URL was launched can also help in detecting whether the website is
malicious.
6) Who Is Connect: This gives information about the server, i.e., when it was registered (the
registration date), the name of the registrar, and other information about the website.
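As a rough illustration of the feature list above, the two lexical features (length and number of dots) can be computed directly from the URL string, while the host-based values must come from an external lookup. In the sketch below the TTL and registration date are simply passed in by the caller; the function names, the domain-age feature, and the example values are assumptions made for illustration, not part of the proposed system.

```python
from datetime import date


def lexical_features(url):
    """Features computed from the URL string alone."""
    return {
        "length": len(url),          # total number of characters
        "num_dots": url.count("."),  # dots anywhere in the URL
    }


def host_features(ttl, registration_date, today):
    """Host-based features from values obtained elsewhere (e.g. a WHOIS lookup).

    The values are supplied by the caller; no network access happens here.
    """
    return {
        "ttl": ttl,
        "domain_age_days": (today - registration_date).days,
    }


def feature_vector(url, ttl, registration_date, today):
    """Merge lexical and host-based features into one ordered numeric list."""
    merged = {**lexical_features(url), **host_features(ttl, registration_date, today)}
    return [float(merged[k]) for k in ("length", "num_dots", "ttl", "domain_age_days")]


# Hypothetical example values.
vec = feature_vector("http://www.example.com/index.html", 64,
                     date(2015, 1, 1), date(2015, 7, 1))
```

Keeping the lookup outside these functions means the same code can be used whether the host information comes from a live query or from a stored dataset.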
V. SYSTEM ARCHITECTURE

Figure 2: Framework for malicious URL detection


The primary goal of this paper is to develop a framework that copes with the fact that obtaining the true
class label of every example is impractical, and that takes the cost of misclassification into account when
updating the classifier whenever it suffers a loss. To meet this goal we propose Online Dynamic Learning
with Cost Sensitivity (ODLCS). The goal of supervised malicious URL detection is to build a predictive
model that can accurately predict whether an incoming URL instance is malicious or not
[10][11][12][13][14]. In general, this can be formulated as a binary classification task in which malicious
URL instances belong to the positive class ("+1") and normal URL instances belong to the negative class
("-1"). For an online malicious URL detection task, the goal is to develop an online learner that
incrementally builds a classification model from a sequence of URL training instances in an online learning
fashion. In each learning round, the learner first receives a new incoming URL instance for detection; it
then applies the classification model to predict whether the instance is malicious or not; at the end of
the learning round, if the true class label of the instance can be revealed by the environment, the learner
uses the labeled example to update the classification model whenever the classification is incorrect. It is
therefore natural to apply online learning to online malicious URL detection. However, it is impractical to
directly apply an existing online learning technique to solve this problem [4][5][6][8].
This is because a regular online classification task usually assumes that the class label of every
incoming instance will be revealed, so that it can be used to update the classification model at the end of
every learning round. Clearly, it is impossible or highly expensive for the learner to query the class
label of every incoming instance in an online malicious URL detection task. To address this challenge, we
investigate a novel framework, ODLCS, as shown in Figure 2. In general, the proposed ODLCS framework tries
to address two key challenges in a systematic and synergic learning methodology:
1. The learner must decide when it should query the class label of an incoming URL instance.
2. The classifier must be updated in the most effective way whenever there is a new labeled URL instance.
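The two questions listed above can be made concrete with a small online learner. The sketch below is not the authors' ODLCS update rule, which this section does not spell out; it assumes a linear model, a margin-based random query rule (query with probability delta / (delta + |score|)), and a hinge-loss step scaled by a per-class cost. All names and default parameter values are hypothetical.

```python
import random


class CostSensitiveActiveLearner:
    """Illustrative online learner: decides when to query and how to update."""

    def __init__(self, n_features, delta=1.0, eta=0.1,
                 cost_pos=0.9, cost_neg=0.1, seed=0):
        self.w = [0.0] * n_features
        self.delta = delta                        # must be > 0; larger means more queries
        self.eta = eta                            # learning rate
        self.cost = {1: cost_pos, -1: cost_neg}   # cost of erring on each true class
        self.rng = random.Random(seed)
        self.queries = 0

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def should_query(self, x):
        # Challenge 1: ask for a label more often when the score is near zero.
        p = self.delta / (self.delta + abs(self.score(x)))
        return self.rng.random() < p

    def update(self, x, y):
        # Challenge 2: cost-weighted step, taken only when the margin is below 1.
        if y * self.score(x) < 1:
            step = self.eta * self.cost[y] * y
            self.w = [wi + step * xi for wi, xi in zip(self.w, x)]

    def process(self, x, oracle):
        """One learning round: predict, then optionally query the label and update."""
        y_hat = self.predict(x)
        if self.should_query(x):
            self.queries += 1
            self.update(x, oracle(x))
        return y_hat
```

Here `oracle` stands in for the human labeler or other source of true labels. Setting `cost_pos` higher than `cost_neg` makes a missed malicious URL move the model further than a false alarm does, which is one simple way to reflect the class imbalance.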
VI. DATASET
For the experiments we take the dataset from http://sysnet.ucsd.edu/projects/url/. The original dataset
was constructed to be roughly class-balanced. In the proposed system we produce a subset by sampling from
the original dataset, so that it is closer to a realistic distribution in which the number of normal URLs
is significantly larger than the number of malicious URLs.
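A minimal sketch of such a sampling step is given below. The target ratio of 100 normal URLs per malicious URL is an assumption chosen for illustration (the text does not state the ratio used), and the function names are hypothetical. The second function shows why plain accuracy is a poor objective on the resulting data: a classifier that labels everything as normal already scores very highly.

```python
import random


def subsample_imbalanced(malicious, benign, ratio, seed=0):
    """Keep every benign URL and sample malicious URLs so benign:malicious is about ratio:1."""
    rng = random.Random(seed)
    n_malicious = min(len(malicious), max(1, len(benign) // ratio))
    sampled = rng.sample(malicious, n_malicious)
    data = [(u, 1) for u in sampled] + [(u, -1) for u in benign]
    rng.shuffle(data)  # mix the classes so the stream order carries no label information
    return data


def majority_baseline_accuracy(data):
    """Accuracy of a trivial classifier that predicts 'normal' (-1) for every URL."""
    return sum(1 for _, y in data if y == -1) / len(data)


# Synthetic stand-ins for URLs; a real run would load the dataset instead.
malicious_urls = ["m%d" % i for i in range(500)]
benign_urls = ["b%d" % i for i in range(1000)]
stream = subsample_imbalanced(malicious_urls, benign_urls, ratio=100)
baseline = majority_baseline_accuracy(stream)  # about 0.99 with zero malicious URLs detected
```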
VII. CONCLUSION
In this paper we proposed a novel framework, Online Dynamic Learning with Cost Sensitivity (ODLCS), to
address real-world classification applications such as online malicious URL detection. We also extract
features from the URL and use them to classify URLs as positive or negative. After the classifier is
trained, each new URL is tested and classified as malicious or normal.
VIII. REFERENCES

[1] J. Wang, P. Zhao, and S. C. H. Hoi, "Cost-sensitive online classification," vol. 26, no. 10,
Oct. 2014.
[2] P. Zhao and S. C. H. Hoi, "Cost-sensitive online active learning with application to malicious URL
detection," School of Computer Engineering, Nanyang Technological University, Singapore,
Aug. 11-14, 2013.
[3] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in
Proc. 15th ECML, Pisa, Italy, 2004, pp. 39-50.
[4] B. R. Bocka, "Methods for multidimensional event classification: A case study using images from a
Cherenkov gamma-ray telescope," Nucl. Instrum. Meth., vol. 516, no. 2-3, pp. 511-528, 2004.
[5] G. Blanchard, G. Lee, and C. Scott, "Semi-supervised novelty detection," J. Mach. Learn. Res.,
vol. 11, pp. 2973-3009, Nov. 2010.
[6] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM CSUR, vol. 41, no. 3,
Article 15, 2009.
[7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority
over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321-357, 2002.
[8] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive
algorithms," J. Mach. Learn. Res., vol. 7, pp. 551-585, Mar. 2006.
[9] K. Crammer, M. Dredze, and F. Pereira, "Exact convex confidence weighted learning," in Proc. NIPS,
2008, pp. 345-352.
[10] P. Domingos, "Metacost: A general method for making classifiers cost-sensitive," in Proc. 5th ACM
SIGKDD Int. Conf. KDD, San Diego, CA, USA, 1999, pp. 155-164.
[11] M. Dredze, K. Crammer, and F. Pereira, "Confidence-weighted linear classification," in Proc. 25th
ICML, Helsinki, Finland, 2008, pp. 264-271.
[12] C. Elkan, "The foundations of cost-sensitive learning," in Proc. 17th IJCAI, San Francisco, CA, USA,
2001, pp. 973-978.
[13] Y. Freund and R. E. Schapire, "Large margin classification using the perceptron algorithm," Mach.
Learn., vol. 37, no. 3, pp. 277-296, 1999.
[14] C. Gentile, "A new approximate maximal margin classification algorithm," J. Mach. Learn. Res.,
vol. 2, pp. 213-242, Dec. 2001.
[15] S. C. H. Hoi, R. Jin, P. Zhao, and T. Yang, "Online multiple kernel classification," Mach. Learn.,
vol. 90, no. 2, pp. 289-316, 2013.

AUTHORS PROFILE

Mr. Bharat S. Jadhav received the BE degree in Information Technology from Pravara Rural Engineering
College in 2012. During 2013-2014 he was a lecturer in the Computer Technology Department at Late
Hon. D. R. Kakade Polytechnic, Pimpalwandi. He is currently working at Tikona Digital Networks as a Network
Support Engineer and is pursuing a Master of Engineering at Sharadchandra Pawar College of Engineering,
Dumbarwadi, Otur, University of Pune.

Dr. S. V. Gumaste is currently working as Professor and Head, Department of Computer Engineering,
SPCOE-Dumbarwadi, Otur. Graduated from BLDE Association's College of Engineering, Bijapur, Karnataka
University, Dharwar in 1992 and completed post-graduation in CSE from SGBAU, Amravati in 2007.
Completed Ph.D. (CSE) in Engineering & Faculty at SGBAU, Amravati. Has around 22 years of teaching
experience.
