The SVM Based Interactive Tool for Predicting Phishing Websites
Santhana Lakshmi V
Research Scholar, PSGR Krishnammal College for Women, Coimbatore, Tamilnadu
sanlakmphil@gmail.com

Vijaya MS
Associate Professor, Department of Computer Science, GRG School of Applied Computer Technology, Coimbatore, Tamilnadu
msvijaya@grgsact.com
Abstract
Phishing is a form of social engineering in which attackers endeavor to fraudulently retrieve legitimate users' confidential or sensitive credentials by imitating electronic communications from a trustworthy or public organization in an automated fashion. Such communications are carried out through email or through deceitful websites that collect credentials without the users' knowledge. A phishing website is a mock website whose look and feel is almost identical to that of the legitimate website, so internet users expose their data believing that these websites come from trusted financial institutions. Several antiphishing methods have been introduced to prevent people from becoming victims of these attacks; regardless of the efforts taken, phishing attacks have not abated. Hence it is essential to detect phishing websites in order to preserve valuable data. This paper demonstrates the modeling of the phishing website detection problem as a binary classification task and provides a solution based on the support vector machine, a pattern classification algorithm. The detection model is generated by learning features extracted from phishing and legitimate websites. A third-party service called a 'blacklist' is used as one of the features and helps to identify phishing websites effectively. Various experiments have been carried out, and the performance analysis shows that the SVM-based model performs well.
Keywords: Antiphishing, Blacklist, Classification, Machine Learning, Phishing, Prediction
I. INTRODUCTION
Phishing is a novel crossbreed of computational intelligence and technical attack designed to elicit personal information from the user. The collected information is then used for a number of flagitious deeds, including fraud, identity theft and corporate espionage. The growing frequency and success of these attacks have led a number of researchers and corporations to take the problem seriously.

Various methodologies are adopted at present to identify phishing websites. Maher Aburrous et al. propose an approach for intelligent phishing detection using fuzzy data mining, in which two criteria are taken into account: URL and domain identity, and security and encryption [1]. Ram Basnet et al. adopt a machine learning approach for detecting phishing attacks; a biased support vector machine and a neural network are used for efficient prediction of phishing websites [2]. Ying Pan and Xuhua Ding use anomalies that exist in web pages to detect mock websites, with a support vector machine as the page classifier [3]. Anh Le and Athina Markopoulou of the University of California use lexical features of the URL to predict phishing websites; the algorithms used for prediction include the support vector machine, online perceptron, Confidence-Weighted learning, and Adaptive Regularization of Weights [4]. Troy Ronda has designed an antiphishing tool that does not rely completely on automation to detect phishing; instead it relies on user input and external repositories of information [5].

In this paper, the detection of phishing websites is modelled as a binary classification task, and a powerful machine-learning-based pattern classification algorithm, namely the support vector machine, is employed to implement the model. Training on the features of phishing and legitimate websites creates the learned model.

The feature extraction method presented here is similar to those presented in [3], [6], [7] and [8]. Features used in those works, such as foreign anchor, nil anchor, IP address, dots in page address, dots in URL, slash in page address, slash in URL, foreign anchor in identity set, use of the @ symbol, server form handler (SFH), foreign request, foreign request URL in identity set, cookie, SSL certificate, search engine, and 'Whois' lookup, are taken into account in this work. However, some features, such as hidden fields and the age of the domain, are omitted since they do not contribute much to predicting phishing websites.

A hidden field is similar to the text box used in HTML, except that the hidden box and the text within it are not visible, as they are in the case of a text box. Legitimate websites also use hidden fields to pass the user's information from one form to another without forcing the user to re-type it over and over again. So the presence of a hidden field in a webpage cannot be considered a sign of a phishing website.

Similarly, the age of the domain specifies the lifetime of a website on the web. Details regarding the lifetime of a website can be extracted from the 'Whois' database, which contains the registration information of all registered users.
Legitimate websites have long lifetimes compared to phishing websites. But this feature cannot be used to recognize phishing websites, since phishing web pages hosted on compromised web servers also have long lifetimes. The article [9] provides empirical evidence according to which 75.8% of the phishing sites analyzed (2,486 sites) were hosted on compromised web servers to which the phishers obtained access through Google hacking techniques.

This research work makes use of certain features that were not taken into consideration in [6]: 'Whois' lookup and the server form handler. 'Whois' is a request-response protocol used to fetch registered customer details from a database. The database contains information such as the primary domain name, registrar, registration date, and expiry date of a registered website. Legitimate website owners are registered users of the 'Whois' database, while the details of phishing websites will not be available there. So the existence of a website's details in the 'Whois' database is evidence of legitimacy, and it is essential to use this feature for identifying phishing websites.

Similarly, in the case of the server form handler, HTML forms that include textboxes, checkboxes, buttons and the like are used to pass data given by the user to a server. Action is a form handler and is one of the attributes of the form tag; it specifies the URL to which the data should be transferred. In the case of phishing websites, it specifies the domain that embezzles the user's credential data. Even though some legitimate websites use third-party services and hence may contain a foreign domain, this is not the case for all websites. So it is cardinal to check the handler of each form: if the handler of a form points to a foreign domain, the website is considered to be phishing; if the handler refers to the same domain, the website is considered legitimate. Thus these two features are essential and are expected to contribute much to classifying websites (a concrete sketch of the form handler check appears at the end of this section).

The research work described here also uses a third-party service named 'Blacklist' for predicting websites accurately. The blacklist contains a list of phishing and suspected websites, and the page URL is checked against it to verify whether the URL is present in the blacklist. The processes of identity extraction and feature extraction are described in the following section, and the various experiments carried out to evaluate the performance of the models are demonstrated in the rest of this paper.
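As a concrete illustration of the server form handler check described above, the following minimal sketch flags a form whose action attribute points to a foreign domain. It is not the authors' implementation; the use of Python with BeautifulSoup and the helper name are assumptions made for illustration.

# Minimal sketch of the server form handler (SFH) check.
# Assumes BeautifulSoup (bs4) for parsing; the function name is illustrative.
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def sfh_is_suspicious(page_url: str, html: str) -> bool:
    """Return True if any form posts its data to a foreign domain."""
    page_domain = urlparse(page_url).netloc.lower()
    soup = BeautifulSoup(html, "html.parser")
    for form in soup.find_all("form"):
        action = (form.get("action") or "").strip()
        if not action:                      # empty, self-posting forms are not foreign
            continue
        action_domain = urlparse(action).netloc.lower()
        # A relative action such as "/login.php" has no netloc and stays on-domain.
        if action_domain and action_domain != page_domain:
            return True                     # handler points to a foreign domain
    return False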
II. PROPOSED PHISHING WEBSITE DETECTION MODEL
Phishing websites are replicas of legitimate websites. A website can be mirrored by downloading and reusing the source code used to design it. Before acquiring these websites, their source code is captured and parsed for DOM objects, and the identities of the websites are extracted from those DOM objects. The main phases of phishing website prediction are identity extraction and feature extraction. Essential features that contribute to detecting the category of a website, whether phishing or legitimate, are extracted from the URL and source code in order to envisage phishing websites accurately. A training dataset with instances pertaining to legitimate and phishing websites is developed and used for learning the model. The trained model is then used for predicting unseen website instances. The architecture of the system is shown in Figure 1.
Figure 1. System Architecture
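To make the overall pipeline concrete, here is a minimal sketch using scikit-learn's SVC. The toy training values, the two-row dataset and the RBF kernel are all assumptions made for illustration; the paper does not publish its exact training configuration.

# Hedged sketch of the detection pipeline: train an SVM on website
# feature vectors (+1 = legitimate, -1 = phishing), then predict.
import numpy as np
from sklearn.svm import SVC

# Each row is a 17-element feature vector as described in Section III;
# the values below are placeholders, not real extracted data.
X_train = np.array([
    [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],  # legitimate
    [-1, -1, -1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],  # phishing
])
y_train = np.array([1, -1])

model = SVC(kernel="rbf")        # kernel choice here is an assumption
model.fit(X_train, y_train)

x_unseen = np.array([[1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
print(model.predict(x_unseen))   # prints [1] or [-1]

In practice the training matrix would hold one row per crawled website, with feature values in {-1, 1} computed as described in the following sections.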
A. Identity Extraction
The identity of a web page is a set of words that uniquely determines the proprietorship of the website. Identity extraction should be accurate for the successful prediction of phishing websites. Even though a phishing artist creates a replica of a legitimate website, there are some identity-relevant features that cannot be exploited, since changing these features affects the similarity of the website. This paper employs the anchor tag for identity extraction: the value of the href attribute of the anchor tag has a high probability of being an identity of the web page. Features extracted in the identity extraction phase include the META title, META description, META keywords, and the href of the <a> tag.

META Tag

The <meta> tag provides metadata about the HTML document. Metadata is not displayed on the page, but is machine parsable. Meta elements are typically used to specify the page description, keywords, author of the document, last-modified date and other metadata. The <meta> tag always goes inside the head element. The metadata is used by browsers to display the content or to reload the page, by search engines, and by other web services.

META Description Tag

The META description tag is a snippet of HTML code that comes inside the <head></head> section of a web page. It is usually placed after the title tag and before the META keywords tag, although the order is not important. The proper syntax for this HTML tag is:

<META NAME="Description" CONTENT="Your descriptive sentence or two goes here.">
The identity-relevant object is the value of the content attribute, which gives a brief description of the webpage. There is a high possibility for the domain name to appear in this place.

META Keyword Tag

The META keyword tag is used to list the keywords and keyword phrases targeted for that specific page:

<META NAME="keywords" CONTENT="META Keywords Tag, Metadata Elements, Indexing, Search Engines, Meta Data Elements">

The value of the content attribute provides keywords related to the web page.

HREF

The href attribute of the <a> tag indicates the destination of a link. The value of the href attribute is a URL to which the user is to be directed when the hyperlinked text is selected. Phishers tend not to change this value, since any change in the appearance of the webpage may reveal to users that the website is forged. So the domain name in this URL has a high probability of being the identity of the website.

Once the identity-relevant features are extracted, they are converted into individual terms by removing stop words such as http, www, in, com, etc., and by removing words of length less than three, since the identity of a website is not expected to be very small. The tf-idf weight is evaluated for each of the keywords, and the first five keywords with the highest tf-idf values are selected for the identity set. The tf value is calculated using the following formula:

$tf_{ij} = \dfrac{n_{ij}}{\sum_{k} n_{kj}}$   (1)

where $n_{ij}$ is the number of occurrences of term $t_i$ in document $d_j$ and $\sum_{k} n_{kj}$ is the number of all terms in document $d_j$.

$idf_i = \log \dfrac{|D|}{|\{d_j : t_i \in d_j\}|}$   (2)

where $|D|$ is the total number of documents in the dataset and $|\{d_j : t_i \in d_j\}|$ is the number of documents in which term $t_i$ appears. To find the document frequency of a term, WebAsCorpus, a ready-made frequency list, is used; the list contains words and the number of documents in which each word appears. The total number of documents is taken to be the document frequency of the most frequent term, i.e., the highest-frequency term is assumed to be present in all the documents. The tf-idf weight is then calculated using the following formula:

$tfidf_{ij} = tf_{ij} \times idf_i$   (3)

The keywords that have a high tf-idf weight are considered to have a greater probability of being the web page identity.
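To make the identity-extraction phase concrete, the sketch below strings the pieces together: it collects the META title, description and keywords plus anchor hrefs, tokenizes them, and ranks terms by the tf-idf of equations (1)-(3). The use of BeautifulSoup is an assumption (the paper names no library), and the small doc_freq table merely stands in for the WebAsCorpus frequency list with illustrative values.

# Hedged sketch of the identity-extraction phase: gather identity-relevant
# strings, tokenize them, and keep the five terms with the highest tf-idf.
import math
import re
from bs4 import BeautifulSoup

STOP_WORDS = {"http", "www", "com"}        # per the paper; list abbreviated here
doc_freq = {"paypal": 120_000, "account": 3_500_000, "secure": 900_000}
TOTAL_DOCS = max(doc_freq.values())        # most frequent term assumed in all docs

def identity_terms(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    if soup.title and soup.title.string:
        texts.append(soup.title.string)
    for meta in soup.find_all("meta"):
        if meta.get("name", "").lower() in ("description", "keywords"):
            texts.append(meta.get("content", ""))
    for a in soup.find_all("a", href=True):
        texts.append(a["href"])
    words = re.findall(r"[a-z]+", " ".join(texts).lower())
    # Drop stop words and words shorter than three characters.
    return [w for w in words if len(w) >= 3 and w not in STOP_WORDS]

def tf_idf(term: str, document: list[str]) -> float:
    tf = document.count(term) / len(document)             # equation (1)
    idf = math.log(TOTAL_DOCS / doc_freq.get(term, 1))    # equation (2)
    return tf * idf                                       # equation (3)

def identity_set(html: str) -> list[str]:
    terms = identity_terms(html)
    scores = {t: tf_idf(t, terms) for t in set(terms)}
    return sorted(scores, key=scores.get, reverse=True)[:5]   # top five keywords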
III. FEATURE EXTRACTION AND GENERATION
Feature extraction plays an important role in improving classification effectiveness and computational efficiency. Distinctive features that help to predict phishing websites accurately are extracted from the corresponding URL and source code. In HTML source code there are many characteristics and features that can distinguish the original website from a forged one. A set of 17 features is extracted for each website to form a feature vector; these features are explained below.
Foreign Anchor

An anchor tag contains an href attribute whose value is a URL to which the page is linked. If the domain name in this URL is not the same as the domain in the page URL, it is considered a foreign anchor. The presence of too many foreign anchors is a sign of a phishing website. So all the href values of the <a> tags used in the web page are examined and checked for foreign anchors. If the number of foreign anchors is excessive, the feature F1 is assigned the value -1; if the webpage contains only a minimal number of foreign anchors, the value of F1 is 1.
Nil Anchor

Nil anchors denote that the page is linked to no page; the value of the href attribute of the <a> tag will be null. The values that denote a nil anchor are about:blank, javascript::, javascript:void(0), and #. If these values exist, the feature F2 is assigned the value -1; otherwise F2 is assigned 1.
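A sketch of how features F1 and F2 might be computed from a page's anchors follows. Since the paper does not state the exact cut-off for "too many" foreign anchors, the majority threshold below is an assumption.

# Sketch of features F1 (foreign anchor) and F2 (nil anchor).
from urllib.parse import urlparse
from bs4 import BeautifulSoup

NIL_HREFS = {"about:blank", "javascript::", "javascript:void(0)", "#"}

def anchor_features(page_url: str, html: str) -> tuple[int, int]:
    page_domain = urlparse(page_url).netloc.lower()
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=True)
    foreign = sum(
        1 for a in anchors
        if urlparse(a["href"]).netloc.lower() not in ("", page_domain)
    )
    has_nil = any(a["href"].strip() in NIL_HREFS for a in anchors)
    # "More foreign than local anchors" is an assumed cut-off, not the paper's.
    f1 = -1 if anchors and foreign > len(anchors) / 2 else 1
    f2 = -1 if has_nil else 1
    return f1, f2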
IP Address

The main aim of phishers is to gain a lot of money with no investment, so they will not spend money to buy domain names for their fake websites. Most phishing websites therefore contain an IP address as their domain name. If the domain name in the page address is an IP address, then the value of the feature F3 is -1; otherwise the value of F3 is 1.
Dots in Page Address

The page address should not contain a large number of dots; if it does, that is a sign of a phishing URL. If the page address contains more than five dots, then the value of the feature F4 is -1; otherwise the value of F4 is 1.
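Features F3 and F4 reduce to simple string tests on the page address. A sketch, assuming IPv4 dotted-quad hostnames only:

# Sketch of features F3 (IP address as domain) and F4 (dots in page address).
import re
from urllib.parse import urlparse

IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")    # dotted-quad only; IPv6 omitted

def url_features(page_url: str) -> tuple[int, int]:
    host = urlparse(page_url).netloc.split(":")[0]  # drop any port
    f3 = -1 if IPV4_RE.match(host) else 1           # IP address in place of a domain
    f4 = -1 if page_url.count(".") > 5 else 1       # more than five dots is suspicious
    return f3, f4

print(url_features("http://192.168.10.5/secure/login.php"))   # prints (-1, 1)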
