International Journal of Research and Reviews in Computer Science (IJRRCS) Vol. 2, No. 2, April 2011

Phishing E-Mail Detection Using Ontology Concept and Naïve Bayes Algorithm
Mahdi Bazarganigilani
Charles Sturt University, School of Business and Economics, Melbourne, Australia

mahdi62b@yahoo.com

Abstract: The growing volume of phishing e-mails compels individuals to protect their valuable credentials. Attackers use social engineering to create phishing e-mails. In this paper we present an algorithm for text classification of phishing e-mails using a semantic ontology concept. We use the Naïve Bayes algorithm for classification.

Keywords: Phishing E-Mail Detection; Ontology Concept; Spam Classification; Naïve Bayes Algorithm; Term Frequency Variance.

1. Introduction
Phishing e-mails are e-mails used to steal people's important credentials. Usually, attackers build a website that imitates a popular website and persuade victims to enter their credentials. Such e-mails are sent to many users and waste valuable bandwidth. Many approaches have been proposed for spam identification, based either on text classification or on information in the e-mail headers. One approach uses header information such as the sender address and the number of recipients [1, 2]. Such approaches are successful, but attackers then choose other ways. In [3], an algorithm based on hashing and summarizing spam is presented. In [4], the authors distribute keys and use digital signatures to identify phishing e-mails. A common approach for identifying phishing e-mails is learning and data mining methods [5, 6]. Phishing e-mails share some special words that occur more frequently in their text. These methods use algorithms such as Naïve Bayes [6] to classify the text into the desired categories. We follow this approach.

In the following sections, we discuss the ontology concept and the Naïve Bayes classifier. We use a heuristic way to detect phishing e-mails. Our approach has two layers. First, it tries to identify a phishing e-mail according to some common features. If the e-mail is not recognized, it is classified by our proposed classifier. In the last section, we perform our experiments and show our results.

2. Naïve Bayes Algorithm
Naïve Bayes is a classifier used in text classification. It evaluates new texts using parameters estimated from previous (training) texts and assigns a probability to every category. The category with the highest probability is assigned to the new text. The problem consists of many samples. Every sample has a target value $v_j \in V$ and a set of attributes $a_1, a_2, \ldots, a_n$. The learning algorithm computes the probability of each target value for the sample and assigns the best one. This maximum is obtained according to the following formula.

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

$v_{NB}$ is the output of the classifier, $v_j$ denotes class $j$, $P(v_j)$ is the prior probability of class $j$, and $P(a_i \mid v_j)$ is the conditional probability of attribute $i$ in class $j$. The classifier selects the class $v_j$ that maximizes this product as the output.

To use Naïve Bayes in our spam classification, we need to know how to represent the texts, and we need to compute the probabilities in the Naïve Bayes formula. We consider every word as an attribute and its frequency as the value. Therefore, Naïve Bayes applies directly to our spam classification. We must also compute the conditional probabilities and the probability $P(v_j)$ of every category, which is equal across categories in our case, since we use the same number of training e-mails for spam and normal e-mails. To find the conditional probabilities, we define a set named vocabulary that includes all the distinct words in the training set.
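The classifier described in this section can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes tokenized e-mails, equal class priors, and add-one smoothed conditional probabilities of the form $(n_k + 1)/(n + |vocabulary|)$. The function names `train` and `classify` and the toy e-mails are hypothetical. Log-probabilities replace the raw product to avoid numeric underflow, which does not change the arg max, and words outside the vocabulary are simply ignored (an assumption).

```python
import math
from collections import Counter

def train(emails_by_class):
    """emails_by_class maps a class label to a list of tokenized e-mails."""
    counts = {label: Counter(w for mail in mails for w in mail)
              for label, mails in emails_by_class.items()}
    vocabulary = set().union(*counts.values())  # distinct words of all classes
    return counts, vocabulary

def classify(words, counts, vocabulary):
    """Return the class maximizing the product of smoothed P(w | class)."""
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        n = sum(word_counts.values())  # total word occurrences in this class
        score = 0.0  # equal priors are dropped from the arg max
        for w in set(words) & vocabulary:  # distinct known words of the e-mail
            score += math.log((word_counts[w] + 1) / (n + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
training = {
    "phishing": [["verify", "account", "password"],
                 ["account", "suspended", "verify"]],
    "normal": [["meeting", "schedule", "monday"],
               ["project", "schedule", "report"]],
}
counts, vocabulary = train(training)
print(classify(["please", "verify", "your", "account"], counts, vocabulary))
```

With this toy data the example e-mail is assigned to the `phishing` class, because `verify` and `account` occur only in the phishing training e-mails.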

The conditional probability of every word is computed as follows.

$$P(a_i = w_k \mid v_j) = \frac{n_k + 1}{n + |vocabulary|}$$

Here $n_k$ is the number of occurrences of word $w_k$ in all e-mails of class $v_j$, $n$ is the total number of word occurrences in class $v_j$, and $|vocabulary|$ is the number of distinct words in all training sets. Now, by considering the conditional probabilities and eliminating the (equal) probabilities of every category, a new e-mail can be classified:

$$v_{NB} = \arg\max_{v_j \in V} \prod_{i \in words} P(a_i \mid v_j)$$

In the above formula, $words$ is the set of all distinct words in the new e-mail. The classifier assigns the category with the highest probability to the new e-mail.

3. Ontology
An ontology is a notation system $O = (L, F, C, H, ROOT)$, defined as follows [7].
• A lexicon, $L$, including all the words.
• A set of concepts, $C$.
• A reference function, $F$, which maps words to sets of concepts. We write $Ref_C(t)$ for the set of concepts assigned to word $t$, and $Ref_C^{-1}(c)$ for the set of all words that refer to concept $c$.
• A hierarchy, $H$, which represents the hierarchy of the concepts. For example, $H(Hotel, Accommodation)$ denotes that Hotel is a sub-concept of Accommodation.
• A top-level concept named $ROOT$, which is the root of every concept.

4. Compiling Background Context
The main idea of compiling the background context of the text is to make the classifier capable of using knowledge about the text in its algorithm. Adding the concepts of the text to the text vectors has several advantages. Firstly, it solves the problem of synonymous words. Secondly, adding the upper concepts brings the text vector closer to the main gist of the text. For example, clustering algorithms cannot find any relation between Pork and Beef, whereas an ontology-based algorithm adds an upper concept such as Meat, so the two texts become correlated. The drawback of this method is that it places a heavier burden on text preprocessing steps such as stemming and stop-word elimination.

There are different strategies for adding the concepts of every word [8].
1. We could add the concepts in addition to the word.
2. We could just replace the words with their concepts.
3. We could also eliminate the words that do not have any meaning in our lexicon.
We use the Add strategy and eliminate the words with no concept.

5. Disambiguating Strategies
Many words have different meanings and synonyms. Adding all of them diminishes the quality of clustering. There are different strategies for adding synonyms. One is adding all synonyms. Obviously, this does not help much in disambiguating different meanings. Another strategy is First Synonym. Usually, in ontologies and dictionaries, different meanings and synonyms are sorted according to their importance and frequency. We select the first synonym that the ontology offers. The last strategy is the context-based strategy. According to the context of the text, we select the best synonym for the mapping function [8]. Firstly, we define the semantic vicinity of a concept $c$ as the set of its super- and sub-concepts:

$$V(c) = \{ b \in C \mid c \prec b \ \text{or}\ c \succ b \}$$

Furthermore, we obtain all the words that refer to a concept in this set:

$$U(c) = \bigcup_{b \in V(c)} Ref_C^{-1}(b)$$

The disambiguation function is then

$$Dis(d, t) = first \{ c \in Ref_C(t) \mid c \ \text{maximizes}\ tf(d, U(c)) \}$$

where $tf(d, U(c))$ is the number of occurrences of the words of set $U(c)$ in text $d$. That is, among the different meanings, we select the one that has the maximum number of synonyms in the underlying text.

Hierarchy of the Concepts
The main idea of this consideration is to include the super-concepts of a concept in the same way as its sub-concepts. For example, motor could additionally denote words like vehicle and concept car.
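The Add strategy and the context-based disambiguation can be sketched in Python as follows. The sketch only illustrates $V(c)$, $U(c)$ and $Dis(d, t)$ under stated assumptions: the toy ontology (`PARENT`, `REF`), the function names and the sample texts are hypothetical, and the vicinity is limited to direct super- and sub-concepts, which is one possible choice of hierarchy depth.

```python
# Hypothetical toy ontology: concept -> super-concept (the hierarchy H).
PARENT = {"river_bank": "landform", "money_bank": "institution",
          "shore": "river_bank", "loan_office": "money_bank"}
# Hypothetical lexicon: word -> candidate concepts, in the ontology's order.
REF = {"bank": ["river_bank", "money_bank"], "water": ["landform"],
       "shore": ["shore"], "loan": ["loan_office"], "credit": ["institution"]}

def vicinity(c):
    """V(c): direct super- and sub-concepts of concept c."""
    subs = {b for b, parent in PARENT.items() if parent == c}
    supers = {PARENT[c]} if c in PARENT else set()
    return subs | supers

def words_of(concepts):
    """U(c): all words that refer to any concept in the given set."""
    return {w for w, cs in REF.items() if concepts & set(cs)}

def disambiguate(doc, term):
    """Dis(d, t): first concept of the term with maximal tf(d, U(c))."""
    def tf(c):
        related = words_of(vicinity(c))
        return sum(1 for w in doc if w in related)
    return max(REF[term], key=tf)  # max() keeps the first maximal candidate

def add_concepts(doc):
    """Add strategy: keep words that have a concept and append that concept."""
    out = []
    for w in doc:
        if w in REF:  # words without any concept are eliminated
            out += [w, disambiguate(doc, w)]
    return out

doc = ["the", "bank", "approved", "the", "loan", "and", "credit"]
print(disambiguate(doc, "bank"))       # money_bank
print(add_concepts(["bank", "loan"]))  # ['bank', 'money_bank', 'loan', 'loan_office']
```

When no candidate concept has related words in the text, all scores tie at zero and the first concept listed for the word is returned, which coincides with the First Synonym strategy.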

Entering the super-concepts makes the text richer. On the other hand, entering the lower concepts with the context strategy disambiguates concepts with different synonyms [8].

Figure 1. Words hierarchy in an ontology.

By using the above strategies, we enter the ontology-based concepts into the text vector for the Naïve Bayes classifier.

6. Term Frequency Variance
The feature vector is one of the most important parts of text mining algorithms. There are many algorithms for evaluating the importance of words, such as Term Frequency Variance (TFV) and Information Gain (IG). Document Frequency is independent of the clusters. It simply considers the importance of a word according to its frequency. TFV solves this problem by favoring the words whose distributions differ strongly between the clusters. In this paper, we use the TFV approach. Considering the clusters $C = \{c_1, c_2, \ldots, c_k\}$, the TFV of feature $f$ is computed as follows:

$$TFV(f) = \sum_{i=1}^{k} [tf(f, c_i) - mean\_tf(f)]^2$$

Since TFV depends on the clusters, for every word we first compute its term frequency $tf(f, c_i)$ in every cluster $c_i$ and then the average $mean\_tf(f)$ of these frequencies over the clusters. We sort the words by TFV and select the best scores. We usually eliminate up to 10 % of the feature space. This increases the efficiency of the classifier by eliminating repetitive and redundant words.

7. Proposed Method and Experimental Results
Most phishing e-mails have common attributes that can be used to identify them. Usually, phishing attacks originate from computers without fixed DNS addresses. Such attacks use hex-based URLs to distract users from the real address (like http://19.13.0.1:2234/auBank.cgi?cu_id). The features include non-standard ports, IP addresses in hex format and character misuse. On the other hand, most attacks target a limited number of famous companies such as PayPal. Such companies do not send any e-mail with the phishing features described above. Therefore, we suppose that every e-mail that appears to come from such a company and has phishing features is a phishing e-mail [9]. If an e-mail claims to be from a popular company and, moreover, has a link to an irrelevant place or a link based on an IP address, it is considered a phishing e-mail.

We summarize the steps we performed for our algorithm. Firstly, our algorithm performs some preprocessing on the raw text of the e-mails: it eliminates HTML tags, stems words and eliminates stop words. In the next step, we extract the subject and sender address to see whether the e-mail matches the phishing features. If it does not have any phishing feature, it is sent to the next layer for text classification by the ontology and the Naïve Bayes classifier. Then, the text is fed to the ontology and related concepts are added according to the strategies. Afterwards, TFV is applied, so the vector space shrinks and redundant concepts are eliminated. Finally, the Naïve Bayes algorithm classifies the e-mails into spam or normal e-mails.

Figure 2. Five steps in our proposed method.
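The TFV-based feature selection step can be sketched as follows. This is a minimal Python illustration of the TFV formula, not the paper's implementation: it assumes that each cluster is given as a list of tokenized e-mails and that $tf(f, c_i)$ is the raw count of $f$ in cluster $c_i$. The function names, the `keep` parameter and the toy clusters are hypothetical.

```python
from collections import Counter

def tfv_scores(clusters):
    """clusters: list of clusters, each a list of tokenized e-mails."""
    # Term frequency of every word in every cluster.
    tfs = [Counter(w for mail in cluster for w in mail) for cluster in clusters]
    vocabulary = set().union(*tfs)
    scores = {}
    for f in vocabulary:
        mean_tf = sum(tf[f] for tf in tfs) / len(tfs)  # average over clusters
        scores[f] = sum((tf[f] - mean_tf) ** 2 for tf in tfs)
    return scores

def select_features(clusters, keep):
    """Return the `keep` words with the highest TFV (ties broken by name)."""
    scores = tfv_scores(clusters)
    return sorted(scores, key=lambda f: (-scores[f], f))[:keep]

# Hypothetical toy clusters: phishing e-mails and normal e-mails.
clusters = [
    [["verify", "account", "now"], ["verify", "password", "now"]],
    [["meeting", "now"], ["report", "now"]],
]
print(select_features(clusters, 1))  # ['verify']
print(tfv_scores(clusters)["now"])   # 0.0
```

In this toy data the word `now` occurs equally often in both clusters, so its TFV is zero and it would be among the first words removed, while `verify` occurs in only one cluster and receives the highest score.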

For our experiments, we used the Enron-spam dataset [10]. For phishing e-mails we used [11]. We considered 800 normal e-mails (spam and non-spam). We used 600 of them for training the Naïve Bayes classifier and the rest for the test dataset. We also used 200 e-mails from the phishing dataset, of which 150 were devoted to training. All algorithms were implemented in C#.

To evaluate the efficiency of our approach, we use the precision and recall metrics stated in [12].

Table 1. Evaluating parameters
                            Belonging to Phishing    Not Belonging to Phishing
Assigned to Phishing        tp                       fp
Not assigned to Phishing    fn                       tn

$$Precision = \frac{tp}{tp + fp} \qquad Recall = \frac{tp}{tp + fn}$$

Precision shows the accuracy of the algorithm, while recall represents the completeness of the algorithm's suggestions. There may be instances for which the algorithm cannot determine any category, which reduces the recall. We also use another parameter, F, which can be computed as follows.

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

Table 2. Results
Using Ontology Concept    Using TFV    Depth of Hierarchy    F (%)
False                     False        -                     81.73
False                     True         -                     90.35
True                      True         1                     86.32
True                      True         3                     94.87

In our experiments, as shown above, the best result is gained by using the context-based strategy for selecting among different synonyms. Moreover, by entering three levels of super- and sub-concepts, the results show improvement.

We should be aware that the ontology-based approach incurs overhead for our classifier. This process incurs more overhead for training the learning classifier. This is not that important when the classification is applied for offline purposes. It would be an option to include this process for the test datasets as well. Obviously, performing this ontology compilation for the test dataset and in online classification slows down the system. Thus, we include the text context compilation only for the learning classifier.

8. Conclusions
In this paper, we introduced a new text mining approach for identifying phishing e-mails. We first pre-classified the e-mails using some common features of phishing e-mails. We focused on text classification in our algorithm. Our approach helps the Naïve Bayes classifier by entering ontology-based concepts into the training and test datasets. The results clearly show better accuracy when ontology concepts are compiled into the text of the e-mails during classification.

References
[1] Cook, D., Hartnett, J., Manderson, K., Scanlan, J.: Catching Spam Before it Arrives: Domain Specific Dynamic Blacklists. In: Australian workshops on Grid computing and e-research, ACM Press, Tasmania, Australia (2006).
[2] Pfleeger, S. L., Bloom, G.: Canning Spam: Proposed Solutions to Unwanted Email. Security & Privacy Magazine, IEEE, vol. 3, no. 2, pp. 40-47 (2005).
[3] Damiani, E., et al.: An Open Digest-based Technique for Spam Detection. In: 4th international conference on peer-to-peer computing, IEEE Press, Switzerland (2004).
[4] Adida, B., Hohenberger, S., Rivest, R.: Fighting Phishing Attacks: A Lightweight Trust Architecture for Detecting Spoofed Emails. In: DIMACS Workshop on Theft in E-Commerce: Content, Identity, and Service, Piscataway, NJ, USA (2005).
[5] Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach To Filtering Junk E-Mail. In: AAAI Workshop on Learning for Text Categorization, Madison, Wisconsin, USA (1998).
[6] Graham, P.: Better Bayesian Filtering. In: Spam Conference (2003).
[7] S. Bloehdorn, P. Cimiano, A. Hotho and S. Staab, "An Ontology-based Framework for Text Mining", In Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003, 2003.
[8] H. MoeinZadeh, E. Asgarian, J. Habibi, Sh. Moaven, "New Text Document Clustering Method based on Ontology", published in 3th Conference on Information and Knowledge Technology, Mashhad, Iran, November 2007. (In Persian)
[9] A. Saberi, M. Vahidi, B. Minaei, "Learn to Detect Phishing Scams Using Learning and Ensemble Methods", IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Workshop on Web Security, Integrity, Privacy and Trust, Silicon Valley, USA, pp. 311-14, Nov 2007.
[10] Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam Filtering with Naive Bayes – Which Naive Bayes?. In: 3rd Conference on Email and AntiSpam, Mountain View, California, USA, July (2006).
[11] phishingcorpus, http://www.monkey.org/~jose/wiki/doku.php?id=phishingcorpus. Accessed 1 August 2010.
[12] Cleverdon, C. W. and Mills, J.: The testing of index language devices. Aslib Proceedings, vol. 15, no. 4, pp. 106–130, 1963.