Email Authorship Identification Using Radial Basis Function



Published by: ijcsis on Feb 15, 2011
Copyright:Attribution Non-commercial

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011
http://sites.google.com/site/ijcsis/
ISSN 1947-5500

A. Pandian
Asst. Professor (Senior Grade), Department of MCA, SRM University, Chennai, India
pandiana@ktr.srmuniv.ac.in

Dr. Md. Abdul Karim Sadiq
Ministry of Higher Education, College of Applied Sciences, Sohar, Sultanate of Oman
abdulkarim.soh@cas.edu.om

Abstract - Email authorship identification helps in tracking fraudulent emails. This research proposes extracting unique words from emails. These unique words are used as representative features to train a radial basis function (RBF) network. Final weights are obtained and subsequently used for testing. The percentage of correct authorship identification depends on the number of RBF centers and the type of functional words used for training the RBF. One hundred fifty authors with one hundred files from the sent folder of the Enron database are considered. A total of 300 unique words, with 3 to 7 characters per word, are considered. Training and testing of the RBF are done using words of different lengths. The percentage of authorship identification ranges from 95% to 97%. Simulation shows the effectiveness of the proposed RBF network for email authorship identification.

Keywords: email authorship identification; word frequency; radial basis function

I. INTRODUCTION

The principal objective of author identification is to classify the emails belonging to an author [Moshe 2002]. This approach is used in forensics for author identification in malicious emails. Some commercial software packages, such as copycatchgold, jvocalize, signature stylometric system, textaz, Antconc, yoshikoder, lexico3, T-lab and wordsmithtools, use statistical methods to identify an author. These packages use parameters such as the total number of different words, the number of content words used in the list, the total number of words in the text / vocabulary items used, vocabulary richness, mean sentence length, mean paragraph length, mean of 2-3 letter words, mean of vowel-starting words, the cumulative summation method, bigrams and many more. Users who intend to apply such software to email author identification need to choose the statistical analysis options that best identify the author of an email and obtain the characteristics that remain constant across a large number of emails written by the author. Each author follows a style, which is captured by what are called functional words. By using these functional words and their frequencies, identification of the author is easy [David 2005].

Authorship identification is important as the number of documents on the internet is increasing. Researchers have focused on different properties of texts. Two properties of texts are used in classification: the content of the text and the style of the author. Stylometry [Goodman 2007], the statistical analysis of literary style, complements traditional literary scholarship since it offers a means of capturing the often elusive character of an author’s style [Zheng 2006] by quantifying some of its features. Most stylometry studies [Pavelec 2007 and Diederich 2008] employ items of language, and most of these are lexically based.

The usefulness of function words in authorship attribution has been examined [Diederich 2003]. Experiments were conducted with support vector machine classifiers on twenty novels, and success rates above 90% were obtained. The use of functional words is a valid and good approach in authorship attribution [Koppel 2006].

Stamatatos 2001 measured success rates of 65% and 72% in a study of authorship recognition that implemented multiple regression and discriminant analysis. Joachim Diederich 2003 and his collaborators conducted experiments with support vector classifiers and detected the author with 60-80% success rates under different parameters.

The effect of word sequences in authorship attribution has been studied [Abbasi 2005]. The researchers aimed to consider both stylistic and topic features of texts. In this work the documents are identified by the set of word sequences that combine functional and content words. The experiments were done on a dataset consisting of poems using a naïve Bayes classifier [Peng 2004]; the researchers claim that they achieved good results.

II. MATERIALS AND METHODS
2.1 Materials

Work words, action words, and different categories of prepositions, pronouns, adjectives, adverbs, conjunctions and interjections are given in Table 1 to Table 3. These words are used as filters and as templates. When an email is analyzed for uniqueness, the extracted features are based on the lists of words presented in the tables. Hence, unnecessary words are eliminated and the number of unique words that represent an email is kept to a minimum.
TABLE 1 SAMPLE WORDS USED FOR FILTERING

Work (70)     Action (524)    Preposition_1 (94)    Preposition_2 (30)
analyze       Accelerate      Aboard                according to
annotate      Accommodate     About                 ahead of
ascertain     Accomplish      Above                 as of
attend        Accumulate      Absent                as per
audit         Achieve         Across                as regards
build         Acquire         After                 aside from
calculate     Act             Against               because of
consider      Activate        Along                 close to
construct     Adapt           Alongside             due to
control       Add             Amid                  except for

TABLE 2 SAMPLE WORDS USED FOR FILTERING

Preposition_3 (16)    Preposition_4 (9)    Pronoun (77)    Adjectives (395)
as far as             apart from           All             early
as well as            but                  Another         abundant
by means of           except               Any             adorable
in accordance with    plus                 anybody         adventurous
in addition to        save                 Anyone          aggressive
in case of            concerning           anything        agreeable
in front of           considering          Both            alert
in lieu of            regarding            Each            alive
in place of           worth                each other      amused
in point of                                Either          ancient

TABLE 3 SAMPLE WORDS USED FOR FILTERING

Adverbs (331)     Conjunctions (25)    Interjections (77)
Abnormally        And                  Absolutely
absentmindedly    But                  Achoo
Accidentally      For                  Ack
Acidly            Nor                  Agreed
Actually          Or                   Aha
Adventurously     So                   Ahem
Afterwards        Yet                  Ahh
Almost            after                Ahoy
Always            although             Alack
Angrily           as                   Alas
Work words: To avoid misinterpretation, work words are used to analyze how an author writes an email and how clearly the author expresses himself in it. The number of work words indicates how neatly and unambiguously task requirements are stated, since work words translate exactly what the author has in mind.

Action words: These indicate actions expressed in the email.

Prepositions, pronouns, adjectives, adverbs, conjunctions and interjections have their standard meanings.

The total number of words used as the basic dictionary is 1648 (work + action + prepositions + pronouns + adjectives + adverbs + conjunctions + interjections). The numbers in parentheses in the table headers are the totals in each category, whereas only a few words are shown in the tables for illustration.

A schematic diagram for implementation of the proposed work is presented in Figure 1.
Fig. 1 (a) Training the system: Emails -> Extract words -> Filter words using template -> Find the frequency and the words for each category -> Create author matrix -> Train RBF and store final weights

Fig. 1 (b) Testing the system: Emails -> Extract words -> Filter words using template words given -> Find the frequency and the words for each category -> Process with final weights -> Identify the author

Email: The email received in the system.
Extract words: All the words in the email are arranged.
Filter words: The words given in Tables 1-3 are searched for among the extracted words. Subsequently, the word frequencies are found.
Author matrix: A matrix with authors as columns and word frequencies as rows.
Training patterns: The columns of the matrix are used as training patterns, and labels are introduced.
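The extract, filter and author-matrix steps described for Figure 1 can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the tiny `TEMPLATE` dictionary, the function names and the sample emails are all hypothetical, the tokenizer handles single words only (multi-word prepositions such as "according to" would need phrase matching), and the paper's real dictionary holds 1648 words.

```python
import re
from collections import Counter

# Hypothetical miniature template. The paper's dictionary has 1648 words
# across eleven categories; Tables 1 to 3 show only samples.
TEMPLATE = {
    "work": {"analyze", "audit", "build"},
    "action": {"achieve", "act", "add"},
    "conjunction": {"and", "but", "or"},
}


def extract_words(email_text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", email_text.lower())


def filter_words(words, template=TEMPLATE):
    """Keep only words found in some template category and count them."""
    vocabulary = set().union(*template.values())
    return Counter(w for w in words if w in vocabulary)


def author_matrix(emails_by_author, template=TEMPLATE):
    """Build a matrix with one row per template word and one column per author."""
    row_words = sorted(set().union(*template.values()))
    authors = sorted(emails_by_author)
    counts = {}
    for author in authors:
        words = []
        for text in emails_by_author[author]:
            words.extend(extract_words(text))
        counts[author] = filter_words(words, template)
    # Counter returns 0 for words the author never used.
    matrix = [[counts[a][w] for a in authors] for w in row_words]
    return row_words, authors, matrix


# Made-up emails, used only to exercise the functions.
emails = {
    "author_a": ["Please audit the report and build the model.",
                 "Act now and add it."],
    "author_b": ["We achieve more, but analyze less or not at all."],
}
rows, authors, matrix = author_matrix(emails)
for word, row in zip(rows, matrix):
    print(f"{word:8s}", row)
```

Each column of `matrix` is then one training pattern, labeled with its author.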
 
 
2.2 Methods

2.2.1 Radial Basis Function

The concept of a distance measure is used to associate the input and output pattern values. RBFs are capable of producing approximations to an unknown function ‘f’ from a set of input data abscissae. The approximation is produced by passing an input point through a set of basis functions, each of which contains one of the RBF centers.

An exponential function is used as the activation function for the input data. The distances between the input data and a set of centers chosen from the input data are found and passed through the exponential activation function. A bias value of f is used along with the data. These data are further processed to get a set of final weights between the radial basis functions and the target value.

The topology of the RBF network is 12 nodes in the input layer, 10 nodes in the hidden layer and 1 node in the output layer. The difference between the input data and a center is passed through exp(-x); the result is called the RBF. A rectangular matrix is then obtained, for which the inverse is found. The resultant value is processed with all the inputs and target values to obtain the final weights.

The steps in Figure 2 are detailed below:

Read input pattern: The columns of the author matrix are used as training patterns. The number of patterns is equal to the number of authors.
Create center: One hundred training patterns are used as centers.
Create RBF: Calculate the distance between the training patterns and the one hundred centers. The resultant values are passed through the activation function, exp(-x), to produce the outputs of the RBF nodes in the hidden layer of the network.

The number of training patterns and the number of centers produce a rectangular matrix. This is converted into a square matrix, whose inverse is found and processed with the labels to get the final weights.
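The training and testing steps described above can be sketched in plain Python as follows. This is a rough sketch under stated assumptions, not the authors' code: the distance is assumed to be Euclidean, authors are assumed to be encoded as numeric labels, no bias term is included, and all function names and toy data are hypothetical. The rectangular RBF matrix `rb` is made square as `rb^T rb` and the resulting linear system is solved by Gauss-Jordan elimination. If that system is singular, the sketch adds a small ridge term as a simple stand-in; it is not a singular value decomposition.

```python
import math


def rbf_row(x, centers):
    """Hidden-layer outputs for one pattern: exp(-distance to each center)."""
    return [math.exp(-math.dist(x, c)) for c in centers]


def solve(a, b, eps=1e-12):
    """Solve a*x = b by Gauss-Jordan elimination; return None if singular."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < eps:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def train(patterns, targets, centers, ridge=1e-8):
    """Return final weights w satisfying (rb^T rb) w = rb^T targets."""
    rb = [rbf_row(x, centers) for x in patterns]  # patterns x centers
    k = len(centers)
    g = [[sum(r[i] * r[j] for r in rb) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * t for r, t in zip(rb, targets)) for i in range(k)]
    weights = solve(g, rhs)
    if weights is None:
        # Assumption: ridge regularisation replaces the singular case here.
        for i in range(k):
            g[i][i] += ridge
        weights = solve(g, rhs)
    return weights


def predict(x, centers, weights):
    """Network output for one pattern."""
    return sum(p * w for p, w in zip(rbf_row(x, centers), weights))


def identify(x, centers, weights, labels):
    """Pick the author label closest to the network output."""
    y = predict(x, centers, weights)
    return min(labels, key=lambda t: abs(t - y))


# Toy word-frequency patterns, one per author, with numeric labels.
patterns = [[5.0, 0.0, 1.0], [0.0, 4.0, 2.0], [1.0, 1.0, 6.0]]
targets = [1.0, 2.0, 3.0]
centers = patterns  # centers are chosen from the training patterns
weights = train(patterns, targets, centers)
print(identify([4.5, 0.5, 1.0], centers, weights, targets))
```

The toy example uses three centers for brevity; the text above mentions 10 hidden nodes and one hundred centers, and either would fit the same functions.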
Fig. 2 Radial basis function flow chart. Box labels recovered from the figure: Read input pattern; Create centers; Create RBF, rb; G = rb^T rb; D = det(G); Is D == 0? (Yes / No); Find SVD(D), G = U*W*V^T; B = Inv(G); E = B*G; Find weights, F = E*Target.

III. EXPERIMENTAL PROCEDURE

The Enron email dataset has been used for evaluating the efficiency of RBF in email authorship identification. This email dataset was made public by the Federal Energy Regulatory Commission during its investigation. It contains all kinds of emails, personal and official. William Cohen from CMU has put up the dataset on the web for researchers. It contains around 517,431 emails from 151 users. Each mail in the folders contains the sender's and the receiver's email addresses, date and time, subject, body, text and some other email-specific technical details. It is available in the form of a MySQL database with a size of 400 MB.

The Enron database contains four tables. The first table contains information on each of the 151 employees. The second table contains the information of the email message: the sender, subject, text and other information. The third table contains the recipient’s information. The fourth table records whether a message is a forward or a reply. Table 4 presents the names of a few folders under each author. Only 146 authors have been considered for study.