Authorship Analysis and Identification

Saurav Bose
2009042 IIIT-Delhi

Saurabh Yadav
2010077 IIIT-Delhi

ABSTRACT
With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. This report provides an overview of authorship analysis and the process of authorship identification of online messages. It explains in detail the different types of writing-style features that are extracted to build feature-based classification models for identifying the authorship of online messages. It reviews the efficiency and robustness of these models in a multilingual context (English and Arabic). It also discusses the limitations that exist in performing authorship analysis of Arabic messages and how these limitations can be overcome.
Keywords
Authorship analysis, authorship identification, online messages, writing-style features.

1. INTRODUCTION
With the advent of the Internet, it has become easier to share information across time and space. This brings both advantages and disadvantages, the latter including a new venue for criminal activities, collectively known as cyber crimes. Examples include the distribution of illegal content in cyberspace, such as pornography and pirated software, as well as the promotion of terrorism and hatred. Of late, cyber criminals have been extensively involved in distributing such illegal content and hate speech via Web-based channels such as websites, newsgroups, and forums. The anonymity of the Internet gives them an upper hand in these activities: participating online is easy because people usually do not have to provide real identity information such as name, address, or gender. As a result, law enforcement agencies face complex challenges in criminal identity tracing. Compounding the problem, the sheer number of cyber users and activities makes a manual approach to criminal identity tracing inadequate for cybercrime investigation. What is needed is automated criminal identity tracing in cyberspace, which would allow investigators to prioritize their tasks and focus on the major criminals. Authorship analysis can assist by automatically extracting linguistic features from online messages and evaluating stylistic details for patterns of terrorist communication. However, related work on authorship analysis techniques has mostly remained on paper, with little real-world implementation, particularly for online communication. Furthermore, the global nature of terrorist activity has made it necessary to analyze multilingual content. Arabic has garnered specific attention in recent years for sociopolitical reasons that include possible ties between certain Middle Eastern groups and terrorism. Arabic has morphological characteristics that pose several critical problems for current authorship analysis techniques.

2. LITERATURE REVIEW

2.1 Authorship Analysis
Authorship analysis is the process of examining the characteristics of a piece of work in order to draw conclusions about its authorship [2]. The problem can be broken down into three sub-fields. Author Identification determines the likelihood of a particular author having written a piece of work by examining other works produced by that author. Author Characterization summarizes the characteristics of an author and generates an author profile based on his or her work. Some of these characteristics include gender, educational and cultural background, and language familiarity. Similarity Detection compares multiple pieces of work and determines whether or not they were produced by a single author, without actually identifying the author.

2.2 Feature Selection
Primarily, four types of writing-style features facilitate authorship attribution: syntactic, lexical, structural, and content-specific.

Syntactic features refer to the patterns used to form sentences. They consist of the tools used to structure sentences, such as punctuation and function words. Example function words are "while" and "upon". Usage patterns of function words can be effective features for authorship identification [1]. For example, the difference between using the word "thus" or "hence" might seem subtle, but it can constitute a significant stylistic difference.

Lexical features can be either word-based or character-based. Word-based lexical features include characteristics such as the total number of words, words per sentence, and vocabulary richness. Vocabulary richness measures include the number of words that occur once (hapax legomena) and twice (hapax dislegomena), as well as several statistical measures defined by previous studies.
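As a rough illustration of the word-based vocabulary richness counts described above, the following Python sketch tallies hapax legomena and hapax dislegomena for a single message. The function name `vocabulary_richness`, the sample sentence, and the naive English-only tokenizer are assumptions made for illustration; they are not taken from the cited studies.

```python
import re
from collections import Counter


def vocabulary_richness(text: str) -> dict:
    """Count a few word-based lexical features for one message."""
    # Naive tokenizer (an assumption): lowercase runs of ASCII letters
    # and apostrophes. A real extractor would be language-aware.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # How many distinct words occur exactly n times.
    frequency_of_frequencies = Counter(counts.values())
    return {
        "total_words": len(words),
        "distinct_words": len(counts),
        "hapax_legomena": frequency_of_frequencies[1],
        "hapax_dislegomena": frequency_of_frequencies[2],
    }


sample = "the cat saw the dog and the dog saw a bird"
features = vocabulary_richness(sample)
print(features)
```

Counts like these would normally be normalized by message length before being used as classifier inputs, since longer messages naturally contain more once-occurring words.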

Character-based lexical features include the total number of characters, characters per sentence, the usage frequency of individual letters, etc.

Structural features deal mainly with the text's organization and layout and have proved to be important in analyzing online messages. Examples include greetings, signatures, number of paragraphs used, average paragraph length, etc. Researchers traditionally focused on word structures such as greetings and signatures or on the number of paragraphs and average paragraph length. These features are only good for providing discriminators but not for capturing additional information present in online messages. For example, the use of various font sizes and colors requires a conscientious effort, making it a style marker.

Content-specific features are words which are important within a specific topic domain. For example, in a discussion on computers, words like RAM, laptop, etc. would be the ones heard the most.

2.3 Techniques for Authorship Analysis
In early studies, most analytical methods used in authorship analysis were statistical methods. The basic idea is that different authors have different text compositions, which are characterized by a probability distribution of word usage. More specifically, given a population of an author's texts, the identification of a new text can be considered as a statistical hypothesis test or a classification problem. Brainerd used Chi-squared and related distributions to perform lexical data analysis. Francis gave a summary of early statistical approaches used to resolve the Federalist Papers dispute. An important statistical test was introduced in Thisted and Efron's paper. Farringdon first applied the CUSUM technique in authorship analysis. Baayen proposed a linguistic evaluation of diverse statistical models of word frequency.

Although statistical methods achieved much success in authorship analysis, there are some constraints for particular methods. For example, Holmes found that the CUSUM analysis was unreliable because the stability of those characteristics over multiple texts is not warranted. Also, the prediction capability of statistical methods, such as attributing a new text to a certain author, is limited.

The drastic increase in computational power over the years has caused Machine Learning techniques to emerge. These techniques include Support Vector Machines (SVMs), Neural Networks and Decision Trees. They provide greater scalability than statistical techniques for handling more features, and they are less susceptible to noisy data. These benefits are important for working with online messages, which involve classification of many authors and a large feature set. As a result, they have gained wider acceptance in authorship analysis studies in recent years.

3. ARABIC CHARACTERISTICS
When it comes to the stylistic and structural properties of a language, Arabic poses some challenges. It is a Semitic language belonging to the Afro-Asian group. Inflection, Diacritics, Word Length and Elongation are some characteristics which need to be taken care of while applying authorship analysis to Arabic messages.

3.1 Inflection
Arabic consists of approximately 5000 roots which are used to form words and sentences. These roots are themselves composed of 3-5 consonants, thus making it a highly inflected language. The orthographical and morphological properties of Arabic result in significant lexical variation [1]. Inflection creates feature extraction problems owing to the larger number of possible words, because words can take on numerous forms, which weakens vocabulary richness measures.

3.2 Diacritics
Diacritics are the markings above or below the letters used to indicate special phonetic values [1]. In English, for example, a diacritic is the little mark on top of the letter e in the word résumé. Diacritics are used in Arabic to represent short vowels, consonant lengths, and relationships between words [1]. However, diacritics are rarely used in online communication. Although readers can use the sentence semantics to decipher the proper meaning, this isn't feasible for an automated extraction program. For example, without diacritics the words resume and résumé would look identical to a computer. The lack of diacritics can significantly impact the effectiveness of word-based features such as function words. In Arabic, for instance, it's impossible without diacritics to distinguish between the words "who" and "from".

3.3 Word Length and Elongation
Arabic words are shorter in length, thus reducing the effectiveness of many lexical features in identifying authorship. In English, the usage of long complex words in a sentence shows how well versed a person is in his or her language. But since Arabic words are almost of the same length, be they easy or complex words, this assumption does not hold true. Moreover, word-length features are less discriminating because they are distributed over a smaller range.

Elongation presents a further complication, as at times Arabic words are elongated for purely stylistic reasons. Arabic characters are combined during writing, so elongation is possible by lengthening the joins between letters, using a special character that resembles a dash [1]. For example, the word MZKR (remind) is extended with four short dashes between the M and the Z, doubling the word size. Elongation has its own pros and cons, the pro being that it provides an important style marker and the con being that it inflates the values of word length features significantly.

4. EXPERIMENT DESIGN
The test bed of relevant messages is taken from Web forums. In the case of Arabic messages, the data set was taken from the Al-Aqsa Martyrs group, while for English messages it was the White Knights group.

The White Knights content revolved around political, racial, and religious issues. Members commonly used profanities and advocated the use of violence against groups they disliked. The Al-Aqsa Martyrs group messages were mostly anti-America messages featuring lengthy arguments espousing the group's views. Often, the messages contained abundant embedded images and links relating to the war in Iraq and the treatment of Al-Qaeda prisoners. There were 400 messages pertaining to each language. The average message length for the English data set was 76.6 words, and the average length for the Arabic data set was 580.69 words.

5. IDENTIFICATION PROCESS
Collection, Extraction and Experimentation: these are the 3 main steps involved in the complete online authorship identification process.

5.1 Collection and Extraction
The web forums of interest are identified with the help of spidering programs, which crawl through the Internet, searching for potentially dangerous or abusive contents that might relate to cybercrime and homeland security issues. Once identified, the collection programs take over the job of storing such messages in text and HTML format. The extraction programs further facilitate the process by deriving the writing style characteristics such as lexical, syntactic, structural and content-specific features. The extraction programs have a complexity varying over different languages due to the presence of special features in every language (Word Length, Elongation and Inflection in the case of Arabic).

Both English and Arabic feature sets were formed, consisting of 301 and 418 features respectively. Out of 301 English features, 87 were lexical, 158 syntactic, 45 structural, and 11 content-specific features. In the case of Arabic messages, they were distributed as 79 lexical, 262 syntactic, 62 structural, and 15 content-specific features.

In order to come up with the Arabic feature set, the language's morphological and orthographical properties were taken into consideration. The absence of a feasible semantic tagger restricted tracking of Diacritics. In order to capture word length precisely, a filter was embedded in the Arabic feature extractor which helped in removing elongation after it had been tracked. The issue of Inflection was handled using usage frequencies for a selected set of word roots, as other multilingual authorship studies have done. This compensated for the losses in vocabulary richness measures.

Tracking of the root frequencies was done by a clustering algorithm designed by De Roeck and Al-Fares. Their algorithm calculated and assigned similarity scores for each word against a collection of roots. The extraction of root frequencies was done by calculating similarity scores for each word against a dictionary containing more than 4,500 roots. We assigned words to the root with the highest similarity score and incremented the selected root's usage frequency.

An important issue was to determine the number of roots to include in the final feature set. We used a trial-and-error approach, because previous research hasn't yielded more definitive techniques. To determine the number of roots to include, we added between 0 and 500 of the most frequently occurring roots to the complete Arabic feature set. We tested the classification power of these roots with SVM and integrated the optimal number (50 roots) into the feature set.

5.2 Experiments
After extracting the feature values, the next step is to form feature sets. These sets are formed in a step-wise manner. Lexical and syntactic features happen to be the most important categories and hence form the foundation for structural and content-specific features. For example, the first set consisted of lexical features, and the second encompassed lexical and syntactic features. The third feature set consisted of lexical, syntactic and structural features, while the fourth set consisted of all the features, namely lexical, syntactic, structural and content-specific features. Such a stepwise increment of features helps to identify the relevance of each writing style characteristic in authorship analysis.

There were two classifier techniques put into use in the experiment: C4.5 and SVM. Both classifiers were used one at a time. C4.5 is a powerful decision-tree-based classifier and shows great analytical and explanatory potential in effectively assessing key differences between the English and Arabic feature sets. On the other hand, SVM is a computational learning method based on structural risk minimization [1]. It has gained popularity over the years due to its massive classification power and robustness. SVM readily handles many input values owing to its capacity for dealing with noisy data.

For the experiment concerning English and Arabic messages, 30 randomly selected samples of five authors were selected. Each sample of five authors was evaluated using all 20 messages per author [1]. A 30-fold cross-validation testing method is used in all experiments.

Accuracy, recall and precision measures are used to evaluate the prediction performance. The accuracy is a measure which indicates the overall prediction performance of a particular classifier [2].

\[ \text{Accuracy} = \frac{\text{Number of messages with correctly identified author}}{\text{Total number of messages}} \]

For a particular author, precision and recall measures are used to measure the effectiveness of the approach for identifying messages that were written by that author. The precision and recall are defined as [2]:

\[ \text{Precision} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages assigned to the author}} \]

\[ \text{Recall} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages written by the author}} \]

These measures have been commonly adopted in the information retrieval and authorship analysis literature.
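The accuracy, precision and recall measures defined above can be computed directly from the true and predicted author labels. The sketch below is a minimal Python illustration; the function names and the toy labels are hypothetical and are not the experimental data from [1] or [2].

```python
def accuracy(true_authors, predicted_authors):
    """Fraction of messages whose author was identified correctly."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)


def precision_recall(true_authors, predicted_authors, author):
    """Per-author precision and recall, following the definitions in the text."""
    pairs = list(zip(true_authors, predicted_authors))
    # Messages written by `author` and also assigned to `author`.
    correctly_assigned = sum(t == author and p == author for t, p in pairs)
    assigned = sum(p == author for _, p in pairs)  # assigned to the author
    written = sum(t == author for t, _ in pairs)   # written by the author
    precision = correctly_assigned / assigned if assigned else 0.0
    recall = correctly_assigned / written if written else 0.0
    return precision, recall


# Toy labels for six messages by three hypothetical authors.
true_authors = ["A", "A", "A", "B", "B", "C"]
predicted_authors = ["A", "A", "B", "B", "C", "C"]

overall = accuracy(true_authors, predicted_authors)
precision_a, recall_a = precision_recall(true_authors, predicted_authors, "A")
print(overall, precision_a, recall_a)
```

In a cross-validated setting these measures would be computed on each held-out fold and then averaged; the zero fallback for an author with no assigned or no written messages is a design choice of this sketch.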

6. ANALYSIS AND RESULTS

6.1 Feature Type Comparison
The authorship prediction performance varied significantly with different combinations of metrics. Pair-wise t-test results indicated that:

Using style markers and structural features outperformed using style markers only [2]: We achieved significantly higher accuracies for all three datasets (p-values were all below 0.05) by adopting the structural features. There was a significant improvement in performance when structural features were added on top of style markers. The results might be explained by the fact that an author's consistent writing patterns show up in the message's structural features.

Using style markers, structural features, and content-specific features did not outperform using style markers and structural features [2]: The results indicated that using content-specific features as additional features did not improve the authorship prediction performance significantly. The weaker performance of content-specific features could be attributable to their smaller representation in the feature set, as there were 11 and 15 content-specific features for English and Arabic messages respectively. This number is far smaller than all other feature categories.

Overall, the impact of the different feature types for Arabic was consistent with the results we obtained on English messages.

6.2 Classification Technique Comparison
The SVM technique significantly outperformed the decision tree classifier in terms of accuracy. It also attained a better performance on the grounds of recall and precision. The difference in accuracy between classifiers across Arabic messages was far greater than it was in English messages: SVM outperformed C4.5 by more than 20 percent on all feature set combinations. The SVM classifier outperformed the other classifiers on all occasions. The results were consistent with previous studies, in that neural networks and SVM typically had better performance than decision tree algorithms.

6.3 Decision Tree Analysis
The C4.5 decision tree is an effective analytical tool because of its descriptive nature. All the results were obtained in percentage units signifying the importance that a particular feature set played in the analysis. For example, Word Length had a major role to play in White Knights messages (40 percent) as compared to the Al-Aqsa Martyr group messages (20 percent) [1]. Elongation features and nearly half the word roots were some of the important attributes that need to be considered for future analyses. The importance of punctuation, function words, and word-based structural features was fairly consistent across both languages, suggesting that syntactical and structural characteristics are fairly robust feature categories across languages.

The Al-Aqsa messages tended to be considerably longer than the KKK messages. In addition to overall length, sentence lengths of Al-Aqsa Martyr messages were longer, too. Also, the messages extracted from the Al-Aqsa group had an overdose of various font sizes, color, hyperlinks and embedded images, leading to a greater disparity in terms of feature importance in the technical structure category. Al-Aqsa messages used a plethora of font colors and sizes, often as tools to emphasize a certain point. Red, blue, and navy were almost as common as black. This was in sharp contrast to the KKK messages, where fonts featuring black, 10-to-12-point type were a fixture, with the exception of the occasional deviation to green or blue.

7. CONCLUSION AND FUTURE WORK
The experiments have demonstrated that with a set of carefully selected features and an effective learning algorithm, the authors of Internet newsgroup and email messages can be identified with a reasonably high accuracy. The experimental results have shown a promising future for applying the automatic authorship analysis approaches in cybercrime investigation to address the identity-tracing problem. Using such techniques, investigators would be able to identify major cyber criminals who post illegal messages on the Internet, even though they may use different identities.

We can pursue several potential future directions. The current authorship identification methodologies are limited in the number of authors we can apply them to. They require significant upward scalability to help discriminate between hundreds of potential authors. The development of more complex methodologies for differentiating between a larger set of authors is an important future endeavor. This study can be expanded in the future to include more authors and messages to further demonstrate the scalability and feasibility of the SVM classifier, as well as to find any other feasible classifier for the purpose. We also plan a more comprehensive analysis of English and Arabic extremist group authorship tendencies to distinguish group-level differences from linguistic disparities inherent between English and Arabic. For example, do the "persuasive" tendencies observed regarding the Al-Aqsa Martyr messages have broader applicability to other extremist Arabic groups? The current approach can also be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speeches, and child-pornography images [2]. Another more challenging future direction is to automatically generate an optimal feature set which is specifically suitable for a given dataset.

REFERENCES
[1] Abbasi, Chen 2005. Applying Authorship Analysis to Extremist Group Web Forum Messages. Department of Management Information Systems, University of Arizona.
[2] Zheng, Huang, Chen. Authorship Analysis in Cybercrime Investigation. University of Arizona.
[3] Zheng, Chen, Huang 2006. A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques.
[4] Abbasi, Chen 2006. Visualizing Authorship for Identification. Department of Management Information Systems, The University of Arizona.
