
Mark Mooij, 0146196
Master Thesis in Artificial Intelligence, Web Information Retrieval
08/22/12

Using improved generalizable features in domain adaptation for cross-domain sentiment classification

Supervisor: Wouter Weerkamp, ISLA, University of Amsterdam, w.weerkamp@uva.nl

Mark Mooij
mark@ai-applied.nl
Intelligent Systems Lab Amsterdam
University of Amsterdam
Science Park 904, 1098 XM Amsterdam
P.O. Box 94323, 1090 GH Amsterdam
The Netherlands

Abstract

Sentiment classification does not aim to identify the topic or factual content of a document, but aims to classify the sentiments expressed in a text. The general sentiment expressed by the author towards a subject is identified by analyzing text-features within the document. Traditional machine learning techniques have been applied with reasonable success, but a severe decline in classification accuracy of such techniques has been shown when there is no proper match between the training and test-data with respect to the topic of the document (domain). Traditional approaches have favored domain-specific supervised machine learning techniques such as Naive Bayes, Maximum Entropy and Support Vector Machines. When building models to classify the sentiment of texts using machine learning techniques, extensive sets of labeled training-data from the same domain as the intended target text are often required. Gathering and labeling new datasets whenever a model is needed for a new domain is often time-consuming and difficult, especially if a dataset with labels is not available. This thesis explores the difficulties experienced with currently favored techniques with regard to cross-domain adaptation, and proposes a solution to this problem by offering a model with which high sentiment classification accuracy can be achieved without the aid of labeled datasets for the target domain. The problem of transferring domain-specific knowledge is addressed by making use of an adapted form of the Naive Bayes classifier introduced by Tan et al. (2009). The adapted form of the Naive Bayes classifier uses generalizable features from both the source and target domain. Generalizable features are features which are used consistently across domains and can thus serve as a bridge between the domains.
The generalizable features are explored in three variations: using frequently co-occurring entropy (FCE), using a subjectivity lexicon as generalizable features, and using an importance measure in combination with FCE to select improved generalizable features. Based on the selected generalizable features, a domain transfer can be made between source and target domain using the EM algorithm. EM is used to estimate the Naive Bayes probabilities in order to maximize a model with which unsupervised, cross-domain sentiment classification can be achieved.


Acknowledgements
First of all I would like to thank my main thesis supervisor Wouter Weerkamp for his support and comments; after each round of comments I really felt this thesis was moving further in the right direction. Next, I would like to thank all other people from the University who helped me along the way: Maarten de Rijke for helping me state my research questions, and Manos Tsagkias and Valentin Jijkoun for pointing me in the right direction to get started. During my thesis I had the constant support of my fellow student and long-time friend Bruno Jakic, whose creative mind always sheds light on newly arising problems. Next, my girlfriend for helping me with structure and grammar and being of constant support. Last but not least I would like to thank my father, who supported me not only during my thesis but during my entire academic career. Finally, this thesis has been a long process and a journey in the sense that I have been working on it in numerous locations around the world, which this thesis will always remind me of. Special thanks go out to my diving friends with whom I visited Egypt and who were of constant dubious support when I tried to work in between dives on board the MY Rosetta.


Table of Contents
Chapter 1 Introduction .......................................... 6
    Research Questions .......................................... 8
Chapter 2 Background ........................................... 10
    2.1 Identifying sentiment .................................. 10
    2.2 Sentiment analysis ..................................... 10
    2.3 Sentiment classification ............................... 11
    2.3.1 Polarity ............................................. 11
    2.3.2 Features ............................................. 12
    2.3.3 Domain dependence .................................... 13
    2.4 Approaches ............................................. 13
    2.4.1 Subjectivity lexicon based methods ................... 13
    2.4.2 Machine learning techniques .......................... 14
    2.5 Cross-Domain Sentiment Classification .................. 14
    2.6 This research .......................................... 15
Chapter 3 Models ............................................... 17
    3.1 Generalizable features selection ....................... 17
    3.1.1 Frequently Co-occurring Entropy ...................... 17
    3.1.2 Using a subjectivity lexicon as generalizable features 19
    3.1.3 Using an importance measure in combination with FCE to select generalizable features 21
    3.2.1 Naive Bayes classifier ............................... 22
    3.2.2 Adapted Naive Bayes Classifier ....................... 24
    3.2.3 Features ............................................. 26
    3.3 Lexicon-based classifier ............................... 26
Chapter 4 Experimental Setup ................................... 28
    4.1.1 Datasets ............................................. 28
    4.1.2 Movie review dataset ................................. 29
    4.1.3 Movie-scale-review dataset ........................... 30
    4.1.4 Review-plus-rating dataset ........................... 30
    4.1.5 Multi-domain sentiment dataset ....................... 31
    4.1.6 Twitter-sentiment-dataset ............................ 33
    4.1.7 Restaurant reviews dataset ........................... 33
    4.2 Data Processing ........................................ 34
    4.3 Experiments ............................................ 36
    4.3.1 Baseline in-domain experiment ........................ 37
    4.3.2 Baseline cross-domain experiment ..................... 38
    4.3.3 Cross-domain experiment using different generalizable features 38
Chapter 5 Results .............................................. 40
    5.1 In-domain baseline experiments ......................... 40
    5.2 Cross-domain baseline experiments ...................... 42
    5.3.1 Adapted Naive Bayes classifier using FCE ............. 44
    5.3.2 Adapted Naive Bayes classifier using subjectivity lexicon 46
    5.3.3 Adapted Naive Bayes classifier using improved generalizable features 48
Chapter 6 Results analysis ..................................... 51
    6.1 In-domain baseline experiments ......................... 51
    6.1.2 Naive Bayes Classifier ............................... 51
    6.1.3 Subjectivity lexicon classifier ...................... 57
    6.2 Cross-domain baseline experiments ...................... 58
    6.2.1 Naive Bayes classifier ............................... 58
    6.2.2 Subjectivity lexicon classifier ...................... 61
    6.3.1 Adapted Naive Bayes using FCE ........................ 62
    6.3.2 Adapted Naive Bayes using subjectivity lexicon ....... 69
    6.3.3 Adapted Naive Bayes using improved generalizable features 74
Chapter 7 Conclusions and future work .......................... 83
References ..................................................... 85


Chapter 1 Introduction
In recent years, increasingly large amounts of information have become available in the form of online documents. Organizing this information from a user perspective is an ongoing challenge. A lot of research has focused on the problem of automatic categorization of texts by topic or subject matter (e.g. sports or politics). However, recent years have shown a rapid growth in the general interest in online discussion groups and review sites. A popular example of a review site is Amazon.com, in which categorized products are presented, combined with user-generated articles reviewing these products.

A crucial characteristic of this kind of user-generated information is the article's sentiment, or overall opinion towards the subject matter; for example, whether a product review is generally positive or negative. User-generated sentiment expressions can also be found in settings such as personal blogs, political debate, Twitter messages, and newsgroup message boards. This is part of an ongoing trend in which people increasingly participate in online sentiment expression. This trend has led to many interesting and challenging research topics such as subjectivity detection, semantic orientation classification and review classification.

Sentiment analysis aims to address complex issues such as the identification of the object of the sentiment, the detection of mixed and overlapping sentiments in a text, and the identification and interpretation of sarcasm. In practice, however, most research has been concerned with classifying the overall polarity of a document. In other words: is the writer expressing a positive or negative opinion? This task is called sentiment classification and has proved to be complicated and challenging enough. Therefore, the primary focus of this thesis lies within the field of general sentiment classification: identifying the overall polarity of sentiment in different kinds of documents.
Sentiment classification can be a useful application for message filtering; for example, it might be useful to use sentiment information to recognize and discard “flames” (Spertus, 1997). Sentiment classification also offers valuable applications for business intelligence and recommender systems (Terveen et al., 1997; Tatemura, 2000). In these applications texts like customer reviews, customer feedback, survey responses, newsgroup postings, weblog pages and message boards are automatically processed in order to extract summarized (sentiment) information. Marketers can use this information to analyze customer satisfaction or public opinion of a company and/or products, making it possible to perform in-depth analysis of market trends and consumer opinions (Dave et al., 2003). Based on these insights products can be adapted, user feedback can be studied, and marketing strategies can be optimized. Organizations can also use this information to gather critical feedback about problems in newly released products or to benchmark their provided services. For this purpose, accurate systems for sentiment classification are needed that are able to interpret free-form texts given in natural language format. Human processing of such text volumes is expensive. The only realistically feasible manner of processing such amounts of data is automated clustering, summarization and classification of this data based on the expressed sentiment. Sentiment classification does not only offer useful applications for manufacturers and marketers; consumers can use sentiment classification to research product quality or services before making an

informed purchase. Individuals might want to consider other people's opinions when purchasing a product or deciding to use a particular service. Algorithms which can automatically classify other people's opinions as positive or negative would make it possible to develop new search tools (Hearst, 1992) which would facilitate summarized comparison. These search tools would enable customers to browse through clustered search results according to the expressed sentiment regarding a specific topic. Apart from the aforementioned product reviews, such a search tool might provide useful information with regard to sentiment classification on broader subjects, such as providing detailed travel destination statistics.

The rapid growth of the internet as a medium allows people to express opinions on any given subject. Classifying the sentiment of these documents has proved to be a challenge, as sentiment can be expressed in very different ways for different subjects (domains) (Engström, 2004). Turney (2002) noted that the expression “unpredictable” might have a positive sentiment in a movie review (e.g. “The movie has an unpredictable plot.”), but could be negative in the review of an automobile (e.g. “The car's performance suffers from unpredictable steering.”). In this example “unpredictable” is a very domain-specific expression, denoting different sentiment for different domains. Alternatively, “great” could have been used to express positive sentiment across the different domains, making “great” a domain-independent expression. To solve domain dependency, different approaches have been attempted. Rule-based approaches cannot adapt automatically to domain characteristics; specific rules would have to be provided for each domain. Lexicon-based approaches depend on lexical resources, for which the polarity of words can differ depending on the context and domain they are used in.
Traditional approaches favor domain-specific supervised machine learning techniques. However, previous work has revealed that classifiers trained using machine learning techniques lose accuracy when the test-data distribution is significantly different from the training-data distribution (Aue and Gamon, 2005; Whitehead and Yaeger, 2009). A difference in distribution occurs when there is no proper match between the training and test-data with respect to the topic of the document. For example, a classifier trained on movie reviews is unlikely to perform with similar accuracy on reviews of automobiles. Therefore, building models to classify the sentiment of texts using machine learning techniques requires extensive sets of labeled training-data from the same domain as the intended target text. Labeled training-data would be an example document (e.g. a review about a movie) combined with the overall expressed sentiment in the form of a label. Gathering and labeling new datasets whenever a model is needed for a new domain is often time-consuming and difficult, especially if a dataset with labels is not available.

The obvious solution to perform sentiment analysis within a new domain is to collect data, annotate the data and train a domain-specific classifier. Acquiring sufficient amounts of data in itself can be a difficult task. The data for the domain at hand might be limited (e.g. consider conducting sentiment research in the domain of tennis rackets). When enough data is available, the extraction of this data might be another challenging task. The data might be available from different data sources, and in different formats and layouts (e.g. a number of internet forums discussing the pros and cons of tennis rackets). Even extracting all comments specific to the reviewed product feature within a consistent format might pose a challenging task in itself.
Annotating the data poses another challenge: in order to create high-quality annotations, at least two annotators have to analyze considerable amounts of data to develop a training set suitable for a classifier. For certain more specific domains it might be hard to find annotators who have enough expertise to create proper annotations. The context-dependent nature of sentiment makes the classification task rather difficult, even for humans. The level of annotator agreement for labeled training-data can be quite low (Ghosh and Koch; Rothfels and Tibshirani, 2010). This would imply that, with widely varying domains, researchers and engineers who build sentiment classification systems need to collect and label training-data for each new domain

of interest. Even in the specific sub-problem of market analysis, if automated sentiment classification were to be used across a wide range of domains, the effort to label training-data for each domain may become prohibitive, especially since product features can change over time. The ability to analyze sentiments across different domains and to provide solutions for domain-independent sentiment classification has become crucial. Ideally, one would like to be able to quickly and cheaply customize a model to provide accurate sentiment classification for any domain. Due to these complicating factors, the amount of manually labeled training-data available for machine learning is limited to specifically studied domains. It would be quite useful if the knowledge gained from one set of training-data could be transferred for use in another domain by performing a domain adaptation. Therefore, this thesis addresses the problem of transferring domain-specific knowledge. This problem is addressed by using an adapted form of the Naive Bayes classifier to maximize a model which allows for unsupervised sentiment classification without the need for labeled data for the target domain.

Research Questions

This thesis repeats a strategy proven in previous work (Tan et al., 2009) and suggests two new strategies that are able to maximize a Naive Bayes model capable of providing accurate cross-domain sentiment classifications. This approach removes the need for labeled training-data for a target domain, using only the labeled training-data from the source domain and the data distribution of the target domain. The strategy repeated in this thesis uses frequently co-occurring entropy (FCE) to select generalizable features over source and target domain in order to perform a domain adaptation.
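As an illustration of the FCE idea, the sketch below scores a word by how frequent it is in both domains and how similar its occurrence probability is across them; words that are frequent and evenly distributed rank highest and can act as bridges. The smoothing constant and the document-level probability estimates here are illustrative assumptions, not the exact formulation, which is given in Chapter 3.

```python
import math
from collections import Counter

def fce_scores(source_docs, target_docs, alpha=1e-4):
    """Rank candidate generalizable features: a word scores high when it is
    frequent in BOTH domains and occurs with a similar probability in each."""
    def doc_freqs(docs):
        # number of documents containing each word
        return Counter(w for doc in docs for w in set(doc)), len(docs)

    src_counts, n_src = doc_freqs(source_docs)
    tgt_counts, n_tgt = doc_freqs(target_docs)
    scores = {}
    for w in set(src_counts) | set(tgt_counts):
        p_s = (src_counts[w] + alpha) / (n_src + 2 * alpha)  # smoothed doc. probability
        p_t = (tgt_counts[w] + alpha) / (n_tgt + 2 * alpha)
        # numerator rewards frequency in both domains,
        # denominator penalizes diverging probabilities
        scores[w] = math.log(p_s * p_t / (abs(p_s - p_t) + alpha))
    return scores

# toy tokenized reviews: movies (source) vs. cars (target)
src = [["great", "plot", "unpredictable"], ["great", "acting"]]
tgt = [["great", "mileage"], ["unpredictable", "steering", "great"]]
scores = fce_scores(src, tgt)
ranked = sorted(scores, key=scores.get, reverse=True)
# "great" is frequent and evenly spread across both domains, so it ranks first
```

Note how the domain-specific word “plot” receives a low score: it is frequent in the source domain but nearly absent in the target domain, so the probability gap dominates.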
The experiments in this thesis will show that repeating the strategy from the work by Tan et al. (2009), which was performed on Chinese datasets, on English datasets does not result in the expected generalizable features. The method of selecting generalizable features for English datasets is therefore open for improvement, and two improvements on selecting generalizable features are suggested in this work. The first improvement is the use of a subjectivity lexicon instead of frequently co-occurring entropy to select generalizable features. The second improvement utilizes an importance measure in combination with frequently co-occurring entropy in order to improve the selection of generalizable features. Based on the generalizable features, an adapted Naive Bayes model is maximized using the EM algorithm. As a result, a maximized Naive Bayes model is estimated that is capable of unsupervised, cross-domain sentiment classification. The research in this thesis aims to answer the following questions:

1. Does maximizing a Naive Bayes model with EM, using FCE to select generalizable features, improve cross-domain sentiment classification in comparison to baseline cross-domain methods?

2. Does maximizing a Naive Bayes model with EM, using a subjectivity lexicon as generalizable features, improve cross-domain sentiment classification in comparison to baseline methods and to using FCE to select generalizable features?

3. Does maximizing a Naive Bayes model with EM, using a combination of FCE and an importance measure to select generalizable features, improve cross-domain sentiment classification in comparison to baseline methods, to using FCE to select generalizable features, and to using a subjectivity lexicon as generalizable features?

To evaluate the performance of the different approaches suggested in this thesis, a wide variety of datasets is collected. The datasets originate from different domains and vary in size and balance (the number of positive examples versus the number of negative examples available). Analyzing the different conducted experiments will provide insight into the effects of size and balance on the Naive Bayes classifier. In order to answer the research questions, three main experiments will be set up. The first experiment will evaluate the performance of the standard Naive Bayes classifier for in-domain sentiment classification using a wide variety of datasets. This experiment will establish the performance of in-domain training and classification as a baseline to compare cross-domain performance against. Establishing in-domain performance is important to research whether it is possible to use the different gathered datasets consistently in terms of preprocessing, training and evaluating, while performance stays comparable to previous research. The second experiment will analyze naive cross-domain classification, in which an evaluation set from one domain is classified with classifiers trained on other domains, for a wide variety of datasets. This experiment will establish how much the performance of the Naive Bayes classifier decreases from in-domain to cross-domain. The third experiment will evaluate the three different approaches to maximizing a Naive Bayes model with EM, using the different approaches to select generalizable features, answering the research questions.
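The EM-based maximization referred to in the research questions can be sketched in generic form: train Naive Bayes on the labeled source data, then alternate between soft-labeling the unlabeled target data (E-step) and re-estimating the parameters from both domains (M-step). This is a minimal, generic NB-EM sketch under illustrative names; the adapted weighting of source versus target data used by Tan et al. (2009) is detailed in Chapter 3.

```python
import math
from collections import Counter

def nb_train(docs, posteriors, classes, vocab, smooth=1.0):
    """M-step: estimate multinomial Naive Bayes parameters from documents
    with per-class responsibilities (hard labels are 0/1 responsibilities)."""
    prior = {c: smooth for c in classes}
    word_counts = {c: Counter() for c in classes}
    for doc, post in zip(docs, posteriors):
        for c in classes:
            prior[c] += post[c]
            for w in doc:
                word_counts[c][w] += post[c]
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(word_counts[c].values()) + smooth * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + smooth) / denom) for w in vocab}
    return log_prior, log_like

def nb_posterior(doc, log_prior, log_like, classes):
    """E-step for one document: P(class | doc) under the current model."""
    scores = {c: log_prior[c] + sum(log_like[c].get(w, 0.0) for w in doc) for c in classes}
    m = max(scores.values())
    exps = {c: math.exp(scores[c] - m) for c in classes}
    z = sum(exps.values())
    return {c: exps[c] / z for c in classes}

def em_adapt(source, target_docs, classes, iterations=5):
    """source: list of (tokenized_doc, label); target_docs: unlabeled docs."""
    vocab = {w for d, _ in source for w in d} | {w for d in target_docs for w in d}
    src_docs = [d for d, _ in source]
    src_post = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in source]
    params = nb_train(src_docs, src_post, classes, vocab)
    for _ in range(iterations):
        tgt_post = [nb_posterior(d, *params, classes) for d in target_docs]
        params = nb_train(src_docs + target_docs, src_post + tgt_post, classes, vocab)
    return params

# toy example: two labeled source reviews, two unlabeled target reviews
classes = ["pos", "neg"]
source = [(["good", "great"], "pos"), (["bad", "awful"], "neg")]
target = [["good", "great", "nice"], ["bad", "awful", "poor"]]
params = em_adapt(source, target, classes)
post = nb_posterior(["good", "nice"], *params, classes)
```

After EM, the target-only word “nice” has acquired probability mass under the positive class because it co-occurred with source-domain positive words, which is exactly the bridging effect the generalizable features are meant to enable.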
In the next chapter we will take a look at related work, providing an overview of the work done in sentiment classification. In the third chapter the Naive Bayes classifier is introduced and the adapted form of the Naive Bayes classifier suggested for this research is discussed. The fourth chapter will explain the data and experimental setup. In the fifth chapter the results of the three different experiments will be presented. The sixth chapter will evaluate the experiments and discuss the results. Finally, in the seventh chapter conclusions will be drawn from the findings and future work will be discussed.


Chapter 2 Background
2.1 Identifying sentiment

In everyday life people use different kinds of communication methods to convey different kinds of messages. Communication can be verbal or non-verbal, and verbal communication can be subdivided into oral communication (e.g. talking directly to each other or talking by telephone) and written communication (e.g. sending a letter or an e-mail). Communication has different functions, which can be divided into objective (facts, information) and subjective categories (opinions, feelings). The specific form of subjective written communication with which this thesis is concerned is the polarity expressed in subjective texts (like product reviews). An opinion, or sentiment about an object, is a subjective belief and is the result of an emotion and/or the interpretation of facts by an individual. An opinion may be supported by arguments (objective or subjective information), but often people draw opposed opinions from the same facts. An example of a sentence stating a sentiment is: “Peter loves using his iPad.” This sentence expresses Peter's sentiment towards using an iPad. Whether the statement is true or untrue is not at issue: no truth value is implied in the expressed sentiment. An overview of the definitions linked to the notion of sentiment is provided by Pang and Lee (2008). To capture the notion of sentiment in sentiment analysis research, sentiment is defined in terms of polarity combined with the target of a particular sentiment. The field of sentiment analysis, or opinion mining, aims to determine the attitude of an individual (a speaker or a writer) expressing an attitude with respect to a topic, or the overall contextual polarity of a source material.
An attitude can be the writer's judgment or evaluation, affective state (the emotional state of the author when writing), or the intended emotional communication (the emotional effect the author wishes to communicate, as in persuasive texts for example).

2.2 Sentiment analysis

The broad research field of sentiment analysis is also referred to as subjectivity analysis, opinion mining and appraisal extraction. It is one of the applications of natural language processing (NLP), computational linguistics and text analysis with the aim to identify and extract subjective information in source materials. In sentiment analysis subjective elements in source materials are detected and studied. Subjective elements are defined as “linguistic expressions of private states in context” (Wiebe et al., 2004). Private states can be represented as single words, word-phrases or sentences. The task of polarity detection aims to detect polarity in documents: does a single word, word-phrase or sentence express a positive, negative or neutral sentiment (expressing no sentiment at all)? Another sub-task of sentiment analysis is subjectivity detection: the task of identifying subjective words, expressions, and/or sentences as opposed to objective words, expressions, and/or sentences (Wiebe et al., 1999; Hatzivassiloglou and Wiebe, 2000; Riloff et al., 2003). In order to properly identify a complete text as

subjective or objective, syntactic analysis of each individual sentence needs to be performed. This analysis often focuses on adjectives and adverbs in sentences to determine the polarity of the sentences (Hatzivassiloglou and Wiebe, 2000; Wiebe et al., 2004). Detecting explicit subjective sentences such as “I think this is a beautiful MP3-player” is the primary focus of sentiment analysis. However, subjective statements can also be voiced in an implicit manner: “The earphone broke in two days” (Liu, 2006). Even though the last sentence does not explicitly verbalize sentiment, the implicit sentiment expressed in this sentence is a negative one. Detecting implicit sentiment proves to be a more challenging task. Apart from the aforementioned tasks, even more complex problems have been studied in other sentiment research. Following the initial research by Pang et al. (2002), in which a later widely studied dataset was created, the study of movie reviews has become a prominent domain in sentiment analysis. Consequently, sentiment analysis has extended to a wider variety of domains, ranging from stock message boards to congressional floor debates (Das and Chen, 2001; Thomas et al., 2006). Research results have been deployed industrially in systems that gauge market reaction and summarize opinions from web pages, discussion boards, and blogs. Bethard et al. (2004), Choi et al. (2005) and Kim and Hovy (2006) performed research on identifying the source of the opinions expressed in sentences using different techniques. Wilson et al. (2004) focused on the strength of expressed opinions, classifying strong and weak opinions. Chklovski (2006) presented a system that aggregates and quantifies degree assessments of opinions on web pages. Going one step further than the document level, Hu and Liu (2004) and Popescu and Etzioni (2005) concentrated on mining and summarizing reviews by extracting opinionated sentences specific to certain product features.
2.3 Sentiment classification

Sentiment classification is a sub-task of sentiment analysis that aims to determine the positive or negative sentiment of expressions (Hatzivassiloglou and McKeown, 1997; Turney, 2002; Esuli and Sebastiani, 2005). The classification of these features can be used to determine the overall contextual polarity of documents. Features can be defined as single words, word-phrases or sentences. The classification of the overall contextual sentiment is mostly applied to reviews, where automated systems assign a positive or negative sentiment to a whole review document (Pang et al., 2002; Turney, 2002). It is common in sentiment classification to use the output of one feature level as input to higher feature levels to allow for more complex context-related analysis. For example, sentiment classification can first be applied to words, after which the outcome of this analysis can be utilized to evaluate word-phrases, followed by the evaluation of sentences and so on (Turney and Littman, 2003; Dave et al., 2003). The sentiment of phrases and sentences specifically has been studied by Kim and Hovy (2004) and Wilson et al. (2005). Different techniques show optimal accuracy at different levels. At word or phrase level, n-gram classifiers and lexicon-based approaches are commonly used. At sentence level, Part-Of-Speech tagging (POS tagging) is a frequently used method.

2.3.1 Polarity

The classification of sentiment is often expressed in terms of polarity. In the field of sentiment analysis different approaches can make use of different notions of polarity. A number of studies split polarity classification into a binary problem: positive versus negative polarity (Pang et al., 2002; Pang and Lee, 2004). A stated opinion can be classified into either of these two categories. As such, opinionated documents are labeled as either expressing an overall positive or an overall negative tonality.
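The level-by-level use of classifier output described above, with word scores feeding sentence scores and sentence scores feeding a document-level polarity decision, can be illustrated with a toy lexicon pipeline. The lexicon and scoring here are purely hypothetical; real systems use much richer resources and context handling.

```python
# Hypothetical miniature polarity lexicon; real systems use large resources
# such as the subjectivity lexicon discussed later in this thesis.
LEXICON = {"good": 1, "great": 1, "love": 1,
           "bad": -1, "awful": -1, "disappointment": -1}

def word_score(word):
    # word level: look up the polarity of a single token
    return LEXICON.get(word.lower().strip(".,!?"), 0)

def sentence_score(sentence):
    # sentence level: aggregate the word-level output
    return sum(word_score(w) for w in sentence.split())

def document_polarity(sentences):
    # document level: aggregate the sentence-level output into one label
    total = sum(sentence_score(s) for s in sentences)
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

review = ["The acting was good", "Overall an awful disappointment"]
label = document_polarity(review)  # sentence scores +1 and -2, so "negative"
```

Even this toy pipeline shows why context matters: a single strongly negative sentence can outweigh several mildly positive ones at the document level.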
A lot of research in this area has focused on the sub-domain of product reviews (Turney, 2002; Pang et al., 2002; Dave et al., 2003; Hu and Liu, 2004; Popescu and Etzioni, 2005) and movie reviews (Turney, 11

2002; Pang et al., 2002). Difficulties in binary sentiment classification tasks lie in the fact that the overall sentiment sometimes is expressed in subtle details: “The acting was good, the cinematography was well done, the story was well build but overall the movie was a major disappointment.” As a whole, this sentence seems to convey a positive sentiment when viewing the number of positive adverbs used. Yet, the last part of the sentence gives the complete sentence an overall negative tonality. Contrary to approaching sentiment polarity challenges as a binary problem, sentiment polarity can also be addressed by splitting it into three categories: positive, negative and neutral. In more complex (i.e. longer) documents classifying features as “neutral” can improve classification accuracy. In binary classifications of such documents, each feature necessarily contributes to the classification. Neutral and informative statements will also be included and labeled as either positive or negative. Using a neutral category therefore, allows for omission of such expressions (Wilson et al., 2005). Polarity can also be represented as a range, for example assigning “stars” to a product. This is a popular method used by different companies to gather user feedback on products or provided services. Assigning “stars” is often applied for rating a product or service on a scale from 1-5 “stars”. The amount of “stars” assigned corresponds to the degree of appreciation for the product or service. Viewing the problem as a range, the classification task becomes a multi-class text categorization problem (Pang and Lee, 2005). Assigning ranges allows for making distinctions between the strength of a positive or negative sentiment towards a certain object. 2.3.2 Features In sentiment classification different results in terms of accuracy can be achieved using various techniques based on assigning semantic scores to different features types. 
Features can be n-grams (1 gram, 2 gram), word-phrases like POS-Tags or sentences. The features are assigned a value which denotes the semantic orientation of the feature. At different levels, different features can be beneficial in optimizing classification accuracy. In the related scientific field of document clustering the term frequency (TF) is used to calculate the Term Frequency - Inverse Document Frequency (TF-IDF) score to distinguish the features which are most informative in describing the document (Jones, 1972). While sentiment classification proves to be a more challenging task in comparison to document clustering, a principle similar to TF-IDF can also be applied. Following the TF-IDF principle higher informative value is assigned to unique features opposed to frequent features. Alternatively, Pang and Lee (2002) improve the performance of their sentiment classification system using feature presence instead of feature frequency and perform experiments with combinations of features: 1 gram features combined with 2 gram features and 1 gram features combined with POS-Tags. Next to feature presence and feature frequency, the relation between the position of features in a document and the expressed sentiment has been studied. Pang and Lee (2002) and Kim and Hovy (2006) extend their methods with information about the feature-position into a feature vector to increase sentiment classification accuracies. Wiebe et al. (2004) selects n-grams based on training on annotated documents to find the optimal n-gram in order to find the maximized classification accuracy. Another feature which proved to be beneficial in improving classification accuracies are POS-Tags. As mentioned before, adjectives and adverbs can be good indicators for sentiment in text. Adjective and adverb indicators have been studied by Hatzivassiliglou and Wiebe (2000) and Benamera et al. (2007) 12

among others. Turney (2002) has shown that part-of-speech patterns can help sentiment classification especially on a document level. Dealing with negated features continues to be a challenge in sentiment classification. Although n-grams are able to capture negation to some extent, difficulties in sentiment analysis tasks often arise when negations are present in the text. Following the most standard bag-of-words approach (considering a document just as a list of all the words the document consists of), sentences like: “I like the movie.” and “I don’t like the movie.” can be processed (and classified) in a very similar way as the difference between the sentences consists only of one single word. An approach using n-grams (over n=1) would be able to identify the second sentence as revealing negative sentiment but increases the amount of data necessary to capture the data. However, even the use of n-grams would not pose a solution when such a sentence would be rephrased as “I thought I would like the movie, but in the end I didn’t”, in which the negation has become so far removed from the initial statement that n-grams would not be able to realistically capture it. The approach by Pang and Lee (2008), amongst others, includes negation in document representation by appending this to the relevant term (e.g. “like-NOT”), but this approach results in incorrect behavior in sentences like: “No wonder everyone loves it”. The most successful approach towards solving these issues proved to be the utilization of Part-of-Speech techniques to prevent incorrect behavior, because POS-Tags are able to identify the syntactical structure in a sentence. 2.3.3 Domain dependence A second important aspect in sentiment analysis is the target the sentiment is expressed towards or the overall context the sentiment is expressed in: the domain. The intended target can be an object, a concept, a person, an opinion, etc. 
When the intended target is clear, as in movie-reviews, it is easy to identify the topic. The intended target can be very important in determining whether the expressed sentiment is positive or negative, as the connotation can differ between domains. Turney (2002) described that the expression “unpredictable” has a positive sentiment in a movie review (e.g. “The movie has an unpredictable plot.”), but has a negative sentiment in the review of an automobile (e.g. “The car's performance suffers from unpredictable steering.”). For this example “unpredictable” is a very domain-specific feature denoting different sentiment for different domains. In this example “great” could have alternatively been used to express positive sentiment across the different domains, making “great” a domain-independent feature. 2.4 Approaches A wide variety of methods and techniques has been researched to tackle the challenges in sentiment classification. There are two main techniques to solve the problems: approaches using subjectivity lexicons and machine learning techniques. In this section we will discuss both approaches. 2.4.1 Subjectivity lexicon based methods Lexicon based methods make use of a predefined lexicon to decide the semantic orientation of features. The lexicon can be defined manually or automatically. The research of Prinz (2004) makes use of a manually constructed lexicon based on psychological research such as Plutchik’s emotion model. Lexicon based methods are an intuitive way to perform sentiment classification and were one of the first methods to be tried in the field. Although lexicon based methods offer decent performance in various contexts they often give problems capturing context specific sentiment, as the aforementioned 13

example of “unpredictable”. Automated lexicons have been created in various studies. A large amount of research has been done on lexicons created with WordNet. WordNet is a large lexical database of the English language. In this database nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. WordNet has been used in research by Kim and Hovy (2004) and Hu and Liu (2005). In these studies, lexicons are automatically generated by starting with a small list of seed terms (e.g. love, like, nice). Next the annotated antonymy and synonymy properties of terms are used to group the terms into polarity based groups. Esuli and Sebastiani (2005) use a method in which they expand on the WordNet database by adding labels for polarity and objectivity to each term. Research by Denecke (2009) uses SentiWordNet to determine polarity. SentiWordNet is a lexical resource for opinion mining which assigns three sentiment scores to each synset of WordNet: positivity, negativity and objectivity. Other approaches use lexicons from OpinionFinder (Riloff et al., 2005), an English subjectivity analysis system which, among other things, classifies sentences as subjective or objective. 2.4.2 Machine learning techniques Alternatively, statistical analysis of large corpora can be done to automatically generate lexicons. Utilizing large corpora starts by selecting a suitable corpus with annotated examples. Often product review sites (like Ebay.com, RottenTomatoes.com and IMDB.com) are used for this purpose. In other approaches emoticons are used to serve as “noisy labels” to acquire large amounts of annotated examples. Read (2005) uses this approach on data gathered from Usenet, Go et al. (2009) gather data based on this principle from twitter. Once a suitable dataset has been obtained, different machine learning techniques can be used. 
Some of the most popular approaches are Support Vector Machines (Pang and Lee, 2002; Dave et al, 2003; Gamon, 2004; Pang and Lee, 2004; Airolodi et al., 2006, Wilson et al., 2009), Naive Bayes (Wiebe et al., 1999; Pang and Lee, 2002; Yu and Hatzivassiloglou, 2003; Melville et al., 2009), and Maximum Entropy based classifiers (Nigam et al., 1999; Pang and Lee, 2002, Go et al., 2009). Turney and Littman (2003) use word co-occurrence to infer semantic orientation of words. In their research they attempt two different techniques: Pointwise Mutual Information (PMI) (Church and Hanks, 1989) and Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997). PMI calculates word co-occurrences by using the largest corpora available: the Internet (via search engines hit counts). LSA uses a matrix factorization technique using Singular Value Decomposition to analyze the statistical relationship between words. Matsumoto et al. (2005) use text mining techniques to extract frequent word sub-sequences and dependency sub-trees from sentences in a document dataset and use them as features of support vector machines in order to perform document sentiment classification. 2.5 Cross-Domain Sentiment Classification Different solutions have been explored to tackle the challenges in cross-domain sentiment analysis. In research by Aue and Gamon (2005) sentiment classifiers have been customized to new domains with different approaches: training classifiers based on mixtures of labeled data from different domains, training an ensemble of classifiers from different domains and using small amounts of labeled data from the target domain. Blitzer et al. (2007) use the structural correspondence algorithm to perform a domain-transfer for various product reviews. In research by Ghosh and Koch a co-occurrence algorithm is used to provide a knowledge-transfer in such a way that an unsupervised classification can be made on a cross-domain basis on sentence level. 
Whitehead and Yaeger (2009) use an adjusted form of 14

cosine similarity between domain lexicons that can be used to predict which models will be effective in a new target domain. Furthermore, they use ensembles of classifiers which can be applied to address cross-domain issues. In research by He et al. (2011) the Joint Sentiment-topic (JST) model was used to detect sentiment and topic simultaneously from the text. The modification to the JST model by He et al. (2011) incorporates word polarity priors augmenting the original feature space with polarity-bearing topics. Impressive performance of 95% on the movie-review dataset and 90% on a multi-domain sentiment dataset were reported. Bollegala et al. (2011) describe a sentiment classification method which automatically creates a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains in order to find the association between words that express similar sentiments in different domains. The created thesaurus is then used to expand feature vectors to train a binary classifier. In studies by Wu et al. (Wu et al., 2009a; Wu et al., 2009b) the graph ranking algorithm shows promising results. This algorithm creates accurate labels from old-domain documents and “pseudo” labels of the new domain. This cross-domain study has been applied to electronic reviews, stock reviews and hotel reviews on Chinese datasets. In work by Tan et al. (2007) old-domain labeled examples are used in combination with new-domain unlabeled data as well. The top labeled informative examples from the source domain are used to determine informative examples in the target domain to perform a domain-transfer. Selecting informative examples to achieve effective domain transfers has also been attempted in research by Liu and Zhao (2009), Wu et al. (2010) and Tan et al (2009). In Tan et al. (2009) frequently co-occurring entropy measures are used to select generalizable features, which frequently occur in both domains and have similar occurring probability. 
These cooccurring features are used to construct seed lists, which are used in models to accommodate a domainadaptation. In this section a number of examples have shown that recent research in cross-domain sentiment analysis tries to use similarity between source and target domain in order to train sentiment classifiers for a target domain. Similarity can be based on some measure (e.g. cosine (Whitehead and Yaeger, 2009)) , co-occurrence (Ghosh and Koch) or can be based on topics detected in the text (He et al., 2011). All these similarity-based approaches attempt to detect features which can be used to perform a domain-transfer (in various research referred to as co-occurring features, sentiment sensitive thesaurus, informative examples, frequently co-occurring features). 2.6 This research This thesis aims to perform accurate cross-domain sentiment classification by excluding domain dependent features and find and utilize domain independent features. The ultimate aim here is to find a set of domain independent features that will accommodate a suitable domain-adaptation between a wide range of domains. Based on the promising reported accuracies for Chinese domain-adaptations, this thesis builds on the research by Tan et al. (2009) where frequently co-occurring entropy measures are used to detect domain independent features called generalizable features. The work in this thesis repeats the experiments suggested by Tan et al. (2009) for a large number of different English datasets. The data sources with which this thesis is concerned stem from a wide range of domains. In order to reach a consistent approach towards these data sources, the choice for a binary overall approach was essential. Adaption of the available data sources by adding more classes (e.g. neutral or different ranges) would render the existing annotation useless. 
The adaptation of the available data sources using multiple categories to a binary segmentation proved possible by omitting all neutral documents. To overcome the problem of detecting whether documents hold any sentiment at all (sentiment detection) only texts known to express a sentiment towards a specific topic are considered: different reviews from 15

different domains and tweets. In this research the standard Naive Bayes classifier used among others in (Pang and Lee, 2002) is first setup as a baseline for in-domain sentiment classification for the different used datasets. Next, the Naive Bayes classifier is used to evaluate baseline naive cross-domain sentiment classification. In the research by Tan et al. (2009) an adapted form of the Naive Bayes classifier is used to maximize a model using the EM algorithm to perform domain-adaptations will be used in this research as well. In order to provide the Naive Bayes classifier with features 1 gram features, 2 gram features and combinations of 1 gram + POS-pattern features and combinations of 1 gram + 2 gram features are used. The choice to explore these feature-sets originates from the research by Pang and Lee (2002). The conducted experiments will show that the approach suggested by Tan et al. (2009) doesn't yield a set of generalizable features as could be expected when applied to English datasets. As the importance of these features is stressed the improvement of these features is the main focus of this thesis. Two improvements are suggested: using a subjectivity lexicon to serve as generalizable features and the generation of an improved set of generalizable features using a combination of FCE and an importance measure.

16

Chapter 3 Models
This chapter introduces the approach suggested by Tan et al. (2009) and the two in this research suggested improvements for domain-adaptation in order to perform cross-domain sentiment classification. The main focus of this thesis revolves around generalizable features. Generalizable features are features that occur frequently in both source and target domain and have similar occurring probability in both domains. In this chapter the generalizable features are introduced properly by first discussing the method suggested by Tan et al. (2009). This approach has shown to be useful to perform domain-adaptations in sentiment classification across different domains. In Chapter 6 will be shown that just using the methods described by Tan et al. (2009) does not yield comparable impressive accuracy results for English datasets. A possible explanation for the under-performance could be the selected generalizable features which are not comparable to the ones reported by Tan et al. (2009). In this study detailed insight is provided in the performance and behavior of the selected generalizable features on different English datasets. The main contribution of this research lies in the introduction of two alternative methods for selecting generalizable features. The newly proposed methods build on the idea of generalizable features. As a first improvement the use of a subjectivity lexicon as generalizable features is suggested. This method is explained in detail after the introducing of the generalizable features. Next, the second improvement is explained suggesting the additional use of an importance measure to improve the selected generalizable features. After the introduction of the different alternatives for selecting generalizable features the basic Naive Bayes classifier is explained in detail, which is used in the first two experiments. This is followed by the explanation of the adapted form of the Naive Bayes classifier able to perform domain-adaptations. 
The features used in the different evaluated Naive Bayes classifiers are 1 gram features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features. 3.1 Generalizable features selection In this section the different methods of selecting generalizable features will be introduced. First, the approach suggested by Tan et al. (2009) is discussed. Next, the alternative approach using a subjectivity lexicon is introduced. Finally the approach using an importance measure to acquire better generalizable features is explained. 3.1.1 Frequently Co-occurring Entropy As discussed in Chapter 2 expressions can have different sentimental meaning in different domains. Expressions which have the same sentimental meaning between domains are defined as nondomainspecific features. For example the feature “excellent” is considered a nondomain-specific feature, having a positive meaning in every domain. In contrast to nondomain-specific features there are domain-specific features which have a different meaning in different domains. For example the feature “unpredictable” can be positive in context of a movie-review and negative in context of a review about (the steering of) a car (Turney, 2002). As such, the features in a document can be divided into two 17

categories: domain-specific and nondomain-specific features. Furthermore, a subdivision can be made between features expressing sentiment and features expressing no sentiment (non sentimental). In this way, considering a document as a collection of features any given document can be expressed as a vector:

F=

{

domain specific

nondomain specific

{ {

sentiment (ds) nonsentiment (dns)

sentiment (nds) nonsentiment (ndns)

Given the document's representation as a feature by identifying the different features within a document it becomes possible to use the different characteristics of the features in the context of domainadaptation. In the research by Tan et al. (2009) the different features are detected and are used in order to facilitate a domain-adaptation using an adapted form of the Naive Bayes classifier (further discussed in section 3.2.2), using EM to optimize a Naive Bayes model. In order to do so both the domain specific sentiment features as well as the nondomain specific features are used. The approach uses labeled training data from a source domain (calculating the Naive Bayes probabilities for positive and negative features), and the data-distribution of the target-domain (only the feature counts). This way the target-domain is considered to be unlabeled making it possible for any unlabeled set of documents from any domain to perform a domain-adaptation. The main benefit of this approach would obviously be that no set of labeled data from a target-domain would be necessary. This would make it possible to quickly and cheaply customize domain-specific classifiers. The nondomain-specific features are called generalizable features. Generalizable features can be used as a bridge between source and target domain in order to adapted a classifier (trained in a sourcedomain) to a specific target-domain. In order to do so an adapted form of the Naive Bayes classifier is used (further discussed in section 3.2.2). In order to make distinction between domain-specific and generalizable features (domain-independent features) Tan et al. (2009) introduce the Frequently co-occurring entropy (FCE) measure. FCE expresses a measure of feature similarity between two domains based on two criteria: the frequency of occurrence in both domains and the probability of a feature occurring in both domains. This is formalized in Formula 1: f w =log ( P S (w)×P T ( w) ) ∣P S (w)−PT (w)∣+β

Formula 1: Frequently co-occurring entropy.

In Formula 1 PS (w) and PT (w) indicate the probability of a feature w in the source and in the target domain further defined in Formula 2. The β factor is introduced in order to prevent the case when PS (w)=PT (w) where the formula could lead to a zero division. This value is set to a small number.

18

( N S +α) PS ( w)= S w (∣D ∣+2×α) ( N T +α) PT ( w)= T w (∣D ∣+2×α)
Formula 2: The probability of feature w in source and target domain.

In Formula 2 N S and N T are the number of examples with word w in the source and target w w domain. ∣D S∣ and ∣D T∣ are the number of examples in the source and the target domain. To account for overflow α is set as a small number. Using this calculation for all the features in source and target domain results in a ranked list of generalizable features. The research by Tan et al., (2009) performed on Chinese datasets results in lists of generalizable features which intuitively make sense. A selection of the features from a cross-domain adaptation from a stock review to a computer review dataset from the research by Tan et al. (2009) can be found in Table 1 . not bad not good inferior to adverse unclear cannot deficient bad independent wonderful cold good benefit pleased characteristic difficulty fast quick monopolize behindhand
slow beautiful difficult challenge can deceive injure lose outstanding incapable attractive void stodgy excellent superority excellence effective natural positive obstruct

Table 1: The top 40 generalizable features between stock and computer reviews as reported by Tan et al. (2009)

What can be noticed from Table 1 is the different kinds of features which are reported. Most of the reported features are 1 gram features, but some of them are 2 gram features (e.g. not bad, not good, inferior to). As the experiments in the next chapter will show using FCE on English datasets will result in features which are not representative for the examples in Table 1. The difference between the results for English and Chinese may be explained by a number of reasons. The first two examples are complex terms in the sense that they consist of two expressed sentiments: “not” and “good”, “not” and “bad”. Tan et al. (2009) don't report any additional filtering process performed in order to acquire an index of the data nor are the used features (1 gram, 2 gram, POS-Tag) mentioned. As the reported features are translations from Chinese characters the syntactic properties of the language could be different in the sense that these features can be easier gathered compared to English. Therefore, in this research an analysis is performed on the performance of FCE on English datasets. In the next section this thesis proposes two alternatives based on the idea of generalizable features in order to acquire a set of features which are more representative for English and can be used to perform cross-domain sentiment classification. 3.1.2 Using a subjectivity lexicon as generalizable features In order to acquire improved generalizable features the first suggested improvement in this thesis suggests the use of a subjectivity lexicon as an alternative to selecting generalizable features with FCE. 19

The idea behind this suggestion is quite simple. As generalizable features are defined as features consistently used between source and target domain, these feature need to be domain-independent between domains. Instead of automatically selecting generalizable features with FCE a qualitative list of domain-independent sentimental features can also be created manually. Such manual created lists of generalizable features have been created in lexicon-based sentiment research in the form of sentiment lexicons. As described in Chapter 2 lexicon based approaches are the most basic approach towards sentiment analysis. The basic ingredients for a lexicon based sentiment classifier are two lists: one list of terms denoting positive sentiment and one list of terms denoting negative sentiment. Often these lists are formed by linguists (in general applications) or by human annotators for more task-specific settings. In this way these lists can also be viewed as manually established generalizable features. The benefit of a lexicon-based approach is that there is little discussion about the terms from which these lists consists as they are well founded with basis in linguistics. The disadvantage of lexicon-based approaches is that domain-specific knowledge can only be captured by manual annotation by an expert for the specific domain. The most often used lexicon in sentiment classification tasks is the MPQA subjectivity lexicon. In the MPQA corpus, where the MPQA subjectivity lexicon originated from, subjective expressions of varying lengths are marked, from single words to long phrases. In research by (Wilson et al., 2005) the sentiment expressions – positive and negative expressions of emotions, evaluations, and stances were extracted. The terms in the MPQA subjectivity lexicon were collected from a number of sources. Some were culled from manually developed resources. Others were identified automatically using both annotated and unannotated data. 
A majority of the clues were collected as part of the work reported in (Riloff et al., 2003). Processing of the MPQA subjectivity lexicon resulted in a list of 4063 negative 1 gram features and 79 negative 2 gram features. The length of the processed positive lexicons are 2242 1 gram and 61 2 gram features. In related lexicon-based research (Esuli and Sebastiani, 2005) the Mircro WNOp corpus is used. The Micro WNOp corpus uses WordNet. WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides general definitions, and records the various semantic relations between the synonym sets. The aim of WordNet is to produce a combination of a dictionary and thesaurus that is more usable in practice. The lexicon contains a total of 11788 terms, 1915 of them are labeled as positive and 2291 are labeled as negative (the remaining 7582 terms, not belonging either to the positive or negative category, can be considered to be (implicitly) labeled as objective). In this research the Micro WNOp corpus is used as an additional source in combination to the previously discussed MPQA subjectivity lexicon. The Mirco WNOp corpus let after preprocessing the available annotations from the different groups to a set of 1651 terms from which 845 are marked positive and 806 are marked negative. Combining the Mirco WNOp corpus and the MPQA subjectivity lexicon results in a total of 4365 negative 1 gram features, 138 negative 2 gram features, 2556 positive 1 gram features and 119 positive 2 gram features. The combination of these two lexicons leads to quite a large number of features considered to be domain-independent. It could be argued that the large amount of features can't all be completely domain-independent. Looking at the resulting lists of features some features are definitely open for discussion to be used as generalizable features in context of this research. 
Although this may be the case, these lexicons have been used in a wide variety of 20

researches for different domains. This, and the consistent use between domains justifies the use of these lists of features as a manual compiled list of generalizable features. The combination of the positive and negative terms is used as a set of generalizable features with which the EM algorithm maximizes the probabilities in a Naive Bayes model combining source and targetdomain, explained in the sections below. 3.1.3 Using an importance measure in combination with FCE to select generalizable features The second suggestion to improve the used generalizable features is the combination of FCE and an importance measure to acquire better generalizable features. This thesis introduces a method combining the by FCE acquired generalizable features with a formula expressing the importance of terms within a sentimental context. As experimentation will show, using solely FCE on the English datasets used in this research results in generalizable features which contain many common words (like articles). In order to prevent the used generalizable features to contain such terms this research introduces an importance measure able to distinct terms important in the used sentimental context. In order to produce such as measurement the term frequency (TF) is used to calculate the Term Frequency - Inverse Document Frequency (TF-IDF) score to distinguish the features which are most informative in describing the sentimental context. With the TF-IDF principle higher informative (or sentimental) value is assigned to unique features opposed to frequent features. The TF-IDF is calculated separately over both the positive and negative document collection from the source domain by the standard TF-IDF calculation:

tf ∗idf (t , d , D)=tf (t , d)×idf (t , D)
Formula 3: standard TF-IDF

In Formula 3, t is the term, d is the document and D is the collection of documents in the corpus. idf(t, D) is defined as:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

Formula 4: standard IDF
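The TF-IDF computation of Formulas 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical toy corpus; the function names are our own and the division-by-zero adjustment discussed below is left out for clarity:

```python
import math

def tf(term, doc):
    # Raw count of the term in a single tokenized document.
    return doc.count(term)

def idf(term, docs):
    # Formula 4: log of the corpus size over the number of documents containing the term.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    # Formula 3: tf(t, d) x idf(t, D).
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["great", "camera", "great", "lens"],
    ["poor", "battery", "poor", "screen"],
    ["great", "battery"],
]
print(tfidf("great", docs[0], docs))  # 2 * log(3/2)
```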

In Formula 4 |{d ∈ D : t ∈ d}| is the number of documents in which the term t appears. If the term is not in the corpus, this leads to a division by zero. It is therefore common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|. After calculating the TF-IDF scores for both the positive and negative examples from the source domain, the importance of a term is expressed by Formula 5:

importance(t, c) = tfidf(t, d, c) / ( tfidf(t, d, P) + tfidf(t, d, N) + α )

Formula 5: Importance expressed in terms of TF-IDF.

In Formula 5 the importance of term t for class c is expressed as the TF-IDF score of term t in class c divided by the sum of both TF-IDF scores; α is set to a small number in case there is no TF-IDF score for one of the classes P or N. To improve the importance measure further a default stop-word list is used to filter out common stop-words. When calculating 1 gram features all stop-words are filtered out. In the case of 2 gram features only one stop-word is allowed to be in the term (e.g. "very comfortable" is a feature which should be kept, "I am" should be filtered out). To combine this importance measure with the earlier used FCE score, the importance score is added inside the standard FCE formula:

f_w = log( importance(w) + ( P_o(w) × P_n(w) ) / ( |P_o(w) − P_n(w)| + β ) )
Formula 6: the importance measure combined with the FCE score.
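A minimal sketch of Formulas 5 and 6, assuming the per-class TF-IDF scores and the domain probabilities P_o(w) and P_n(w) have already been computed elsewhere; the function names and the default values for α and β are hypothetical:

```python
import math

def importance(tfidf_pos, tfidf_neg, cls, alpha=1e-6):
    # Formula 5: the class-specific share of a term's TF-IDF mass.
    # tfidf_pos / tfidf_neg are the term's TF-IDF scores over the positive (P)
    # and negative (N) source-domain collections; alpha avoids division by zero.
    score = tfidf_pos if cls == "P" else tfidf_neg
    return score / (tfidf_pos + tfidf_neg + alpha)

def combined_fce(imp, p_old, p_new, beta=1e-6):
    # Formula 6: the FCE score with the importance measure added inside the log.
    # p_old / p_new are the term's probabilities in the old (source) and
    # new (target) domain; beta avoids division by zero.
    return math.log(imp + (p_old * p_new) / (abs(p_old - p_new) + beta))

# A term that dominates the positive collection gets a high importance for P.
print(importance(0.8, 0.2, "P"))
```

Terms used consistently in both domains (p_old close to p_new) get a high combined score, which is the FCE intuition; the importance term then demotes frequent but sentimentally empty words such as articles.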

In the context of Formula 6, w instead of t is used to denote a term. The intuition behind this formula is that by computing an importance measure expressed in TF-IDF scores over both the positive and negative data distributions, importance in terms of sentimental context is added to the formula. Combining this with the FCE measure leads to a ranked list of qualitatively improved generalizable features containing fewer irrelevant terms, because common terms (such as articles) have low importance scores.

3.2.1 Naive Bayes classifier

The Naive Bayes (NB) classifier is one of the most widely used algorithms for document classification (McCallum and Nigam, 1998). For this research the multinomial Naive Bayes (multinomial NB) model, a probabilistic learning method, is used in the first two experiments. In the multinomial model a document is considered to be an ordered sequence of terms, drawn from the same vocabulary V. The assumption is made that the lengths of the documents are independent of their class. Another assumption is that the probability of each term in a document is independent of the term's context and position in the document. The probability of a document d being in class c is therefore computed as:

P(c|d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k|c)

Formula 7: Multinomial Naive Bayes model

In Formula 7 P(t_k|c) is the conditional probability of term t_k occurring in a document of class c. The conditional probability P(t_k|c) is interpreted as a measure of how much evidence t_k contributes that c is the correct class. P(c) is defined as the prior probability of a document occurring in class c. The prior probabilities are used in case a document does not provide any evidence for one class over the other; in that case the class with the highest prior probability is chosen. 〈t_1, t_2, ..., t_{n_d}〉 are the tokens in d that are part of the vocabulary and are used for classification, and n_d is the number of such tokens in d. Apart from a single word, a token can be any n-gram or other feature.

In text classification tasks the goal is to find the best class for the document. In the case of sentiment classification NB can be used to find the best sentiment class (positive/negative) for the document. The best class in NB classification is the most likely or maximum a posteriori (MAP) class c_MAP, defined as:

c_MAP = argmax_{c ∈ C} P̂(c|d) = argmax_{c ∈ C} P̂(c) ∏_{1 ≤ k ≤ n_d} P̂(t_k|c)

Formula 8: Maximum a posteriori class

In Formula 8 P̂ is used because the true values of the parameters P(c) and P(t_k|c) cannot be measured exactly; instead these parameters are estimated based on the training set. To estimate the parameters P̂(t_k|c) and P̂(c) the maximum likelihood estimate (MLE) is used; this is simply the relative frequency and corresponds to the most likely value of each parameter given the training data. For the priors this estimate is:

P̂(c) = N_c / N

Formula 9: Maximum likelihood estimate.

In Formula 9 N_c is the number of documents in class c and N is the total number of documents in the training set. The conditional probability P̂(t_k|c) is estimated as the relative frequency of term t (where t can be any n-gram) in documents belonging to class c as:

P̂(t_k|c) = T_ct / Σ_{t' ∈ V} T_{ct'}

Formula 10: Conditional probability

In Formula 10 T_ct is the number of occurrences of t in the training documents from class c, including multiple occurrences of a term in a document. The positional independence assumption is made here: T_ct is a count of occurrences in all positions k in the documents in the training set. In other words, no different estimates for different positions are computed; for example, if a word occurs twice in a document at positions k_1 and k_2 then no distinction is made: P̂(t_{k_1}|c) = P̂(t_{k_2}|c).

The problem with the MLE estimate is that it can result in a zero probability if a term-class combination does not occur in the training data. Even if all but one term-class combinations in a document yield high evidence for a certain class, one single unknown term leads to a zero probability for a single term-class combination, which in turn leads to a zero probability for the complete document. This is a sparseness problem: the training data is never large enough to represent the frequency of previously unseen term-class combinations. A popular method to deal with this sparseness problem is add-one or Laplace smoothing, which simply adds one to each count:

P̂(t|c) = ( T_ct + 1 ) / Σ_{t' ∈ V} ( T_{ct'} + 1 ) = ( T_ct + 1 ) / ( Σ_{t' ∈ V} T_{ct'} + B )

Formula 11: Laplace smoothing

In Formula 11 B = |V| is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in.
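Formulas 8-11 together can be sketched as a small multinomial NB trainer and classifier. This is a toy illustration with hypothetical data and function names, not the exact implementation used in the thesis:

```python
import math
from collections import Counter

def train_nb(docs):
    # Train a multinomial NB model (Formulas 9-11) from (tokens, label) pairs.
    n_docs = Counter(label for _, label in docs)
    counts = {label: Counter() for label in n_docs}   # T_ct per class
    vocab = set()
    for tokens, label in docs:
        counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n_docs[c] / len(docs) for c in n_docs}            # Formula 9
    cond = {c: {t: (counts[c][t] + 1) / (sum(counts[c].values()) + len(vocab))
                for t in vocab} for c in n_docs}                   # Formula 11
    return priors, cond, vocab

def classify_nb(tokens, priors, cond, vocab):
    # Formula 8: pick the maximum a posteriori class using log probabilities.
    def log_posterior(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in vocab)
    return max(priors, key=log_posterior)

# Hypothetical toy training data.
train = [(["great", "fun"], "pos"), (["great", "acting"], "pos"),
         (["boring", "plot"], "neg"), (["boring", "dull"], "neg")]
priors, cond, vocab = train_nb(train)
print(classify_nb(["great", "plot"], priors, cond, vocab))  # pos
```

Note that unseen tokens (not in the vocabulary) are simply skipped, while Laplace smoothing keeps seen-but-rare term-class combinations from zeroing out the posterior.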

In Go et al. (2009) a slight variation on the multinomial Naive Bayes model is described which proved to increase the classification accuracy of the sentiment in tweets:

P_NB(c|d) = P(c) ∏_{i=1}^{m} P(t_i|c)^{n_i(d)} / P(d)

Formula 12: Multinomial Naive Bayes

In Formula 12 t_i represents a term (which, again, can be any n-gram) and n_i(d) represents the count of feature t_i in d, where there are a total of m features. Experimentation showed that applying this addition only has a beneficial effect on the results for the dataset consisting of tweets. Therefore, in favor of the cross-domain adaptation, this suggested addition is ignored. After calculating the class probabilities P̂(c|d) for each sentence individually, a classification of the document d is given by the class c with the maximum number of sentences classified as that class. In case the number of sentences of each class is equal, the prior P̂(c) determines the document's class c. In case the priors are also equal, the document is returned as being positive.

3.2.2 Adapted Naive Bayes Classifier

In this section the version of the Naive Bayes classifier adapted with the EM algorithm for domain adaptation, introduced by Tan et al. (2009), is explained. The adapted Naive Bayes classifier (ANB) takes labeled data from the source domain and considers the target domain to be unlabeled. In order to perform a domain adaptation the assumption is made that both domains obey the same distribution over the different observed features. This implies that the combined data can be produced by a mixture model which has a one-to-one correspondence between generative mixture components and classes. In a cross-domain setting this requirement does not hold. This problem is solved by Tan et al. (2009) with an adapted form of the Naive Bayes classifier using generalizable features to initialize a Naive Bayes model in which the probabilities are maximized using the EM algorithm. The expectation maximization (EM) algorithm is a widely used method in text classification tasks (Go et al., 2009). The EM algorithm is an iterative method for finding the maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models. The EM iterations alternate between two steps. The first (E) step calculates the expectation of the log-likelihood evaluated using the current estimate for the parameters. Next, the (M) step computes the parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next iteration. When estimating the probabilities of a mixed Naive Bayes model combining data from source and target domain, the issue is that the chosen generalizable features do not accurately cover the examples in the target domain. This problem is solved by Tan et al. (2009) by using a weighted Naive Bayes classifier. The weighting factor decreases the weight of the source-domain data in favor of the target-domain data after each EM iteration. In this way a local maximum likelihood parameterization for the target-domain data is estimated. The log likelihood in this regard is provided by:

l(θ | D, z) = Σ_i Σ_k z_ik log( P(c_k | θ) P(d_i | c_k ; θ_k) )

Formula 13: Log likelihood

In Formula 13, θ_k indicates the parameter of the mixture model for class c_k. D includes the source-domain and target-domain data. c provides the labels "0" for positive and "1" for negative, and z_ik is set to "0" or "1" depending on which class is used. The EM algorithm iterates to find a local maximum parameterization for l(θ | D, z). In the E-step the conditional probability P(c_k | d_i) is calculated for every example in the target domain by using the current estimate θ. In the M-step a new maximum likelihood estimate for θ is computed using the current estimates P(c_k | d_i) for the examples from the target domain. The formulas below give the E-step and M-step for the adapted Naive Bayes (ANB) classifier:

E-step:

P(c_k | d_i) ∝ P(c_k) ∏_{t ∈ V} P(w_t | c_k)^{N_{t,i}}

M-step:

P(c_k) = ( (1−λ) Σ_{i ∈ D_S} P(c_k | d_i) + λ Σ_{i ∈ D_T} P(c_k | d_i) ) / ( (1−λ)|D_S| + λ|D_T| )

P(w_t | c_k) = ( (1−λ) η_t N^S_{t,k} + λ N^T_{t,k} + 1 ) / ( (1−λ) Σ_{t'=1}^{|V|} η_{t'} N^S_{t',k} + λ Σ_{t'=1}^{|V|} N^T_{t',k} + |V| )

with

N^S_{t,k} = Σ_{i ∈ D_S} N_{t,i} P(c_k | d_i)    and    N^T_{t,k} = Σ_{i ∈ D_T} N_{t,i} P(c_k | d_i)

Formula 14: formulas for EM
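As a minimal sketch, the λ schedule of Formula 15 below and the weighted class-prior update of the M-step could look as follows. The names are hypothetical and the full ANB additionally updates P(w_t|c_k) with the η_t mask over generalizable features; this only illustrates how the source domain is phased out:

```python
def lambda_schedule(tau, delta):
    # Formula 15: lambda grows with the iteration step tau until it reaches 1.
    return min(delta * tau, 1.0)

def class_prior(post_src, post_tgt, lam):
    # Weighted M-step for P(c_k): post_src / post_tgt are the current
    # posteriors P(c_k | d_i) for the source- and target-domain documents.
    # As lambda moves towards 1, the source-domain evidence is phased out.
    num = (1 - lam) * sum(post_src) + lam * sum(post_tgt)
    den = (1 - lam) * len(post_src) + lam * len(post_tgt)
    return num / den

# With delta = 0.2 the source domain is fully ignored from iteration 5 on.
for tau in range(1, 7):
    lam = lambda_schedule(tau, 0.2)
    print(tau, lam, class_prior([1.0, 0.0, 1.0], [0.9, 0.8], lam))
```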

In Formula 14, N^S_{t,k} and N^T_{t,k} are the number of appearances of word w_t in the source-domain and target-domain data of class c_k. λ is a parameter that controls the impact between the source-domain and the target-domain data. It changes with each iteration as follows:

λ = min{ δ × τ, 1 }

Formula 15: definition of λ

In Formula 15, τ indicates the iteration step and δ is a constant which controls the strength of the update of the parameter λ, with δ ∈ (0, 1). In the experiments different values for δ will be evaluated. η_t is defined as:

η_t = 1 if w_t ∈ V_FCE, 0 if w_t ∉ V_FCE

Formula 16: definition of η_t

In Formula 16 V_FCE is the set of generalizable features. The discussed formulas indicate that for the old-domain data, throughout all the iterations, only the generalizable features are used, while for the

new-domain data all features, including domain-specific features, are used.

3.2.3 Features

In order to train a Naive Bayes model the classifier needs to be provided with a distribution of features over the training data. The features taken into account in this research are 1 gram features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features. Previous work (Hatzivassiloglou and Wiebe, 2000; Wiebe et al., 2001) has shown that, in contrast to using n-grams, an additional step of linguistic analysis can be beneficial for classification accuracies. In particular, adjectives have proven to be good indicators of subjective and/or evaluative sentences. A popular method for sentiment classification tasks therefore is constructing a training corpus based on phrases containing adjectives or adverbs. In this research 1 gram features will be combined with phrases containing adjectives or adverbs. This combination will be referred to as the 1 gram + POS-pattern feature set. A possible shortcoming of this approach is that isolated adjectives may indicate subjectivity but may not encompass the context of the semantic orientation. A popular solution to this shortcoming is to extract two consecutive words, where one member of the pair is an adjective or an adverb and the second one is a word extracted to capture the context of that particular adjective or adverb. To provide these syntactical annotations the word type of each word in a text has to be identified. A popular method to annotate these POS-tags automatically is the Brill tagger (Brill, 1994). The patterns which are extracted as being indicative towards sentiment and are used as 1 gram + POS-pattern features in this research are listed in Table 2.

First Word           Second Word            Third Word (Not Extracted)
JJ                   NN or NNS              anything
RB, RBR or RBS       JJ                     not NN nor NNS
JJ                   JJ                     not NN nor NNS
NN or NNS            JJ                     not NN nor NNS
RB, RBR or RBS       VB, VBD, VBN or VBG    anything

Table 2: syntactical patterns extracted as being indicative towards sentiment
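The extraction of the patterns in Table 2 can be sketched as follows, assuming the input has already been POS-tagged (e.g. by a Brill tagger) into (word, tag) pairs. This is an illustrative sketch with hypothetical names, not the thesis implementation:

```python
def extract_pos_patterns(tagged):
    # Extract two-word phrases matching the patterns of Table 2 from a list
    # of (word, POS-tag) pairs.
    NOUN = {"NN", "NNS"}
    ADV = {"RB", "RBR", "RBS"}
    VERB = {"VB", "VBD", "VBN", "VBG"}
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # Tag of the third word; None at the end of the sequence.
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if t1 == "JJ" and t2 in NOUN:
            phrases.append((w1, w2))          # JJ + NN/NNS, third word: anything
        elif t1 in ADV and t2 == "JJ" and t3 not in NOUN:
            phrases.append((w1, w2))          # RB* + JJ, not followed by a noun
        elif t1 == "JJ" and t2 == "JJ" and t3 not in NOUN:
            phrases.append((w1, w2))          # JJ + JJ, not followed by a noun
        elif t1 in NOUN and t2 == "JJ" and t3 not in NOUN:
            phrases.append((w1, w2))          # NN/NNS + JJ, not followed by a noun
        elif t1 in ADV and t2 in VERB:
            phrases.append((w1, w2))          # RB* + verb forms, third word: anything
    return phrases

tagged = [("really", "RB"), ("great", "JJ"), ("movie", "NN")]
print(extract_pos_patterns(tagged))  # [('great', 'movie')]
```

In the example, "really great" is not extracted because the third word is a noun (pattern 2), while "great movie" matches the first pattern.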

The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the VB tags are verbs. The second pattern means that two consecutive words are extracted if the first word is an adverb and the second word is an adjective, but the third word (which is not extracted) cannot be a noun. NNP and NNPS (singular and plural proper nouns) are avoided, so that the names of the items in the data cannot influence the classification.

3.3 Lexicon-based classifier

Apart from the use of a lexicon as an alternative to automatically gathered generalizable features, this research uses the MPQA subjectivity lexicon for a lexicon-based classifier. This classifier is used as a baseline sentiment classifier and can also be used as a reference against the variations of automatically generated generalizable features. The reason to include a lexicon-based approach in this research, in spite of it being outperformed by many other methods in other research (multinomial Naive Bayes, POS-tag based methods, SVM), is that in a cross-domain setting it might be beneficial to just stick to the most basic subjective terms instead of using statistical methods to find a great number of relevant subjective terms by analyzing (domain-specific) training data. This way it serves as a useful benchmark, especially when considering domains where few training examples are available. In case of few training examples the retrieved vocabularies of features will be small and the probability that the essence of the domain is captured is small. In this case it is likely that using a subjectivity lexicon will yield better accuracy results and will serve as a better baseline to compare against. On the other hand, when analyzing greater amounts of training data, over-fitting might become a problem. Comparing results from different approaches in different domains to a lexicon-based approach might thus provide a useful minimal baseline. The lexicon-based classifier is a very simple classifier: by counting the number of positive terms versus the number of negative terms in a given input text, the subjectivity lexicon classifier decides whether the input text is positive or negative. This is expressed by Formula 17:

c_lexicon = argmax_{c ∈ C} ( count(positive_terms ⊂ input_text), count(negative_terms ⊂ input_text) )
Formula 17: Lexicon-based classification

In Formula 17 c lexicon denotes the class of the input text, taking the maximum of the number of positive terms in the input text and the number of negative terms in the input text.
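A minimal sketch of the lexicon-based classifier of Formula 17, using tiny stand-in lexicons instead of the full MPQA subjectivity lexicon; the tie handling shown here (returning no decision) is an assumption for illustration, not part of Formula 17:

```python
def classify_lexicon(tokens, positive_terms, negative_terms):
    # Formula 17: count lexicon hits and return the majority sentiment.
    pos = sum(1 for t in tokens if t in positive_terms)
    neg = sum(1 for t in tokens if t in negative_terms)
    if pos == neg:
        return None  # tie: left undecided in this sketch
    return "positive" if pos > neg else "negative"

# Hypothetical stand-in lexicons; the thesis uses the MPQA subjectivity lexicon.
POS = {"great", "excellent", "love"}
NEG = {"poor", "boring", "hate"}
print(classify_lexicon("a great movie but a boring and poor plot".split(), POS, NEG))  # negative
```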


Chapter 4 Experimental Setup
In this chapter the setup of the experiments to evaluate the cross-domain behavior of the different proposed classifiers is discussed. In order to properly evaluate the cross-domain performance, a wide variety of datasets is used, originating from a wide variety of sources and domains. The first part of this chapter describes the used datasets in more detail, as well as the preprocessing steps taken to enable a consistent use of all the datasets by the classifiers. The second part of this chapter describes the setup of the different conducted experiments and the chosen evaluation metric.

4.1.1 Datasets

In order to utilize statistical methods and machine learning techniques to train sentiment analysis classifiers, substantial datasets are necessary. To create these datasets, vast amounts of labeled documents need to be provided as training data. The availability of vast amounts of training data is often a problem, especially if context-specific training corpora are necessary. The labeled data in these training corpora can be a list of terms with their polarity assigned to them, or annotated text - for example product reviews with the sentiment towards the product expressed by means of a star or point system. Available datasets from previous research originate from a variety of domains. One of the datasets used is the movie-review dataset introduced by Pang et al. (2002). This dataset, prominent in sentiment analysis research, has been used in a number of follow-up studies by Pang and Lee (2008) and has been the subject of a wide variety of studies: Aue et al. (2005), Turney (2002), Whitehead and Yaeger (2009) and Yessenalina et al. (2010) to name a few. Some of the other datasets used in this research consist of automatically gathered tweets or Usenet posts. These datasets were introduced by Read (2005) and Go et al. (2009).
More datasets of examples expressing sentiment towards different products in different categories have been described by Whitehead and Yaeger (2009). In this research datasets containing reviews about cameras, camps, doctors, drugs, laptops, lawyers, music, radio and restaurants are used. The datasets selected for this research are listed below:

• Movie-review data, Pang et al. (2002)
• Movie-scale review data, Pang and Lee (2005)
• Review-plus-rating dataset(s), Whitehead and Yaeger (2009), consisting of:
  ◦ Digital camera reviews
  ◦ Summer camp reviews
  ◦ Reviews of physicians
  ◦ Reviews of pharmaceutical drugs
  ◦ Laptop reviews
  ◦ Reviews of lawyers
  ◦ Musical CD reviews
  ◦ Reviews of radio shows
  ◦ Television show reviews
• Multi-domain sentiment dataset used (among others) in Blitzer et al. (2009), consisting of:
  ◦ Apparel reviews
  ◦ Reviews of automobiles
  ◦ Reviews of baby products
  ◦ Reviews of beauty products
  ◦ Reviews of camera and photo products
  ◦ Reviews of cell phones and cell phone services
  ◦ Reviews of computer and video games
  ◦ DVD reviews
  ◦ Reviews of electronic products
  ◦ Gourmet food reviews
  ◦ Reviews of groceries
  ◦ Reviews of health and personal care products
  ◦ Jewelry and watches reviews
  ◦ Kitchen and housewares appliances reviews
  ◦ Magazine reviews
  ◦ Musical CD reviews
  ◦ Reviews of musical instruments
  ◦ Reviews of office products
  ◦ Reviews of outdoor living
  ◦ Software reviews
  ◦ Reviews of sports and outdoors products
  ◦ Tools and hardware reviews
  ◦ Toys and games reviews
  ◦ Reviews of video products
• Twitter-sentiment-dataset (1.8M tweets) and annotated sentiment, Go et al. (2009)
• Restaurant reviews dataset, Snyder and Barzilay (2007)

The listed datasets are discussed in more detail in the following sections.

4.1.2 Movie review dataset

The movie-review dataset has been made available by Pang et al. (2002) and has been used in a variety of follow-up research. The movie-review domain has been chosen for two main reasons. First, the domain is experimentally convenient because a lot of reviews are available in the form of on-line collections. The reviewers often summarize their overall sentiment by means of a rating indicator: a mark on a scale (from one to ten, or one to five for example), which gives a nice machine-extractable annotation for the review as a whole. This way annotations are provided, which solves one of the foremost problems in gathering data: no human annotation is necessary. In research by Turney (2002) it was found that movie reviews are a difficult domain for sentiment classification, reporting accuracies of at most 65%. This accuracy is based on the amount of annotator agreement when performing human annotation with three different human annotators. For the movie-review dataset the Internet Movie Database (IMDB) archive of the rec.arts.movies.reviews newsgroup was used as a source. Only reviews were selected for which an explicit rating was expressed (marks on a scale). The ratings were converted into a scale of positive, negative or neutral. To prevent biasing, a limit was imposed for dominant reviewers: a maximum of 20 reviews per category, per reviewer. The original movie-review dataset consisted of 752 negative and 1301 positive reviews, with a total of 144 reviewers. Different additions led to a dataset consisting of a total of 2000 annotated reviews, with 1000 positive and 1000 negative annotations. The reviews in this domain are all complete reviews consisting of multiple lines of text. The average number of words for a review is 651, the average number of sentences is 33.

4.1.3 Movie-scale-review dataset

In follow-up research by Pang and Lee (2005) the rating-inference problem is addressed. Pang and Lee take sentiment analysis one step further in terms of expressing sentiment on a scale. This approach differs from the traditional approach, which classifies text into either one of two categories: positive or negative. Instead the attempt is made to classify the actual scale of the review. Pang and Lee argue that it is often beneficial to have a more detailed annotation available than just positive or negative. This can especially be beneficial when ranking items by recommendation or comparing several reviewers' opinions. For this reason data was collected from the Internet Movie Database archive, following the same methodology as described for the movie-review dataset, resulting in four corpora of movie reviews consisting of 1770, 902, 1307 and 1027 documents, each from the same reviewer. Attempting to classify the actual scale of the review is beyond the scope of the general cross-domain classifier attempted in this research, but the movie-scale data seemed like a useful resource. The movie-scale-review data was preprocessed based on a 3 class annotation in order to only keep the annotated examples where "0" is annotated as negative and "2" as positive. The text annotated as "1" was ignored, as these examples are expected not to state a clear positive or negative sentiment but a mixed sentiment. In contrast to the movie-review dataset, the sentiment scale review dataset was annotated on a line-by-line basis, making it possible for a single review to capture multiple sentiments. For the purpose of this research each line is processed as an independent document. Processing the sentiment scale dataset led to 3091 annotated lines from reviews of which 1197 are annotated negative and 1894 positive. The average number of words for a review is 387, the average number of sentences is 21.

4.1.4 Review-plus-rating dataset

In previous attempts to build a general purpose cross-domain sentiment mining model, Whitehead and Yaeger (2009) use the review-plus-rating dataset. The datasets represent a wide variety of domains that could all benefit from the creation of accurate sentiment mining models. The datasets have been made available by Whitehead and Yaeger (2009) and contain the following subsets:

• Camera: Digital camera reviews acquired from Amazon.com. The reviews were taken from cameras that had a large number of ratings. This dataset and the laptop review set both fall under the broader domain of consumer electronics. The camera dataset consists of 498 annotated reviews of which 250 are annotated as positive and 248 as negative. The average number of words for a review is 127, the average number of sentences is 8.
• Camp: Summer camp reviews acquired from CampRatingz.com. A significant number of these reviews were written by young people who attended the reviewed summer camps. The camp dataset consists of 804 reviews of which 402 are annotated as positive and 402 as negative. The average number of words for a review is 58, the average number of sentences is 4.


• Doctor: Reviews of physicians acquired from RateMDs.com. This dataset and the lawyer review set could both be considered part of the larger "rating of people" domain. The doctor dataset consists of 1478 reviews of which 739 are annotated as positive and 739 as negative. The average number of words for a review is 46, the average number of sentences is 4.
• Drug: Reviews of pharmaceutical drugs acquired from DrugsRatingz.com. The drug dataset consists of 802 reviews of which 401 are annotated as positive and 401 as negative. The average number of words for a review is 74, the average number of sentences is 5.
• Laptop: Laptop reviews acquired from Amazon.com. Various reviews about laptops from different manufacturers were collected. The laptop dataset consists of 176 reviews of which 88 are annotated as positive and 88 as negative. The average number of words for a review is 192, the average number of sentences is 13.
• Lawyer: Reviews of lawyers acquired from LawyerRatingz.com. The lawyer dataset consists of 220 reviews of which 110 are annotated as positive and 110 as negative. The average number of words for a review is 46, the average number of sentences is 4.
• Music: Musical CD reviews acquired from Amazon.com. The albums being reviewed were recently (at the time of publishing of the paper) released popular music from a wide variety of musical genres. The music dataset consists of 582 reviews of which 291 are annotated as positive and 291 as negative. The average number of words for a review is 144, the average number of sentences is 10.
• Radio: Reviews of radio shows acquired from RadioRatingz.com. This dataset and the TV dataset had the shortest reviews on average. The radio dataset consists of 1004 reviews of which 502 are annotated as positive and 502 as negative. The average number of words for a review is 34, the average number of sentences is 3.
• TV: Television show reviews acquired from TVRatingz.com. These reviews were typically very short and not very detailed. The TV dataset consists of 470 reviews of which 235 are annotated as positive and 235 as negative. The average number of words for a review is 27, the average number of sentences is 3.

The reviews in the review-plus-rating dataset are all complete reviews, consisting out of a single or multiple lines of text. 4.1.5 Multi-domain sentiment dataset The multi-domain sentiment dataset has been used in several papers Blitzer et al. (2007), Blitzer et al. (2008), Dredze et al. (2008), Mansour et al. (2009) as it proved to be a useful dataset for studying multi-domain sentiment focused specifically on product reviews. The multi-domain sentiment dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that are converted into binary labels if needed. For this research reviews with a rating > 3 were labeled positive, the reviews with a rating < 3 were labeled negative, the rest of the reviews have been discarded because their polarity seemed to be to ambiguous. The different sub-domains after preprocessing, have the following properties: • • 31 Apparel reviews: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 58, the average number of sentences is 4. Reviews of automobiles: 736 annotations of which 584 positive and 152 negative. The average

• • • • •

• • • • •

• • • • • • • • •

number of words for a review is 76, the average number of sentences is 5. Reviews of baby products: 1900 annotations of which 1000 positive and 900 negative. The average number of words for a review is 104, the average number of sentences is 6. Reviews of beauty products: 1493 annotations of which 1000 positive and 493 negative. The average number of words for a review is 87, the average number of sentences is 6. Reviews of books: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 158, the average number of sentences is 9. Reviews of camera and photo products: 1999 annotations of which 1000 positive and 999 negative. The average number of words for a review is 120, the average number of sentences 8. Reviews of cell phones and cell phone services: 1023 annotations of which 639 positive and 384 negative. The average number of words for a review is 101, the average number of sentences is 7. Reviews of computer and video games: 1458 annotations of which 1000 positive and 458 negative. The average number of words for a review is 182, the average number of sentences is 11. DVD reviews: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 171, the average number of sentences is 10. Electronics: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 102, the average number of sentences is 7. Gourmet food reviews: 1208 annotations of which 1000 positive and 208 negative. The average number of words for a review is 75, the average number of sentences is 5. Reviews of groceries: 1352 annotations of which 1000 positive and 352 negative. The average number of words for a review is 66, the average number of sentences is 5. Reviews of health and personal care products: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 82, the average number of sentences is 6. 
Jewelry and watches reviews: 1292 annotations of which 1000 positive and 292 negative. The average number of words for a review is 60, the average number of sentences is 4. Kitchen and housewares appliances reviews: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 86, the average number sentences is 6. Magazine reviews: 1970 annotations of which 1000 positive and 970 negative. The average number of words for a review is 111, the average number of sentences is 7. Musical Cd's reviews: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 126, the average number sentences is 8. Reviews of musical instruments: 332 annotations of which 284 positive and 48 negative. The average number of words for a review is 83, the average number of sentences is 6. Reviews of office products: 431 annotations of which 367 positive and 64 negative. The average number of words for a review is 91, the average number of sentences is 6. Reviews of outdoor living: 1327 annotations of which 1000 positive and 327 negative. The average number of words for a review is 82, the average number of sentences is 6. Software reviews: 1915 annotations of which 1000 positive and 915 negative. The average number of words for a review is 127, the average number of sentences is 8. Reviews of sports and outdoors products: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 94, the average number of sentences is 6.

32

• • •

Tools and hardware reviews: 112 annotations of which 98 positive and 14 negative. The average number of words for a review is 76, the average number of sentences is 5. Toys and games reviews: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 89, the average number of sentences is 6. Reviews of video products: 2000 annotations of which 1000 positive and 1000 negative. The average number of words for a review is 142, the average number of sentences is 9.

The reviews in the multi-domain sentiment dataset are complete reviews, consisting of one or more lines of text.

4.1.6 Twitter-sentiment-dataset
The twitter-sentiment-dataset has been made available by Go et al. (2009). Go et al. chose a Twitter corpus because of the magnitude of data available through the Twitter API, which makes it possible to collect millions of tweets in a relatively short time. This makes Twitter an interesting option, as previous research might consider thousands of training instances, but never millions. These millions of tweets are annotated automatically by using noisy labels in the form of the emoticons :-) and :-(. The preprocessing for the twitter-sentiment-dataset consisted of removing usernames, removing links, replacing repeated letters (hhhhhhuuuuuuunnnnngggggrrrryyyyy becomes hhuunnggrryy), and removing the labels (emoticons) from the tweet. Furthermore, tweets containing :-P are removed, as the Twitter API returns tweets containing :-P for the query :-( even though this emoticon does not express the same sentiment. Finally, tweets containing both a positive and a negative label (both :-( and :-)) are removed, as such a labeling is ambiguous. The dataset made available by the authors consists of 800,000 tweets with positive emoticons and 800,000 tweets with negative emoticons, resulting in a total of 1,600,000 training tweets. Each tweet is considered a separate document. Although limited to 140 characters, a tweet can consist of multiple sentences. The average number of words for a tweet is 13, the average number of sentences is 2.

4.1.7 Restaurant reviews dataset
The restaurant reviews dataset consists of restaurant reviews available from the website http://www.we8there.com . The dataset was constructed for previous sentiment analysis research by Higashinaka et al. (2006).
Each review in the restaurant reviews dataset is accompanied by a set of ranks, each on a scale of 1-5, covering food, ambiance, service, value, and overall experience. These ranks are provided by the consumers who wrote the original reviews. The corpus does not contain incomplete data points, since all reviews available on this website contain both a review text and the set of ranks. To create suitable annotations for this research, the average of the five ranks was computed for each review. An average rating < 2 was annotated as a negative review, an average rating > 3 as a positive review. This resulted in a dataset of 1417 reviews of which 1206 were positive and 211 negative. The average number of words for a review is 84, the average number of sentences is 6.
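This rank-to-label conversion can be sketched as follows. The treatment of reviews whose average falls between the two thresholds (returned as None and left out) is an assumption based on the thresholds above, and the function name is illustrative:

```python
def label_from_ranks(ranks):
    """Map the five 1-5 ranks of a we8there review to a binary label.

    Thresholds follow the text: average < 2 -> negative (0),
    average > 3 -> positive (1). Averages in between are treated
    as ambiguous and discarded (an assumption).
    """
    avg = sum(ranks) / len(ranks)
    if avg < 2:
        return 0
    if avg > 3:
        return 1
    return None  # ambiguous mid-range average; review not used

print(label_from_ranks([5, 4, 5, 4, 5]))  # 1
print(label_from_ranks([1, 1, 2, 1, 1]))  # 0
```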


For reference purposes, the datasets used in this research are listed below in Table 3.
Name                     Dataset                          Total size (negative/positive)
senpol                   Movie-review data                2000 (1000/1000)
senscale                 Movie-scale review data          3091 (1197/1894)
camera                   Review-plus-rating dataset       498 (248/250)
camp                     Review-plus-rating dataset       804 (402/402)
doctor                   Review-plus-rating dataset       1498 (739/739)
drug                     Review-plus-rating dataset       802 (401/401)
laptop                   Review-plus-rating dataset       176 (88/88)
lawyer                   Review-plus-rating dataset       220 (110/110)
music                    Review-plus-rating dataset       582 (291/291)
radio                    Review-plus-rating dataset       1004 (502/502)
tv                       Review-plus-rating dataset       470 (235/235)
apparel                  Multi-domain sentiment dataset   2000 (1000/1000)
automotive               Multi-domain sentiment dataset   736 (152/584)
baby                     Multi-domain sentiment dataset   1900 (900/1000)
beauty                   Multi-domain sentiment dataset   1493 (493/1000)
camera_&_photo           Multi-domain sentiment dataset   1999 (999/1000)
cell_phones_&_services   Multi-domain sentiment dataset   1023 (384/639)
computer_&_video_games   Multi-domain sentiment dataset   1458 (458/1000)
dvd                      Multi-domain sentiment dataset   2000 (1000/1000)
electronics              Multi-domain sentiment dataset   2000 (1000/1000)
gourmet_food             Multi-domain sentiment dataset   1208 (208/1000)
grocery                  Multi-domain sentiment dataset   1352 (352/1000)
health_&_personal_care   Multi-domain sentiment dataset   2000 (1000/1000)
jewelry_&_watches        Multi-domain sentiment dataset   1292 (292/1000)
kitchen_&_housewares     Multi-domain sentiment dataset   2000 (1000/1000)
magazines                Multi-domain sentiment dataset   1970 (970/1000)
music2                   Multi-domain sentiment dataset   2000 (1000/1000)
musical_instruments      Multi-domain sentiment dataset   332 (48/284)
office_products          Multi-domain sentiment dataset   431 (64/367)
outdoor_living           Multi-domain sentiment dataset   1327 (327/1000)
software                 Multi-domain sentiment dataset   1915 (915/1000)
sports_&_outdoors        Multi-domain sentiment dataset   2000 (1000/1000)
tools_&_hardware         Multi-domain sentiment dataset   112 (14/98)
toys_&_games             Multi-domain sentiment dataset   2000 (1000/1000)
video                    Multi-domain sentiment dataset   2000 (1000/1000)
books                    Multi-domain sentiment dataset   2000 (1000/1000)
twitter_stanford         Twitter-sentiment dataset        1600000 (800000/800000)
restaurant               Restaurant reviews dataset       1417 (211/1206)

Table 3: Overview of the datasets used in this research, listing the name of the subset, the dataset the subset originated from, and the total size (negative/positive).

4.2 Data Processing
The collected corpora for this research originate from a number of different datasets. In order to use all available data for both training and evaluation of the proposed classifiers, the data needed to be preprocessed. With all data available in a preprocessed form it became possible to easily and quickly set up experiments evaluating different approaches. An example of a typical data-entry, to which all used data sources were converted, is given below:

"1","My daughter loves this activity center. I wish they would change two things about it. I wish we could change out the toys and it folded up"

In this example, the first comma-separated value denotes the derived annotation, where "0" stands for a negative label and "1" for a positive label. In this research binary classification is used, so all data examples are converted to either a negative or a positive example. Datasets holding more classification detail (e.g. a scale of 1-5) were converted appropriately (if applicable) or ignored (if, for example, a mixed sentiment was expressed). The second comma-separated value consists of a string of text of variable length, holding any number of sentences. In the case of the stanford (Twitter) dataset the string of text is rather short (as the maximum allowed length of a tweet is 140 characters) and can even be a single word. Other examples can consist of a whole paragraph of text, for instance a complete, elaborate review of a restaurant.

This research stays as close as possible to using the datasets available from previous research in their originally intended form. It might be argued that splitting each string of text up into individual sentences and annotating these could be beneficial. This is, for example, done for the movie-scale review dataset (senscale), where each sentence is handled as a separate document (a separately labeled example). If this were done for the rest of the datasets as well, many of the datasets would have to be manually re-annotated. The fact that some of the longer documents in the dataset only hold a single annotation for the whole document can be problematic.
For example, the movie-review dataset contains paragraphs of text discussing negative points about a movie, ending in the statement that overall the author had to admit liking the movie. Splitting such a text into sentences and annotating them individually would lead to a large number of negatively labeled sentences and only a few positively labeled ones; instead, the movie-review dataset provides a single annotation: a positive one. In order to be able to compare performance in this research to previous research, the decision is made to keep the annotations intact and use them as faithfully as possible to the original research. To normalize all examples used in this research the following normalization steps are performed:

1. All text is lowercased.
2. The characters ~`!@#$%^&*()_+-={}[]:;\"\\|,.<>/?0123456789 are removed.
3. Return characters, new line characters and tab characters are removed.
4. Multiple occurrences of spaces are normalized to a single space.
5. Spaces before or after a sentence are removed.
6. Triples (or more) of letters are normalized to doubles. In the Twitter dataset, for example, words like "huuuunggggryyyyy" occur; these are normalized to "huunggryy".
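The six normalization steps can be implemented as a small Python sketch. The function name, constant name, and exact order of operations are illustrative choices, not details taken from the thesis code:

```python
import re

# Characters removed in step 2 (punctuation and digits).
REMOVE = "~`!@#$%^&*()_+-={}[]:;\"\\|,.<>/?0123456789"

def normalize(text):
    text = text.lower()                                   # step 1: lowercase
    text = text.translate(str.maketrans("", "", REMOVE))  # step 2: strip punctuation/digits
    text = text.replace("\r", " ").replace("\n", " ").replace("\t", " ")  # step 3
    text = re.sub(r"(\w)\1\1+", r"\1\1", text)            # step 6: 3+ repeats -> double
    return " ".join(text.split())                         # steps 4-5: collapse/trim spaces

print(normalize("My  Daughter LOVES\tthis!!  huuuunggggryyyyy 123"))
# -> "my daughter loves this huunggryy"
```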

While the data is processed during training of the classifier, each example is split up into individual words. The words are combined into the appropriate n-grams and stored in a sentiment vocabulary, one vocabulary for each sentiment class. While training, the classifier keeps track of the size of the vocabulary, the relative frequencies and the total corpus size. In the case of the POS-pattern feature-set, the POS-tag patterns are extracted using the Brill tagger; the patterns containing adverbs and adjectives are extracted and the same procedure is followed as for n-grams. After processing all the appointed data, the necessary calculations are done to compute the multinomial Naive Bayes probabilities initializing the Naive Bayes model.
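A minimal sketch of this training step for n-gram features; the add-one smoothing and all function names are assumptions for illustration, not details taken from the thesis code:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_multinomial_nb(docs, labels, n=1):
    """Build one feature vocabulary per sentiment class and derive
    add-one-smoothed multinomial Naive Bayes log-probabilities."""
    vocab_by_class, priors = {}, Counter(labels)
    for doc, label in zip(docs, labels):
        counts = vocab_by_class.setdefault(label, Counter())
        counts.update(ngrams(doc.split(), n))
    vocabulary = set().union(*vocab_by_class.values())
    model = {}
    for label, counts in vocab_by_class.items():
        total = sum(counts.values())
        model[label] = {
            "logprior": math.log(priors[label] / len(labels)),
            "loglik": {w: math.log((counts[w] + 1) / (total + len(vocabulary)))
                       for w in vocabulary},
        }
    return model

def classify(model, doc, n=1):
    # Unseen features are simply ignored (contribute 0 to the log-score).
    scores = {label: p["logprior"] + sum(p["loglik"].get(w, 0.0)
                                         for w in ngrams(doc.split(), n))
              for label, p in model.items()}
    return max(scores, key=scores.get)

m = train_multinomial_nb(["good great fun", "bad awful"], [1, 0])
print(classify(m, "great fun"))  # 1
```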

4.3 Experiments
For this research three different experiments are set up in order to properly evaluate and relate the cross-domain approach suggested in this thesis. In this section the different experiments are explained. The conducted experiments are:

1. Baseline in-domain experiment: training and classifying within the same domain.
2. Baseline cross-domain experiment: training and classification cross-domain.
3. Cross-domain experiment using (different variations of) generalizable features.

For all the experiments the average accuracy will be reported, in order to compare results to previous work where only accuracies have been reported. Next to the average accuracy the F-measure metric is reported, which is equal to the harmonic mean of recall ρ and precision π. For a class i, π_i and ρ_i are defined as follows (Yang and Liu, 1996):

π_i = TP_i / (TP_i + FP_i),    ρ_i = TP_i / (TP_i + FN_i)

Formula 18: definitions of recall and precision.

In Formula 18, TP_i (True Positives) is the number of documents correctly classified as class i. FP_i (False Positives) is the number of documents that do not belong to class i but are incorrectly classified as belonging to class i. FN_i (False Negatives) is the number of documents that are not classified as class i by the classifier but actually belong to class i. The F-measure values lie in the range between 0 and 1; larger F-measure values correspond to higher classification quality. The overall F-measure score for the entire classification task is computed by two different types of averages: micro-average and macro-average (Yang and Liu, 1996).

Micro-averaged F-measure
In micro-averaging the F-measure is computed globally over all category classifications. π and ρ are obtained by summing over all individual classifications:

π = TP / (TP + FP) = Σ_{i=1}^{M} TP_i / Σ_{i=1}^{M} (TP_i + FP_i)

ρ = TP / (TP + FN) = Σ_{i=1}^{M} TP_i / Σ_{i=1}^{M} (TP_i + FN_i)
Formula 19: Definitions of precision and recall in terms of micro-averaged F-measure

In Formula 19, M is the number of evaluation-sets. The micro-averaged F-measure is computed as:

F(micro-averaged) = (2 π ρ) / (π + ρ)

Formula 20: micro-averaged F-measure
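Assuming per-evaluation-set counts of true positives, false positives and false negatives, both averaging schemes (the micro-average just defined and the macro-average introduced below) can be computed as in this sketch:

```python
def micro_macro_f1(per_set_counts):
    """per_set_counts: list of (TP, FP, FN) tuples, one per evaluation-set.
    Returns (micro F1, macro F1); zero denominators are not handled here."""
    tp = sum(c[0] for c in per_set_counts)
    fp = sum(c[1] for c in per_set_counts)
    fn = sum(c[2] for c in per_set_counts)
    # Micro: pool the counts over all M evaluation-sets, then compute F.
    micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
    micro = 2 * micro_p * micro_r / (micro_p + micro_r)
    # Macro: compute F per evaluation-set, then average the F values.
    f_scores = []
    for tp_i, fp_i, fn_i in per_set_counts:
        p, r = tp_i / (tp_i + fp_i), tp_i / (tp_i + fn_i)
        f_scores.append(2 * p * r / (p + r))
    macro = sum(f_scores) / len(f_scores)
    return micro, macro

print(micro_macro_f1([(8, 2, 2), (1, 1, 1)]))
```

Note how the small second set pulls the macro average down more strongly than the micro average, reflecting the sensitivity of macro-averaging to rare categories.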


The micro-averaged F-measure gives equal weight to each document and can therefore be considered an average over all document/category pairs. It tends to be dominated by the classifier's performance on common categories.

Macro-averaged F-measure
In macro-averaging, the F-measure is computed locally over each evaluation-set first and then averaged over all evaluation-sets. ρ and π are computed for each evaluation-set as in Formula 18. The F-measure for each evaluation-set i is computed as:

F_i = (2 π_i ρ_i) / (π_i + ρ_i)

and the macro-averaged F-measure is obtained by taking the average of the F-measure values over all evaluation-sets:

F(macro-averaged) = Σ_{i=1}^{M} F_i / M

Formula 21: Macro-averaged F-measure

In Formula 21, M is the total number of evaluation-sets. The macro-average is influenced more by the classifier's performance on rare categories. For the experiments both measurements are provided to gain insight. In the following sections the different experiments are discussed in more detail.

4.3.1 Baseline in-domain experiment
The baseline in-domain experiment is set up to evaluate the general approach taken in this research. As this research uses a wide variety of datasets which are all preprocessed in the same general way, it makes sense to first evaluate whether approaching all data in this general way can be done without losing performance. For this purpose both the Naive Bayes and the subjectivity lexicon classifier are used to perform in-domain classification. For the in-domain experiment, training (in the case of the Naive Bayes classifier) and evaluation are done within the same domain. For these experiments it makes sense to repeat approaches from previous research (Pang and Lee, 2002; Blitzer et al., 2007): for a number of the datasets used in this research extensive in-domain research has been done, which enables comparing the in-domain performance on those datasets against the performance reported in previous work. The differences between the two will be evaluated; no dramatic performance differences should be observed. As is usual in previous research (Pang and Lee, 2002; Blitzer et al., 2007) the experiments consist of 10-fold cross-validations, with training and test-sets chosen within the same domain. Experimentation showed some performance problems in the calculation of the priors used in the Naive Bayes classifier, caused by the ordering of the examples in some of the datasets. Therefore, the choice is made to perform an additional randomization of the datasets before the 10-fold sets are created.
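The randomized 10-fold construction described above can be sketched as follows; the function name and the interleaved fold assignment are illustrative choices, not taken from the thesis code:

```python
import random

def ten_fold_sets(examples, seed=0):
    """Shuffle before splitting so that ordered datasets do not skew
    the per-fold class priors, then yield (train, test) pairs."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)           # break any ordering in the source data
    folds = [shuffled[i::10] for i in range(10)]    # 10 roughly equal folds
    for i in range(10):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(ten_fold_sets(list(range(100))))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```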
For each domain two classifiers are experimented with: the Naive Bayes classifier and the subjectivity lexicon classifier. The Naive Bayes classifier is tested with four different feature-sets: 1 gram features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features. This results in a total of five sets of performance measures for each domain. For each classifier and feature-set the average accuracy is reported, together with the standard deviation. The standard deviation is reported to gain insight into the stability of the classifiers over the different datasets. Next to

the reported average accuracy, the micro F1 score and the macro F1 score are reported in order to evaluate the influence of the balance in the training-set (the number of positive vs the number of negative examples available). In the next chapter confusion matrices for a couple of representative domains are provided to gain more insight into the performance. In addition, more detailed results showing all 10 results of the 10-fold cross-validations are reported to give a more detailed overview of the performance. In Chapter 5, section 1 a table with all results will be presented; in Chapter 6, section 1 the results will be discussed for both classifiers and all feature-sets, together with a more detailed analysis of the used features.

4.3.2 Baseline cross-domain experiment
In the second set of experiments a baseline is established for cross-domain sentiment classification. To evaluate the cross-domain performance of the Naive Bayes classifier, the most straightforward and naive approach towards cross-domain classification is chosen: the Leave One Out approach (LOO) (Blitzer et al., 2007). For each (target) domain, classifications are made with all other individual in-domain trained classifiers (using all available data for those domains). This results in a set of accuracies, micro F1 and macro F1 scores: one for each used in-domain trained classifier. These scores are averaged over all used classifiers, and the average accuracy over all available in-domain trained classifiers is reported. The performance of the subjectivity lexicon classifier is not evaluated in this experiment, as the results for cross-domain analysis of the subjectivity lexicon classifier are identical to the in-domain results. As for the in-domain experiment, four different feature-sets are used: 1 gram features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features.
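The LOO setup can be sketched as follows; `train` and `evaluate` are placeholders for an in-domain training routine and a scoring routine, not functions from the thesis code:

```python
def leave_one_out_scores(domains, train, evaluate):
    """For each target domain, average the scores obtained by classifiers
    trained on every OTHER domain (the LOO setup described above).

    train(domain)           -> a trained in-domain classifier
    evaluate(clf, domain)   -> a score (e.g. accuracy) on that domain
    """
    models = {d: train(d) for d in domains}   # one in-domain model per domain
    results = {}
    for target in domains:
        scores = [evaluate(models[source], target)
                  for source in domains if source != target]
        results[target] = sum(scores) / len(scores)
    return results

# Dummy usage: every cross-domain evaluation returns 0.5.
print(leave_one_out_scores(["books", "dvd", "music"],
                           lambda d: d,
                           lambda clf, t: 0.5))
```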
For each feature-set the average accuracy together with the standard deviation is reported, in order to compare and gain insight into the stability of the classifier over the different domains. As in the first experiment, micro and macro F1 scores are reported. To provide more detail, for a number of representative domains the cross-domain performance of the different in-domain trained classifiers is given in terms of accuracy and micro and macro F1 scores. As for the previous experiment, the general setup using a wide variety of datasets can be evaluated by comparing subsets of the results to related cross-domain research (Blitzer et al., 2007). The comparison to related research is somewhat limited, as no other research has combined all these different available datasets before; where possible, subsets of the results will be compared to previous work. The main insight from these results will be the comparison to the performance observed in the in-domain experiment, from which the difference in performance from in-domain to cross-domain classification can be established. This difference in performance can be related to the two other experiments described in the next two sections. In the second section of Chapter 5 a table containing all results for the naive cross-domain approach will be presented. In Chapter 6 the results for the different feature-sets will be discussed.

4.3.3 Cross-domain experiment using different generalizable features
As a third set of experiments the different approaches concerning generalizable features are explored. This experiment contains three sub-experiments. First, the approach proposed by Tan et al. (2009) is followed, using FCE to select generalizable features.
For each available domain a domain-adaptation is performed, calculating generalizable features between source and target domain and fitting the domain using the EM algorithm as described in the previous chapter. For every domain considered, all the other available domains are individually adapted using all data from the source domain and the data-distribution of the target domain. The results for all the different domain-adaptations are averaged, resulting in a single set of performance measures for each domain. The performance measures include the average

accuracy, micro F1 and macro F1 scores. The experiments are performed using the adapted Naive Bayes classifier, with the EM algorithm fitting source and target domain. As before, different feature-sets are considered: 1 gram features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features. The experiments are evaluated against the performance of the previously discussed naive approach towards cross-domain classification. Further, the results are compared to the results reported by Tan et al. (2009) on Chinese datasets. As will be shown, this comparison turns out differently than expected. To gain a better understanding of the differences, examples will be provided showing the acquired generalizable features and the performance results for a few representative domain-adaptations.

As a second approach in this set of experiments, the selection of the generalizable features is explored. The first considered alternative towards acquiring better generalizable features is to use the subjectivity lexicon from the previous experiments instead of generalizable features selected by FCE. So, instead of selecting generalizable features based on FCE, the subjectivity lexicon is directly used to attempt a domain-adaptation. With these generalizable features, again the adapted Naive Bayes classifier using EM is used to fit source and target domain. Consequently the same approach is followed as for FCE-selected generalizable features, performing the same domain-adaptations and reporting the same performance measures. The results can be compared to the approach using FCE to select generalizable features and to naive cross-domain classification.

Third, as a second improvement, an importance measure is introduced to be used in combination with FCE. The importance measure is used to optimize the generalizable features acquired through FCE.
As before, with this set of improved generalizable features the adapted Naive Bayes classifier using EM is used to fit source and target domain. The rest of the approach follows the previous two sub-experiments, and the results are directly compared to them. In addition, for a number of representative domains the improvements in the used generalizable features are presented. In the third section of Chapter 5 the results for the three aforementioned experiments will be presented in the form of three tables, one for each sub-experiment with its different generalizable features. In the third section of Chapter 6 the results of the different sub-experiments will be discussed, providing an analysis of the different generalizable features used and performance results for a few representative domain-adaptations.
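The intuition behind FCE — favoring features that are frequent in both domains and occur with similar relative frequency, so they can bridge the domains — can be sketched as below. Note that the scoring function here is a simplified stand-in for illustration, not the published FCE definition of Tan et al. (2009), and all names are assumptions:

```python
import math

def generalizable_features(source_freq, target_freq, alpha=1e-4, k=100):
    """Rank features that occur in BOTH domains, scoring high when both
    relative frequencies are large and their gap is small.

    source_freq/target_freq: dicts mapping feature -> relative frequency
    (assumed non-zero for listed features); alpha smooths the gap term.
    """
    shared = set(source_freq) & set(target_freq)

    def score(w):
        ps, pt = source_freq[w], target_freq[w]
        return math.log(ps * pt / (abs(ps - pt) + alpha))

    return sorted(shared, key=score, reverse=True)[:k]

# Toy usage: "good" is frequent and stable across domains, so it ranks first.
src = {"good": 0.05, "excellent": 0.02, "camera": 0.04}
tgt = {"good": 0.05, "excellent": 0.019, "plot": 0.03}
print(generalizable_features(src, tgt, k=2))
```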


Chapter 5 Results
In this chapter the results for the three conducted experiments are presented. First the results for the in-domain baseline experiments are presented, next the results for the naive cross-domain baseline experiments. The third section presents the results for the three variations of using generalizable features for domain-adaptation.

5.1 In-domain baseline experiments
The first experiment performed in this research is the evaluation of the in-domain performance of the standard Naive Bayes classifier. This evaluation sets a baseline for in-domain performance with the general setup used throughout the experiments conducted in this thesis. The general setup consistently uses the data from the 38 different datasets. In order to evaluate this general setup, the Naive Bayes classifier is used so that the results observed in this experiment can be related to results reported in previous in-domain sentiment analysis work. The results for the in-domain experiments are presented in Table 4.


Dataset               1 gram                2 gram                1 gram + POS          1 gram + 2 gram       Subjectivity lexicon
(each cell: accuracy / micro F1 / macro F1)
kitchen&housewares    0.7730/0.7913/0.7908  0.7712/0.7993/0.8005  0.7630/0.7801/0.7803  0.7392/0.7844/0.7853  0.5746/0.6508/0.6501
toys_&_games          0.7695/0.8014/0.8009  0.7535/0.7948/0.7952  0.7543/0.7855/0.7855  0.7274/0.7794/0.7797  0.5750/0.6709/0.6698
sports_&_outdoors     0.7873/0.7916/0.7911  0.7852/0.7933/0.7930  0.7826/0.7821/0.7815  0.7774/0.7776/0.7770  0.5821/0.6566/0.6556
electronics           0.7692/0.7841/0.7837  0.7788/0.7875/0.7873  0.7727/0.7745/0.7748  0.7561/0.7793/0.7793  0.5704/0.6540/0.6535
beauty                0.7821/0.8609/0.8605  0.8105/0.8635/0.8625  0.7777/0.8570/0.8565  0.8027/0.8533/0.8524  0.6128/0.7381/0.7365
office_products       0.8780/0.9209/0.9204  0.7021/0.8235/0.8236  0.8720/0.9155/0.9150  0.6441/0.7680/0.7669  0.6850/0.8059/0.8048
automotive            0.7983/0.8909/0.8904  0.7732/0.8513/0.8485  0.7930/0.8846/0.8838  0.6971/0.7891/0.7869  0.6877/0.8088/0.8080
laptop                0.7438/0.7753/0.7760  0.6774/0.6905/0.6990  0.7062/0.7209/0.7187  0.6289/0.6093/0.6225  0.5643/0.6455/0.6347
jewelry&watches       0.8049/0.8829/0.8823  0.8265/0.8926/0.8915  0.7965/0.8814/0.8808  0.7992/0.8738/0.8730  0.7233/0.8300/0.8289
stanford              0.6774/0.7324/0.7324  0.6790/0.7113/0.7084  0.6738/0.7082/0.7055  0.6787/0.7113/0.7084  0.5614/0.6620/0.6620
senpol                0.7880/0.7633/0.7621  0.8123/0.8053/0.8052  0.7779/0.7717/0.7716  0.7819/0.7584/0.7565  0.4963/0.6628/0.6612
lawyer                0.8093/0.8367/0.8379  0.7127/0.7895/0.7886  0.6024/0.7138/0.7127  0.6650/0.7581/0.7554  0.6245/0.6992/0.6957
baby                  0.7622/0.7959/0.7961  0.7647/0.8077/0.8071  0.7497/0.7923/0.7915  0.7615/0.8031/0.8027  0.5827/0.6839/0.6824
tools&hardware        0.9649/0.9744/0.9741  0.9901/0.9787/0.9778  0.4693/0.5816/0.5590  0.8705/0.9333/0.9309  0.7197/0.8449/0.8416
music2                0.7599/0.7339/0.7336  0.7439/0.7560/0.7557  0.7400/0.7365/0.7358  0.7182/0.7026/0.7012  0.5542/0.6550/0.6543
computer&videogames   0.8489/0.8948/0.8943  0.9179/0.9325/0.9322  0.8426/0.8927/0.8919  0.8805/0.9046/0.9043  0.6569/0.7576/0.7565
senscale              0.8644/0.8870/0.8869  0.8233/0.8633/0.8632  0.8272/0.8644/0.8643  0.8162/0.8589/0.8588  0.6053/0.7497/0.7496
cell phones&service   0.7767/0.8312/0.8292  0.7991/0.8437/0.8419  0.7768/0.8259/0.8229  0.7904/0.8261/0.8250  0.6244/0.7292/0.7279
musical_instruments   0.8559/0.9238/0.9228  0.7388/0.8440/0.8425  0.8319/0.9122/0.9113  0.6253/0.7741/0.7705  0.7803/0.8540/0.8510
grocery               0.7911/0.8793/0.8788  0.7931/0.8689/0.8686  0.7876/0.8797/0.8791  0.7644/0.8491/0.8488  0.7062/0.8125/0.8117
camera_&_photo        0.8153/0.8274/0.8245  0.8441/0.8552/0.8552  0.7957/0.8175/0.8173  0.8391/0.8484/0.8477  0.5547/0.6509/0.6503
doctor                0.8092/0.8345/0.8337  0.7556/0.7911/0.7912  0.7560/0.7917/0.7917  0.7295/0.7705/0.7702  0.5823/0.6831/0.6825
tv                    0.7437/0.7689/0.7673  0.6719/0.6969/0.6946  0.6517/0.6986/0.6977  0.6512/0.6869/0.6851  0.6154/0.6817/0.6795
dvd                   0.7229/0.7001/0.6971  0.7168/0.7309/0.7299  0.7103/0.7103/0.7082  0.7008/0.7035/0.6998  0.5469/0.6539/0.6538
camp                  0.7967/0.7948/0.7930  0.7953/0.7851/0.7820  0.8053/0.7920/0.7922  0.7844/0.7451/0.7405  0.5841/0.6926/0.6925
drug                  0.7135/0.7357/0.7335  0.6530/0.6852/0.6844  0.6947/0.7099/0.7086  0.6416/0.6793/0.6792  0.5725/0.6018/0.5996
magazines             0.8612/0.8635/0.8631  0.8999/0.9055/0.9051  0.8543/0.8631/0.8636  0.8901/0.8924/0.8921  0.5493/0.6439/0.6430
video                 0.7760/0.7416/0.7419  0.7658/0.7611/0.7600  0.7550/0.7274/0.7253  0.7294/0.6874/0.6845  0.5244/0.6300/0.6297
health&personal_care  0.7875/0.7918/0.7908  0.7862/0.7917/0.7914  0.7715/0.7787/0.7789  0.7773/0.7787/0.7779  0.5754/0.6566/0.6561
camera                0.7481/0.7774/0.7673  0.7433/0.7594/0.7561  0.7407/0.7601/0.7524  0.7258/0.7666/0.7608  0.5877/0.6528/0.6459
music                 0.6959/0.7237/0.7227  0.6613/0.6964/0.6889  0.6587/0.6809/0.6730  0.6547/0.6993/0.6960  0.5643/0.6487/0.6444
apparel               0.7676/0.8026/0.8003  0.7591/0.7966/0.7948  0.7721/0.7917/0.7904  0.7379/0.7838/0.7820  0.5814/0.6711/0.6699
radio                 0.6872/0.7349/0.7336  0.6335/0.7100/0.7098  0.5987/0.6856/0.6863  0.6115/0.7023/0.7017  0.5585/0.6330/0.6306
outdoor_living        0.7914/0.8771/0.8767  0.8081/0.8827/0.8823  0.7917/0.8763/0.8760  0.7934/0.8706/0.8702  0.6773/0.7896/0.7889
software              0.8019/0.8107/0.8107  0.8190/0.8411/0.8407  0.7912/0.8053/0.8053  0.8067/0.8336/0.8333  0.5555/0.6602/0.6599
books                 0.7641/0.7579/0.7572  0.7499/0.7630/0.7626  0.7416/0.7422/0.7418  0.7294/0.7612/0.7601  0.5636/0.6563/0.6539
gourmet_food          0.8473/0.9130/0.9127  0.8066/0.8771/0.8763  0.8434/0.9116/0.9112  0.7716/0.8527/0.8514  0.7175/0.8238/0.8233
restaurant            0.8912/0.9348/0.9347  0.8873/0.9339/0.9338  0.8881/0.9335/0.9334  0.8711/0.9198/0.9194  0.7518/0.8454/0.8451

Table 4: Accuracies, micro F1 and macro F1 scores for the in-domain experiment.

The best performing feature-set for the in-domain experiment with the Naive Bayes classifier is the 1 gram feature-set, with an average accuracy of 0.7901. The combined feature-sets (1 gram + POS-pattern and 1 gram + 2 gram) both have lower accuracy scores than the single feature-sets (1 gram and 2 gram). The standard deviation over all the different feature-sets is relatively low; the highest standard deviation over the accuracies, 0.0817, is reported for the combined 1 gram + POS-pattern feature-set. This low standard deviation implies that the performances are within a close range, making the Naive Bayes classifier a stable classifier for this experiment. The reported micro F1 scores range from 0.8195 for the 1 gram feature-set to 0.7862 for the combined 1 gram + 2 gram feature-set. The reported macro F1 scores range from 0.8186 for the 1 gram feature-set to 0.7852 for the combined 1 gram + 2 gram feature-set. The differences between the reported micro and macro F1 scores are very small, showing that performances for imbalanced and balanced evaluation-sets are very similar.

As an alternative classifier the subjectivity lexicon classifier is analyzed. The average accuracy for this classifier is 0.6092, which is considerably lower than the results reported for the different feature-sets used with the Naive Bayes classifier. The average reported micro F1 and macro F1 scores are 0.7065 and 0.7049, also lower than the earlier results. The difference between the micro and macro F1 scores is smaller than for the Naive Bayes classifier, showing even more stability with respect to differences between balanced and imbalanced evaluation-sets. The standard deviation over the accuracies is 0.0684, which is slightly higher than for the 1 gram Naive Bayes classifier but lower than the standard deviations reported for the other Naive Bayes feature-sets. This makes the subjectivity lexicon classifier also a stable classifier for this task.

5.2 Cross-domain baseline experiments
In this section the results for the naive cross-domain approach are presented. This experiment can be compared to previous work to evaluate the performance of the general setup, and can also be used to evaluate the performance difference of the Naive Bayes classifier from in-domain to naive cross-domain application for the different datasets. The results are presented in Table 5.


Dataset               1 gram                2 gram                1 gram + POS          1 gram + 2 gram
(each cell: accuracy / micro F1 / macro F1)
kitchen&housewares    0.6480/0.6747/0.6732  0.6459/0.6703/0.6691  0.6320/0.6473/0.6457  0.6346/0.6415/0.6401
toys_&_games          0.6390/0.6817/0.6808  0.6397/0.6713/0.6705  0.6264/0.6574/0.6565  0.6258/0.6357/0.6348
sports_&_outdoors     0.6430/0.6638/0.6626  0.6482/0.6680/0.6668  0.6303/0.6392/0.6380  0.6306/0.6333/0.6322
electronics           0.6465/0.6579/0.6571  0.6390/0.6543/0.6535  0.6252/0.6217/0.6209  0.6263/0.6214/0.6203
beauty                0.6702/0.7378/0.7368  0.6739/0.7465/0.7453  0.6490/0.7076/0.7066  0.6447/0.7071/0.7058
office_products       0.6952/0.7900/0.7887  0.6912/0.7857/0.7836  0.6653/0.7592/0.7575  0.6525/0.7456/0.7433
automotive            0.6624/0.7510/0.7490  0.6769/0.7666/0.7652  0.6352/0.7196/0.7177  0.6368/0.7243/0.7225
laptop                0.6447/0.6302/0.6150  0.6451/0.6449/0.6251  0.6291/0.5965/0.5805  0.6246/0.5960/0.5800
jewelry&watches       0.7283/0.8092/0.8082  0.7289/0.8106/0.8096  0.7018/0.7813/0.7800  0.6963/0.7782/0.7769
stanford              0.5449/0.5792/0.5791  0.5342/0.6160/0.6160  0.5326/0.5520/0.5520  0.5346/0.5770/0.5770
senpol                0.5849/0.5465/0.5447  0.5526/0.4491/0.4481  0.5768/0.5093/0.5077  0.5325/0.3725/0.3715
lawyer                0.6545/0.6854/0.6779  0.6460/0.6671/0.6627  0.6383/0.6469/0.6407  0.6274/0.6345/0.6303
baby                  0.6354/0.6702/0.6690  0.6364/0.6695/0.6684  0.6220/0.6471/0.6459  0.6211/0.6324/0.6314
tools&hardware        0.6566/0.7609/0.7549  0.7170/0.8117/0.8057  0.6380/0.7437/0.7369  0.6385/0.7427/0.7333
music2                0.6121/0.6558/0.6553  0.5825/0.6144/0.6137  0.5961/0.6257/0.6251  0.5719/0.5719/0.5712
computer&videogames   0.6648/0.7352/0.7344  0.6564/0.7223/0.7214  0.6462/0.7100/0.7093  0.6104/0.6578/0.6569
senscale              0.5964/0.6065/0.6061  0.5523/0.5052/0.5048  0.5736/0.5631/0.5627  0.5165/0.4218/0.4215
cell phones&service   0.6461/0.6862/0.6836  0.6624/0.7160/0.7142  0.6227/0.6525/0.6497  0.6335/0.6740/0.6721
musical_instruments   0.7196/0.8106/0.8075  0.7108/0.8084/0.8045  0.6918/0.7829/0.7789  0.6675/0.7672/0.7632
grocery               0.6981/0.7786/0.7778  0.6934/0.7778/0.7769  0.6669/0.7432/0.7423  0.6651/0.7457/0.7447
camera_&_photo        0.6540/0.6810/0.6797  0.6501/0.6690/0.6680  0.6368/0.6501/0.6489  0.6296/0.6290/0.6278
doctor                0.6694/0.7048/0.7039  0.6532/0.6889/0.6878  0.6386/0.6531/0.6519  0.6434/0.6667/0.6655
tv                    0.6351/0.6790/0.6756  0.6111/0.6615/0.6575  0.6046/0.6363/0.6329  0.6087/0.6415/0.6374
dvd                   0.6255/0.6562/0.6558  0.5990/0.6132/0.6128  0.6128/0.6310/0.6305  0.5834/0.5683/0.5678
camp                  0.6451/0.6898/0.6888  0.6375/0.6868/0.6863  0.6277/0.6610/0.6597  0.6139/0.6447/0.6434
drug                  0.5735/0.5656/0.5644  0.5687/0.5837/0.5822  0.5604/0.5368/0.5354  0.5656/0.5595/0.5578
magazines             0.6427/0.6825/0.6817  0.6205/0.6451/0.6441  0.6271/0.6543/0.6534  0.6002/0.6009/0.5999
video                 0.6155/0.6471/0.6462  0.5911/0.6059/0.6048  0.6000/0.6158/0.6150  0.5807/0.5647/0.5636
health&personal_care  0.6331/0.6548/0.6531  0.6421/0.6669/0.6658  0.6197/0.6281/0.6265  0.6296/0.6385/0.6372
camera                0.6516/0.6977/0.6923  0.6539/0.6877/0.6821  0.6399/0.6727/0.6674  0.6408/0.6546/0.6470
music                 0.6018/0.6543/0.6512  0.6152/0.6576/0.6545  0.5879/0.6250/0.6215  0.5972/0.6109/0.6074
apparel               0.6473/0.6813/0.6792  0.6514/0.6821/0.6798  0.6299/0.6512/0.6490  0.6386/0.6562/0.6540
radio                 0.5978/0.6369/0.6345  0.5868/0.6302/0.6283  0.5743/0.5926/0.5904  0.5814/0.6049/0.6029
outdoor_living        0.6923/0.7715/0.7706  0.6838/0.7659/0.7651  0.6617/0.7393/0.7384  0.6487/0.7261/0.7253
software              0.6680/0.6902/0.6898  0.6577/0.6813/0.6808  0.6521/0.6614/0.6609  0.6426/0.6477/0.6471
books                 0.6068/0.6451/0.6428  0.5909/0.5933/0.5910  0.5990/0.6213/0.6193  0.5713/0.5426/0.5400
gourmet_food          0.6987/0.7906/0.7898  0.6973/0.7918/0.7910  0.6603/0.7513/0.7505  0.6517/0.7476/0.7465
restaurant            0.7624/0.8416/0.8412  0.7146/0.8028/0.8025  0.7058/0.7915/0.7910  0.6598/0.7517/0.7513
average               0.6477/0.6916/0.6895  0.6410/0.6813/0.6792  0.6273/0.6600/0.6578  0.6178/0.6404/0.6382
standard deviation    0.0431/0.0683/0.0686  0.0481/0.0819/0.0819  0.0367/0.0683/0.0685  0.0393/0.0862/0.0860

Table 5: Results of the naive cross-domain experiments.

The best performing feature-set in terms of accuracy for the naive cross-domain approach is the 1 gram feature-set with an accuracy of 0.6477. The reported accuracy for the 2 gram feature-set is slightly lower at 0.6410. The combined feature-sets perform worse, with accuracies of 0.6273 and 0.6178. The standard deviation over the accuracies is slightly lower than in the in-domain experiment, with a highest reported standard deviation of 0.0481 for the 2 gram feature-set. The highest reported micro F1 and macro F1 scores are also observed for the 1 gram feature-set, with considerably lower scores for the combined feature-sets. The differences between the reported micro and macro F1 scores for the different feature-sets are slightly larger than for in-domain performance. As discussed for the in-domain experiment, the performance of the subjectivity lexicon classifier is unchanged in a cross-domain setting, and it still performs worse than the naive cross-domain approach. Compared to in-domain performance all feature-sets decrease in performance; the 1 gram feature-set decreases the most, with a difference of 0.142, and all feature-sets show decreases of more than 0.128. For the micro F1 and macro F1 scores the same decrease is observed: all scores drop from in-domain to cross-domain by more than 0.128.

5.3.1 Adapted Naive Bayes classifier using FCE

The first experiment performing domain-adaptation between domains uses the adapted Naive Bayes classifier with FCE-selected features: ANB (FCE). As explained in the previous chapter, FCE is used to select generalizable features. The results for these experiments are presented in this section, starting with Table 6.


Dataset | 1 gram: ac micro macro | 2 gram: ac micro macro | 1 gram + POS: ac micro macro | 1 gram + 2 gram: ac micro macro
kitchen&housewares 0.4475 0.2836 0.2816 0.6581 0.6958 0.6939 0.5049 0.4783 0.4771 0.5523 0.3059 0.3038
toys_&_games 0.5207 0.4771 0.4754 0.6486 0.6795 0.6777 0.4964 0.5769 0.5763 0.5845 0.4733 0.4724
sports_&_outdoors 0.4684 0.3490 0.3464 0.6566 0.6947 0.6921 0.5144 0.6196 0.6188 0.5997 0.4707 0.4695
electronics 0.4473 0.2809 0.2786 0.6491 0.6930 0.6911 0.5065 0.5461 0.5451 0.5777 0.3496 0.3467
beauty 0.5028 0.4424 0.4397 0.6184 0.6659 0.6633 0.6210 0.7317 0.7310 0.5913 0.6279 0.6265
office_products 0.4634 0.3381 0.3352 0.5708 0.6720 0.6698 0.7480 0.8302 0.8291 0.3365 0.3753 0.3721
automotive 0.4523 0.2771 0.2751 0.5736 0.6255 0.6232 0.7028 0.7982 0.7972 0.3722 0.3791 0.3776
laptop 0.5537 0.6348 0.6329 0.5673 0.6338 0.6322 0.5054 0.6251 0.6125 0.4332 0.5354 0.5108
jewelry&watches 0.4757 0.3428 0.3398 0.6058 0.6713 0.6689 0.6964 0.7978 0.7968 0.5205 0.5802 0.5781
stanford 0.5232 0.4928 0.4908 0.6754 0.7455 0.7438 0.5241 0.6471 0.6471 0.5513 0.6102 0.6102
senpol 0.5668 0.6582 0.6562 0.6263 0.6860 0.6835 0.6074 0.6232 0.6221 0.5271 0.6540 0.6527
lawyer 0.4831 0.3925 0.3898 0.4757 0.2988 0.2970 0.5079 0.3281 0.3259 0.4600 0.2454 0.2418
baby 0.4919 0.4103 0.4086 0.6806 0.7390 0.7370 0.5441 0.6499 0.6490 0.5429 0.3906 0.3883
tools&hardware 0.5773 0.6991 0.6977 0.5819 0.7065 0.7050 0.8236 0.8923 0.8895 0.8536 0.9069 0.9050
music2 0.5794 0.7039 0.7021 0.6292 0.6663 0.6634 0.5148 0.5992 0.5988 0.5388 0.6800 0.6796
computer&videogames 0.5521 0.4926 0.4901 0.5969 0.6372 0.6350 0.5955 0.6820 0.6811 0.6511 0.7167 0.7155
senscale 0.5744 0.6685 0.6668 0.6095 0.6400 0.6376 0.6516 0.7467 0.7464 0.6229 0.7448 0.7445
cell phones&service 0.5230 0.5179 0.5163 0.6178 0.6740 0.6716 0.5921 0.7043 0.7018 0.5866 0.6471 0.6447
musical_instruments 0.4917 0.3637 0.3610 0.5748 0.6818 0.6799 0.7804 0.8613 0.8599 0.3420 0.3922 0.3865
grocery 0.4987 0.4557 0.4532 0.6143 0.6747 0.6724 0.6593 0.7655 0.7647 0.5577 0.6221 0.6214
camera_&_photo 0.5439 0.6055 0.6033 0.6707 0.7006 0.6986 0.5222 0.6427 0.6419 0.5809 0.6139 0.6134
doctor 0.5107 0.4505 0.4486 0.5726 0.5787 0.5768 0.5281 0.5651 0.5640 0.5793 0.4513 0.4497
tv 0.4606 0.2782 0.2756 0.4764 0.3025 0.2997 0.5063 0.3958 0.3895 0.4140 0.3072 0.3048
dvd 0.5901 0.7027 0.7009 0.6399 0.6707 0.6686 0.5003 0.6331 0.6327 0.5474 0.6809 0.6806
camp 0.5798 0.7074 0.7056 0.5779 0.6551 0.6522 0.4848 0.5980 0.5972 0.4821 0.6163 0.6156
drug 0.4237 0.1757 0.1741 0.5360 0.4858 0.4836 0.5025 0.3941 0.3927 0.5107 0.1915 0.1920
magazines 0.4989 0.4405 0.4379 0.6496 0.6801 0.6779 0.5271 0.6310 0.6303 0.6204 0.5226 0.5224
video 0.5725 0.6453 0.6431 0.6188 0.6398 0.6378 0.5260 0.5951 0.5944 0.5888 0.6771 0.6767
health&personal_care 0.5225 0.5446 0.5427 0.6638 0.7029 0.7007 0.5097 0.6310 0.6297 0.5748 0.5477 0.5466
camera 0.5097 0.4849 0.4831 0.6012 0.5773 0.5753 0.5104 0.6387 0.6351 0.5116 0.3546 0.3469
music 0.5106 0.4881 0.4854 0.5696 0.5855 0.5838 0.5128 0.5657 0.5633 0.5994 0.4965 0.4935
apparel 0.4633 0.3039 0.3017 0.6286 0.6388 0.6373 0.5152 0.6209 0.6191 0.5457 0.3231 0.3212
radio 0.4561 0.3070 0.3049 0.5145 0.4339 0.4319 0.5328 0.5070 0.5060 0.5166 0.3228 0.3217
outdoor_living 0.4572 0.3203 0.3175 0.6068 0.6599 0.6571 0.6596 0.7586 0.7580 0.4579 0.4598 0.4586
software 0.4442 0.2196 0.2167 0.6405 0.6929 0.6911 0.5173 0.5325 0.5320 0.5363 0.2508 0.2499
books 0.5529 0.6263 0.6245 0.6397 0.6851 0.6833 0.5007 0.6422 0.6403 0.6768 0.6931 0.6914
gourmet_food 0.4760 0.3329 0.3309 0.5907 0.6476 0.6452 0.6856 0.7858 0.7852 0.4497 0.5131 0.5115
restaurant 0.4606 0.2752 0.2726 0.6112 0.6793 0.6773 0.6792 0.7871 0.7866 0.4494 0.5338 0.5328
average 0.5060 0.4524 0.4502 0.6063 0.6368 0.6346 0.5741 0.6428 0.6413 0.5380 0.5069 0.5047
standard deviation 0.0471 0.1550 0.1552 0.0494 0.1000 0.1000 0.0912 0.1274 0.1278 0.0954 0.1639 0.1644
Table 6: results for the adapted Naive Bayes classifier using FCE to select generalizable features.

From the results presented in Table 6 it is noticeable that the performance for all the different feature-sets is quite low. The best performing feature-set is the 2 gram feature-set with an accuracy of 0.6063. The best performance in terms of micro and macro F1 score is achieved by the combined 1 gram + POS-pattern feature-set. The 1 gram feature-set has the lowest accuracy at 0.5060. Compared to the naive cross-domain approach, ANB using FCE performs worse. The difference in terms of accuracy is biggest for the 1 gram feature-set, with a decrease in accuracy of 0.1417. The smallest difference

is achieved by the 2 gram feature-set, where the accuracy decreases by 0.0347. The decrease in performance in terms of micro and macro F1 scores is even larger for the 1 gram feature-set, with decreases of over 0.23 for both. The smallest difference is observed for the 1 gram + POS-pattern feature-set, with a decrease of just over 0.01. The standard deviations over the accuracies for the 1 gram and 2 gram feature-sets are just over 0.04, which is quite low and shows stable performance over the different datasets used. The standard deviation over the accuracies of both combined feature-sets is somewhat higher, with scores over 0.09, indicating that performance differs more from dataset to dataset. The differences between the reported micro and macro F1 scores are smaller than for the naive cross-domain classifier; this shows that when using FCE, performance on imbalanced and balanced evaluation-sets is very similar. Looking at the differences between the datasets, for the 1 gram features mainly the imbalanced datasets (tools & hardware, automotive, office products) show below-average performance. This indicates that the priors might cause problems when performing domain-adaptation with this feature-set. For the remaining feature-sets this seems to be less of a problem. For the 2 gram feature-set the smaller datasets (office products, lawyer, tools & hardware) on average perform worse, following the trend observed in the previous experiments. For the combined feature-sets it is noticeable that the tools & hardware dataset performs best. This is probably caused by the specific characteristics of that dataset, as for the rest of the datasets no real trend related to size or balance can be detected.
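Tan et al. (2009) score a candidate feature by how frequently, and with how similar a probability, it occurs in the source and the target domain. The sketch below illustrates such a selection; the exact FCE formulation may differ in detail, and the toy corpora, the smoothing constant `alpha` and the cut-off are illustrative assumptions.

```python
import math
from collections import Counter

def fce_scores(source_docs, target_docs, alpha=1e-4):
    """Frequently Co-occurring Entropy sketch: high for features that occur
    often, and with similar probability, in both source and target domain."""
    src = Counter(w for doc in source_docs for w in doc)
    tgt = Counter(w for doc in target_docs for w in doc)
    n_src, n_tgt = sum(src.values()), sum(tgt.values())
    scores = {}
    for w in set(src) | set(tgt):
        p_s = (src[w] + alpha) / (n_src + 2 * alpha)  # smoothed domain probability
        p_t = (tgt[w] + alpha) / (n_tgt + 2 * alpha)
        # frequent in both domains AND similarly distributed => large score
        scores[w] = math.log((p_s * p_t) / (abs(p_s - p_t) + alpha))
    return scores

src_docs = [["great", "sound", "easy", "great"], ["poor", "quality"]]
tgt_docs = [["great", "screen", "easy"], ["poor", "battery", "great"]]
scores = fce_scores(src_docs, tgt_docs)
generalizable = sorted(scores, key=scores.get, reverse=True)[:3]
```

Words such as "great", which are frequent in both toy corpora, outrank domain-specific words such as "screen" or "sound", which is exactly the bridge behavior the generalizable features are meant to provide.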
Looking at the differences in results between the naive cross-domain approach and ANB using FCE on a dataset level, it is noticeable that the imbalanced datasets (restaurant, gourmet food, outdoor living) seem to decrease most in performance in comparison to balanced datasets. In contrast, for the 2 gram feature-set the bigger datasets seem to benefit from the ANB approach using FCE, with increased results (stanford, senpol, senscale). Even for the on average worst-performing combined 1 gram + POS-pattern feature-set, some individual datasets actually improve using FCE compared to the naive cross-domain approach: some of the bigger datasets (senpol, senscale) and a number of imbalanced datasets (tools & hardware, automotive, office products) perform better using FCE. For the combined 1 gram + 2 gram features only the stanford dataset shows an increase in performance compared to the naive cross-domain approach.

5.3.2 Adapted Naive Bayes classifier using a subjectivity lexicon

The first suggested improvement to the adapted Naive Bayes classifier is the replacement of the generalizable features selected by FCE with a subjectivity lexicon: ANB (SL). In this section the results for this suggested improvement are discussed; the results are presented in Table 7.


Dataset | 1 gram: ac micro macro | 2 gram: ac micro macro | 1 gram + POS: ac micro macro | 1 gram + 2 gram: ac micro macro
kitchen&housewares 0.4518 0.3335 0.3306 0.6830 0.7185 0.7170 0.4885 0.5159 0.5151 0.5771 0.3821 0.3801
toys_&_games 0.5120 0.4857 0.4832 0.6677 0.6976 0.6961 0.4797 0.5624 0.5619 0.5914 0.4991 0.4982
sports_&_outdoors 0.4711 0.3943 0.3918 0.6793 0.7158 0.7136 0.4716 0.4884 0.4872 0.6153 0.5346 0.5332
electronics 0.4557 0.3157 0.3129 0.6817 0.7230 0.7212 0.4928 0.5678 0.5670 0.5855 0.3908 0.3879
beauty 0.5020 0.4998 0.4976 0.6224 0.6659 0.6632 0.4506 0.4654 0.4644 0.6319 0.6950 0.6940
office_products 0.4391 0.2221 0.2197 0.5224 0.5493 0.5464 0.4757 0.5482 0.5458 0.2251 0.2182 0.2159
automotive 0.4550 0.2995 0.2963 0.5530 0.5588 0.5563 0.4968 0.5723 0.5704 0.3464 0.3560 0.3544
laptop 0.5398 0.6110 0.6091 0.5448 0.5828 0.5810 0.4549 0.5478 0.5313 0.3337 0.4123 0.3906
jewelry&watches 0.4961 0.4014 0.3990 0.6080 0.6643 0.6619 0.4894 0.5617 0.5603 0.5330 0.6015 0.5995
stanford 0.5602 0.5573 0.5555 0.6766 0.7464 0.7449 0.5197 0.6570 0.6570 0.5591 0.6264 0.6264
senpol 0.5576 0.6576 0.6555 0.6359 0.6938 0.6913 0.6065 0.6133 0.6122 0.5267 0.6632 0.6619
lawyer 0.4914 0.4919 0.4893 0.4712 0.2771 0.2753 0.4850 0.3302 0.3270 0.3791 0.2049 0.2014
baby 0.4901 0.4503 0.4482 0.6959 0.7515 0.7498 0.5063 0.5037 0.5025 0.5808 0.5087 0.5072
tools&hardware 0.4569 0.2750 0.2731 0.5494 0.6446 0.6430 0.5472 0.6618 0.6531 0.2015 0.2474 0.2378
music2 0.5523 0.6505 0.6480 0.6419 0.6782 0.6755 0.4845 0.4430 0.4426 0.5776 0.6820 0.6815
computer&videogames 0.5097 0.4678 0.4653 0.6015 0.6368 0.6343 0.3815 0.2489 0.2480 0.6760 0.7503 0.7492
senscale 0.5558 0.6560 0.6539 0.6133 0.6420 0.6395 0.5949 0.6175 0.6171 0.6256 0.7538 0.7536
cell phones&service 0.5339 0.5163 0.5141 0.6198 0.6723 0.6696 0.4930 0.5430 0.5405 0.5992 0.6705 0.6681
musical_instruments 0.4552 0.2242 0.2222 0.4969 0.4824 0.4799 0.5245 0.6114 0.6082 0.2093 0.2037 0.1990
grocery 0.4998 0.4748 0.4724 0.6153 0.6704 0.6680 0.4412 0.4923 0.4914 0.5648 0.6338 0.6331
camera_&_photo 0.5308 0.5575 0.5556 0.6896 0.7180 0.7163 0.4969 0.5283 0.5274 0.6209 0.6229 0.6223
doctor 0.5371 0.4951 0.4934 0.6016 0.6123 0.6108 0.4867 0.5022 0.5010 0.6032 0.4914 0.4902
tv 0.4722 0.3532 0.3505 0.4769 0.2992 0.2970 0.4749 0.3657 0.3598 0.3296 0.2586 0.2555
dvd 0.5595 0.6447 0.6426 0.6536 0.6834 0.6816 0.5046 0.6497 0.6493 0.5622 0.6669 0.6666
camp 0.5635 0.6707 0.6681 0.5743 0.6536 0.6502 0.4399 0.5464 0.5452 0.4540 0.5726 0.5714
drug 0.4500 0.2828 0.2809 0.5677 0.5195 0.5181 0.4872 0.3658 0.3643 0.5190 0.2280 0.2279
magazines 0.5066 0.4898 0.4869 0.6611 0.6900 0.6882 0.4816 0.4706 0.4700 0.6461 0.5775 0.5772
video 0.5506 0.6009 0.5986 0.6237 0.6441 0.6421 0.4919 0.4298 0.4291 0.5940 0.6663 0.6658
health&personal_care 0.5214 0.5250 0.5216 0.6802 0.7185 0.7165 0.4698 0.5344 0.5330 0.5962 0.5639 0.5628
camera 0.5235 0.5596 0.5581 0.6063 0.5772 0.5754 0.4911 0.5985 0.5948 0.4994 0.3852 0.3785
music 0.5148 0.5057 0.5031 0.5863 0.6032 0.6017 0.4901 0.5562 0.5536 0.6175 0.5263 0.5239
apparel 0.4613 0.3319 0.3296 0.6509 0.6601 0.6589 0.4808 0.5417 0.5398 0.5620 0.3642 0.3620
radio 0.4760 0.4209 0.4190 0.5528 0.4808 0.4792 0.4992 0.4480 0.4469 0.5266 0.3640 0.3628
outdoor_living 0.4669 0.3689 0.3660 0.6083 0.6496 0.6469 0.4161 0.4330 0.4319 0.4865 0.5080 0.5066
software 0.4531 0.2647 0.2615 0.6760 0.7252 0.7240 0.5135 0.5938 0.5934 0.5581 0.3188 0.3177
books 0.5430 0.5815 0.5795 0.6537 0.6958 0.6942 0.5028 0.6409 0.6389 0.6711 0.6696 0.6679
gourmet_food 0.4673 0.3359 0.3339 0.5718 0.6096 0.6072 0.3394 0.3772 0.3762 0.4366 0.5002 0.4986
restaurant 0.4723 0.3342 0.3312 0.6084 0.6732 0.6711 0.3533 0.4139 0.4131 0.5043 0.5961 0.5952
average 0.5015 0.4555 0.4531 0.6111 0.6312 0.6291 0.4817 0.5144 0.5124 0.5191 0.4978 0.4954
standard deviation 0.0391 0.1324 0.1325 0.0600 0.1062 0.1062 0.0511 0.0964 0.0962 0.1260 0.1653 0.1666
Table 7: results of the adapted Naive Bayes classifier using a subjectivity lexicon.

Apart from the 2 gram feature-set, where a very small increase in accuracy can be observed compared to using FCE, the first suggested alternative for selecting generalizable features does not seem to aid classification performance. The 1 gram feature-set performs much like FCE, with a decrease in accuracy of 0.0045 and even smaller differences in micro F1 and macro F1. The 2 gram feature-set shows a small increase in performance of 0.0048, but both F1 scores are a little over 0.005 lower compared to FCE. The combined 1 gram + POS-pattern feature-set performs significantly worse in

comparison to FCE, with a decrease in accuracy of 0.0925; the decrease in micro and macro F1 score is even bigger, with decreases over 0.13. For the combined 1 gram + 2 gram feature-set a slight decrease in accuracy of 0.0189 is observed, with even smaller decreases in terms of micro and macro F1 score. Although the accuracy for the 2 gram feature-set slightly increases compared to using FCE, the standard deviation over the accuracies is a little higher than for FCE: 0.06. The higher standard deviation shows that performance is less stable over the different datasets used. The standard deviation over the accuracies of both combined feature-sets is a little higher as well, especially for the combined 1 gram + 2 gram feature-set, which has a standard deviation over the accuracies of 0.1260. The differences between the reported micro and macro F1 scores are comparable to those reported for FCE. For all feature-sets the biggest datasets (stanford, senpol, senscale) show above-average performance, indicating that more data aids domain-adaptation. As with FCE, for the 2 gram feature-set the smaller datasets (office products, lawyer, tools & hardware) on average perform worse. When comparing results on a dataset level to ANB using FCE, the stanford dataset shows the most significant increase in performance for the 1 gram feature-set, with an increase in accuracy of 0.0370. For the 2 gram feature-set most of the medium-sized datasets seem to benefit from using ANB with a subjectivity lexicon in comparison to ANB using FCE. For both combined feature-sets the datasets performing worse are mostly small datasets. For the combined 1 gram + POS-pattern feature-set almost all datasets show a decrease in performance compared to ANB using FCE, with a few marginal exceptions such as increases for the books and dvd datasets.
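Selecting generalizable features with a subjectivity lexicon reduces to a simple membership test on the vocabulary. A minimal sketch; the tiny lexicon below is an illustrative stand-in for a full resource such as the MPQA subjectivity lexicon.

```python
# Toy subjectivity lexicon; a real run would load a full lexicon resource.
subjectivity_lexicon = {"great", "good", "bad", "poor", "terrible", "love", "hate"}

def generalizable_features(docs, lexicon):
    """Keep only vocabulary items that appear in the subjectivity lexicon:
    sentiment-bearing words are assumed to transfer between domains."""
    vocabulary = {w for doc in docs for w in doc}
    return vocabulary & lexicon

docs = [["great", "camera", "poor", "battery"], ["love", "this", "lens"]]
print(sorted(generalizable_features(docs, subjectivity_lexicon)))
# → ['great', 'love', 'poor']
```

Unlike FCE, this selection ignores the target domain entirely, which may explain why it helps less: a lexicon word that is rare in the target domain is no bridge at all.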
5.3.3 Adapted Naive Bayes classifier using improved generalizable features

The final experiment conducted for this research uses the adapted Naive Bayes classifier with improved generalizable features: ANB (IMP). The results for the different experiments are presented in this section, starting with Table 8.


Dataset | 1 gram: ac micro macro | 2 gram: ac micro macro | 1 gram + POS: ac micro macro | 1 gram + 2 gram: ac micro macro
kitchen&housewares 0.4469 0.2912 0.2890 0.6571 0.6965 0.6947 0.4978 0.4946 0.4939 0.5565 0.3259 0.3241
toys_&_games 0.5191 0.4727 0.4710 0.6475 0.6787 0.6771 0.4949 0.5464 0.5459 0.5822 0.4708 0.4699
sports_&_outdoors 0.4816 0.3869 0.3841 0.6551 0.6936 0.6910 0.4812 0.5006 0.4994 0.6071 0.5041 0.5029
electronics 0.4451 0.2713 0.2688 0.6530 0.6981 0.6962 0.4975 0.5365 0.5355 0.5704 0.3370 0.3340
beauty 0.5102 0.4770 0.4742 0.6203 0.6700 0.6671 0.4680 0.4967 0.4953 0.6102 0.6586 0.6574
office_products 0.4599 0.3278 0.3249 0.5786 0.6778 0.6752 0.5203 0.5990 0.5967 0.3468 0.3881 0.3852
automotive 0.4498 0.2809 0.2787 0.5805 0.6426 0.6403 0.5178 0.5974 0.5957 0.3858 0.3988 0.3970
laptop 0.5527 0.6445 0.6425 0.5728 0.6393 0.6376 0.4635 0.5585 0.5419 0.4446 0.5462 0.5226
jewelry&watches 0.4917 0.3943 0.3921 0.6122 0.6824 0.6799 0.5266 0.6135 0.6120 0.5595 0.6314 0.6295
stanford 0.5231 0.4924 0.4903 0.6754 0.7454 0.7437 0.5242 0.6469 0.6469 0.5514 0.6097 0.6096
senpol 0.5624 0.6552 0.6533 0.6223 0.6826 0.6799 0.6272 0.5907 0.5896 0.5294 0.6550 0.6537
lawyer 0.4852 0.3926 0.3899 0.4793 0.2997 0.2980 0.4853 0.3303 0.3271 0.4713 0.2597 0.2555
baby 0.4828 0.4098 0.4081 0.6815 0.7400 0.7383 0.5069 0.5013 0.5003 0.5479 0.4121 0.4100
tools&hardware 0.5584 0.6303 0.6286 0.5781 0.6919 0.6902 0.5914 0.7015 0.6929 0.8120 0.8795 0.8762
music2 0.5763 0.6985 0.6966 0.6325 0.6705 0.6674 0.4916 0.4916 0.4913 0.5420 0.6809 0.6804
computer&videogames 0.5650 0.5257 0.5235 0.6025 0.6441 0.6418 0.4110 0.3453 0.3441 0.6565 0.7314 0.7302
senscale 0.5730 0.6706 0.6687 0.6062 0.6380 0.6355 0.5953 0.6117 0.6110 0.6225 0.7458 0.7455
cell phones&service 0.5319 0.5593 0.5569 0.6237 0.6820 0.6795 0.5150 0.5795 0.5771 0.6070 0.6876 0.6852
musical_instruments 0.4881 0.3476 0.3455 0.5806 0.6810 0.6792 0.5681 0.6594 0.6563 0.3435 0.3973 0.3919
grocery 0.5118 0.4953 0.4930 0.6147 0.6800 0.6774 0.4657 0.5307 0.5295 0.5846 0.6544 0.6537
camera_&_photo 0.5451 0.6065 0.6044 0.6758 0.7058 0.7040 0.5074 0.5687 0.5678 0.5768 0.6180 0.6175
doctor 0.4969 0.4161 0.4141 0.5855 0.5901 0.5886 0.4986 0.4984 0.4971 0.5875 0.4515 0.4497
tv 0.4581 0.2764 0.2738 0.4809 0.3126 0.3106 0.4760 0.3654 0.3596 0.4262 0.3181 0.3158
dvd 0.5928 0.7039 0.7023 0.6393 0.6714 0.6694 0.4998 0.6488 0.6484 0.5425 0.6783 0.6779
camp 0.5808 0.7079 0.7062 0.5782 0.6557 0.6524 0.4435 0.5520 0.5508 0.4899 0.6220 0.6213
drug 0.4255 0.1908 0.1892 0.5514 0.5067 0.5049 0.4878 0.3360 0.3349 0.5110 0.1987 0.1992
magazines 0.5002 0.4513 0.4490 0.6503 0.6817 0.6796 0.4860 0.4787 0.4780 0.6145 0.5267 0.5265
video 0.5722 0.6401 0.6384 0.6180 0.6397 0.6374 0.4941 0.4607 0.4599 0.5886 0.6757 0.6753
health&personal_care 0.5222 0.5477 0.5461 0.6624 0.7016 0.6990 0.4803 0.5519 0.5505 0.5755 0.5575 0.5563
camera 0.5119 0.4744 0.4723 0.6038 0.5818 0.5799 0.4922 0.5938 0.5897 0.5242 0.3698 0.3620
music 0.5039 0.4660 0.4632 0.5749 0.5894 0.5878 0.4982 0.5493 0.5466 0.6168 0.5090 0.5062
apparel 0.4580 0.2938 0.2917 0.6300 0.6404 0.6389 0.4957 0.5511 0.5490 0.5481 0.3239 0.3217
radio 0.4582 0.3085 0.3065 0.5382 0.4599 0.4583 0.4996 0.4450 0.4439 0.5226 0.3351 0.3341
outdoor_living 0.4657 0.3449 0.3421 0.6168 0.6751 0.6724 0.4510 0.4871 0.4862 0.4824 0.4971 0.4957
software 0.4442 0.2325 0.2302 0.6464 0.6995 0.6979 0.5132 0.5590 0.5584 0.5415 0.2723 0.2713
books 0.5374 0.6068 0.6049 0.6394 0.6849 0.6828 0.5080 0.6399 0.6379 0.6597 0.6688 0.6671
gourmet_food 0.4820 0.3580 0.3562 0.5990 0.6618 0.6597 0.3809 0.4390 0.4378 0.4786 0.5482 0.5466
restaurant 0.4673 0.3002 0.2977 0.6130 0.6836 0.6817 0.3837 0.4577 0.4568 0.4684 0.5551 0.5540
average 0.5062 0.4566 0.4544 0.6099 0.6415 0.6393 0.4959 0.5293 0.5273 0.5444 0.5184 0.5161
standard deviation 0.0459 0.1624 0.1624 0.1066 0.1386 0.1383 0.0915 0.1212 0.1208 0.1217 0.1768 0.1769
Table 8: results of the adapted Naive Bayes classifier using improved generalizable features.

As observed for ANB using FCE, the method using improved generalizable features performs best in terms of accuracy for the 2 gram feature-set, with an accuracy of 0.6099 and micro F1 and macro F1 scores of 0.6415 and 0.6393. For the improved generalizable features the combined 1 gram + POS-pattern feature-set is the worst performing feature-set with an average accuracy of 0.4959; the micro and macro F1 scores for the 1 gram features are the lowest, with 0.4566 and 0.4544. The differences between micro F1 and macro F1 scores are very small, showing

consistent results between balanced and imbalanced evaluation-sets. Compared to the ANB classifier using FCE the performance is very similar. The 1 gram and 2 gram feature-sets show minimal improvements in terms of accuracy of 0.0003. The increase in performance for the combined 1 gram + 2 gram feature-set is a little more pronounced at 0.0064. The performance of the combined 1 gram + POS-pattern feature-set is somewhat lower than for ANB using FCE: the difference in accuracy is 0.0782, and the differences in micro and macro F1 are 0.11. In comparison to ANB using a subjectivity lexicon the performance of ANB using improved generalizable features increases a little for all feature-sets except the 2 gram feature-set. The difference in accuracy for the 1 gram feature-set is very small at 0.0048; the differences for the combined feature-sets are a little bigger at 0.0142 and 0.0252. For the micro and macro F1 scores all feature-sets show slight improvement over ANB using a subjectivity lexicon. Comparing the results to the naive cross-domain approach, the same observations as for ANB using FCE can be made, although the improved method performs slightly better; performance remains lower than the naive cross-domain baseline. The difference in terms of accuracy is biggest for the 1 gram feature-set at 0.1415. The smallest difference is achieved by the 2 gram feature-set, where the difference is 0.0311. The decrease in performance in terms of micro and macro F1 scores is highest for the combined 1 gram + POS-pattern feature-set, with differences over 0.23. The smallest difference is observed for the combined 1 gram + 2 gram feature-set, with a difference just over 0.03. The standard deviation over the accuracies for the 1 gram feature-set is quite low at 0.0459; for the rest of the feature-sets the standard deviation is very high, with values all over 0.10.
In comparison to ANB using FCE this is not surprising, as the standard deviations there were quite high as well, except for the 2 gram feature-set, which was low for FCE but is high for the improved method. The high standard deviations show that although the performance has increased a little over FCE, the stability has changed and the quality of the domain-adaptation is more dependent on the supplied data. As observed for FCE, the smaller and imbalanced datasets show below-average performance for the domain-adaptations using the 1 gram and 2 gram feature-sets. For the combined 1 gram + POS-pattern feature-set the smaller datasets (drug, camp, musical instruments) seem to perform below average. The balance of the training-data is of less importance for the combined feature-sets, as the best performing dataset is again the tools & hardware dataset. As observed for ANB using a subjectivity lexicon, the bigger datasets (stanford, senscale, senpol) show above-average performance, indicating that the availability of data aids classification accuracy. When comparing ANB using improved generalizable features to ANB using FCE on a dataset level, the results for the 1 gram feature-set are very similar. The main difference lies in a slight increase in performance, mainly for the imbalanced datasets (beauty, jewelry & watches, computer & video games, grocery). For the 2 gram feature-set most datasets show increases in performance compared to ANB using FCE; the few datasets showing decreases are mainly medium-sized datasets. For the combined feature-sets most datasets increase in performance compared to ANB using FCE, and the increases are all caused by performance increases for medium-sized datasets. Compared to naive cross-domain classification the performance for the 1 gram feature-set decreases considerably.
While on average the performance of all feature-sets decreases, for the 2 gram, combined 1 gram + POS-pattern and combined 1 gram + 2 gram feature-sets the performance for the bigger datasets (stanford, senscale, senpol) actually increases.
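The core mechanism shared by the three ANB variants can be illustrated with a much-simplified EM-style sketch: word probabilities are initialized from the labeled source domain and gradually shifted toward counts from target-domain documents that the current model labels itself, bootstrapping from the generalizable "bridge" features. This is an illustration under these assumptions, not the exact formulation of Tan et al. (2009); the λ schedule, function names and toy data are invented for the example.

```python
import math
from collections import Counter

def train_anb(source, target_docs, general_feats, iters=5):
    """EM-style adapted Naive Bayes sketch: P(w|c) starts from labeled
    source counts and is shifted toward soft-labeled target counts."""
    classes = sorted({label for _, label in source})
    vocab = ({w for doc, _ in source for w in doc}
             | {w for doc in target_docs for w in doc})

    def estimate(counts):
        total = sum(counts.values())
        # Laplace-smoothed P(w | c) over the shared vocabulary
        return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

    src = {c: Counter(w for doc, lab in source if lab == c for w in doc)
           for c in classes}
    prior = {c: sum(1 for _, lab in source if lab == c) / len(source)
             for c in classes}
    src_est = {c: estimate(src[c]) for c in classes}
    probs = src_est

    def classify(doc):
        return max(classes, key=lambda c: math.log(prior[c]) + sum(
            math.log(probs[c][w]) for w in doc if w in vocab))

    for i in range(1, iters + 1):
        lam = i / iters  # target-domain weight grows each EM round
        # label target docs, preferring the generalizable bridge features
        labelled = [(doc, classify([w for w in doc if w in general_feats] or doc))
                    for doc in target_docs]
        tgt = {c: Counter(w for doc, lab in labelled if lab == c for w in doc)
               for c in classes}
        tgt_est = {c: estimate(tgt[c]) for c in classes}
        probs = {c: {w: (1 - lam) * src_est[c][w] + lam * tgt_est[c][w]
                     for w in vocab} for c in classes}
    return classify

source = [(["great", "sound"], "pos"), (["poor", "sound"], "neg"),
          (["great", "quality"], "pos"), (["bad", "quality"], "neg")]
target = [["great", "battery"], ["poor", "screen"], ["great", "screen"]]
classify = train_anb(source, target, general_feats={"great", "poor", "bad"})
```

In the toy run, target-only words such as "battery" and "screen" acquire sentiment-conditioned probabilities purely through their co-occurrence with the bridge words, which is the intended adaptation effect.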


Chapter 6 Results analysis
In this chapter the results for the different experiments conducted in this thesis will be analyzed. First the in-domain experiments will be discussed. Next the naive cross-domain baseline method will be considered. In the third section of this chapter the cross-domain adaptation using generalizable features will be examined, together with the different suggested improvements for the selection of generalizable features.

6.1 In-domain baseline experiments

As a baseline for in-domain classification two different classifiers were used: the multinomial Naive Bayes classifier and the subjectivity lexicon classifier. The baseline in-domain experiment was conducted in order to evaluate whether the general setup results in performance comparable to previous research. The multinomial Naive Bayes classifier was trained on different feature-sets (1 gram features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features). In the following sections the results for the different classifiers will be discussed.

6.1.2 Naive Bayes classifier

The first experiment discussed in this chapter considers the in-domain performance of the Naive Bayes classifier. The performance of four different feature-sets was considered over the 38 different datasets used in this research. After the discussion of the feature-sets, confusion matrices are presented for a couple of representative domains. Next, to gain insight into the behavior of the Naive Bayes classifier and the features used for the different in-domain classifications, examples of features and their probabilities are detailed. The best performing features for the in-domain experiment were the 1 gram features.
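In outline, the in-domain setup corresponds to the following scikit-learn sketch (assumed here for illustration; the thesis uses its own implementation): a multinomial Naive Bayes classifier trained on 1 gram count features, with the toy texts and labels invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1 gram features; ngram_range=(1, 2) would give the combined 1 gram + 2 gram set
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())

train_texts = ["great sound quality", "poor build quality",
               "great easy setup", "bad poor screen"]
train_labels = ["pos", "neg", "pos", "neg"]
model.fit(train_texts, train_labels)
print(model.predict(["great screen"]))  # → ['pos']
```

In a 10-fold cross-validation run the same pipeline would simply be re-fitted on each training fold and scored on the held-out fold.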
Looking at the smaller datasets, the accuracies seem to be above average compared to the rest of the datasets: the best performing dataset is even the smallest one, the tools & hardware dataset, both in terms of accuracy and in terms of the reported micro F1 and macro F1 scores. Looking at the average-sized datasets (such as kitchen & housewares, sports & outdoors, music2 and video) the performance is around the reported average, with accuracies of 0.7730, 0.7873, 0.7599 and 0.7760. The small differences are probably caused by small differences in the datasets and the examples used, such as differences in annotations for individual examples within a dataset or inconsistencies in the terminology used by the different reviewers of the products. Looking at the biggest (stanford) dataset, it is noticeable that the performance is considerably lower than the results reported by Go et al. (2009). In that paper accuracies of 0.80 and higher are reported, whereas performance in the experiments here is 0.6774 for the stanford dataset. The stanford dataset is even the worst performing dataset in this experiment in terms of accuracy, although the micro F1 and macro F1 scores for the dvd dataset are lower. The under-performance of the stanford dataset is probably caused by the slightly adapted form of the Naive Bayes classifier used by the authors to handle recurring terms in tweets (in tweets people often use constructs like HAPPY HAPPY HAPPY). In order to be able to consistently use the same

approach between domains, the decision was made to not use this adapted form which, as experimentation showed, only makes a positive difference for the stanford dataset. The low standard deviation shows that the multinomial Naive Bayes classifier is a stable classifier in the context of in-domain classification. The classifier has shown similar accuracy results in different domains following the same general setup. These are good properties when training a new sentiment classification system for a domain where training-data is available or can easily be gathered (automatically). Following the methodology described in this thesis, accurate sentiment classifiers can easily be set up when annotated examples are available. As described in previous work (Pang and Lee, 2002), the Naive Bayes classifier shows itself to be a stable classifier. Comparison to previous research is somewhat limited, as no other research has used the collection of datasets gathered for this thesis. However, looking at the results reported by Pang and Lee (2002) for the senpol dataset, the results for the 1 gram experiments are similar: Pang and Lee (2002) report an accuracy of 0.787, while the accuracy reported here for the senpol dataset is 0.7880. In research by Turney (2002) accuracies for in-domain classification are reported in the range of 0.66 to 0.84; although Turney uses a different classification approach than the multinomial Naive Bayes classifier used in this research, the accuracies are comparable to the performances reported in Turney's research. From this it can be concluded that the general setup for this research can use different datasets from different sources in a general way and achieve results similar to previous work. As an illustrative example, the 10-fold cross-validation accuracy results for the musical instruments dataset and the apparel dataset are shown in Table 9.
The musical instruments dataset consists of only 332 examples, of which 284 are positive and 48 are negative. The classifier reaches an average accuracy of 0.8559 for the musical instruments dataset, with a micro F1 score of 0.9238 and a macro F1 score of 0.9228. As a comparison the apparel dataset is added. The apparel dataset is bigger, with 2000 available examples, and is balanced with 1000 positive and 1000 negative examples. From the comparison it can be seen that the above-average performance of 0.8559 for the musical instruments dataset might be caused by the imbalance in the dataset. The performance of 0.7676 for the apparel dataset is a better representation of the general performance of the Naive Bayes classifier. The results in Table 9 also show that the extra randomization step taken before splitting the data into 10 folds makes sense: in earlier experimentation the results varied widely between folds, as sometimes many positive or negative examples ended up in a single fold, leading to large biases.
Overview 10-fold cross-validation

Run number    musical instruments    apparel
1             0.7879                 0.7450
2             0.8636                 0.7550
3             0.8687                 0.7600
4             0.8788                 0.7700
5             0.8606                 0.7750
6             0.8586                 0.7817
7             0.8615                 0.7721
8             0.8636                 0.7725
9             0.8552                 0.7694
10            0.8606                 0.7755
total         0.8559                 0.7676

Table 9: overview of 10-fold cross-validation for the musical instruments and apparel datasets.
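The extra shuffle before the 10-fold split discussed above can be sketched as follows. This is a minimal illustration in Python; the function name, the fixed seed, and the toy label list are assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of the shuffled 10-fold split described above.
# Function name and fixed seed are illustrative choices.
import random

def ten_fold_splits(examples, n_folds=10, seed=42):
    """Shuffle once, then yield (train, test) index lists per fold."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # avoids label-skewed folds
    fold_size = len(indices) // n_folds
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test

# toy dataset mirroring the musical instruments label distribution
labels = ["pos"] * 284 + ["neg"] * 48
folds = list(ten_fold_splits(labels))
```

Shuffling once up front spreads the positive and negative examples roughly evenly over the folds, which is exactly the bias the randomization step above is meant to prevent.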


In Table 10 the confusion matrices for the same datasets are presented. From the confusion matrix for the musical instruments dataset it can be seen that the imbalance in the dataset leads to a considerable bias towards positive classification: there are 42 false positives against only 4 falsely classified negative examples. Compared to a classifier trained on a balanced dataset like the apparel dataset, the ratio of falsely classified positive to falsely classified negative examples is much larger. It can be expected that these big biases will have a negative effect on classification performance when a biased classifier is used in a cross-domain setting.

Overview confusion matrices

musical instruments      Prediction
Actual       TN   5      FP  42
             FN   4      TP 279

apparel                  Prediction
Actual       TN 638      FP 362
             FN  87      TP 913

Table 10: overview of the confusion matrices for the musical instruments and apparel datasets.
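The evaluation measures used throughout this chapter can be derived directly from such a confusion matrix. The sketch below uses the musical instruments counts from Table 10; note that the macro F1 here is simply the unweighted mean of the two per-class F1 scores, which may differ from the exact per-fold averaging used in the thesis.

```python
# Accuracy and per-class F1 from a binary confusion matrix. The counts
# are the musical instruments matrix from Table 10; the macro F1 below is
# the plain mean of the two class F1 scores, which may differ from the
# exact per-fold averaging used in the thesis.

def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), with a guard for empty classes."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

tn, fp, fn, tp = 5, 42, 4, 279
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_pos = f1(tp, fp, fn)        # F1 of the positive class
f1_neg = f1(tn, fn, fp)        # F1 of the negative class (roles swapped)
macro_f1 = (f1_pos + f1_neg) / 2
```

The very low F1 of the negative class makes the positive-classification bias discussed above immediately visible in a single number.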

Looking at the results for the in-domain 10-fold cross-validation with the 2 gram feature-set, the results are slightly lower compared to the 1 gram feature-set. Comparing the results to previous research (Pang and Lee, 2002) on the senpol dataset, the reported accuracy for 2 gram classification in their research is 0.773, where this research reports an accuracy of 0.8123 for the senpol dataset. Taking into account the different datasets used in this research, it can be concluded that the average accuracy of 0.7739 over all the datasets is comparable to previous research. Intuitively one might think that higher accuracies could be achieved with 2 gram features than with 1 gram features, because 2 gram features offer more expressiveness (i.e. a 2 gram feature like “not good” carries a negative implication, in contrast to the two separate 1 grams “not” and “good”). However, for most of the datasets in this research the 2 gram features yield lower accuracy scores than the 1 gram features. For example, for the office products dataset the performance with 1 gram features is 0.8780, whereas the 2 gram classifier reaches only 0.7021, a difference of 0.1759. In the cases where 2 gram features (slightly) outperform the 1 gram features, the datasets involved are of considerable size; for example, a small increase in performance is found for the electronics dataset and the beauty dataset. Where datasets are relatively small, fewer examples are available for training and hence less expressiveness is captured. For example, for the laptop dataset the performance decreases from 0.7438 for the 1 gram feature-set to 0.6774 for the 2 gram feature-set. The correlation with dataset size doesn't always hold, however, as the tools & hardware dataset shows.
For this dataset very few examples are available (112 data examples) and the dataset is quite unbalanced (14 negative versus 98 positive examples). This results in quite an unbalanced classifier with a high prior probability for positive examples. In many cases this prior probability will not be outweighed by the “proof” found in the analyzed features. Because the accuracy values are based on a 10-fold cross-validation of the data and the dataset is quite small, the training and evaluation sets are also small subsets, which in this case probably doesn't paint a representative picture of the performance for this domain. The standard deviation of the results for the 2 gram feature-set is a little higher than for the 1 gram feature-set: 0.0750. This adds to the argument that the classifier trained with 2 gram features is more reliant on the amount of supplied data than the classifier trained with 1 gram features. The performance is also more dependent on the segmentation of the data into folds. Certainly in the case of smaller training sets, certain features may not be assigned the probabilities that would be

intuitive (e.g. because “good” is present in a lot of different 2 gram variations: “good meal”, “good tool”, “good movie”). In this case the distribution of the sentiment of “good” is spread over different 2 gram features. This can result in the misclassification of certain examples, especially when mixed sentiment is expressed in a sentence. The results for the combined 1 gram + POS-pattern feature-set are slightly lower than the accuracies reported for the 1 gram and 2 gram feature-sets. The tools & hardware dataset is a surprise here, as this was the best performing dataset for the 1 gram and 2 gram feature-sets. The sudden change in performance can be explained by the aforementioned argument that the more complex the features get, the more data is necessary to capture the essence of the domain. In cases where datasets are small (e.g. the tools & hardware dataset), performance can decrease considerably, certainly in the case of the POS-pattern feature-set. This argument is further supported by the performance of other smaller datasets (e.g. the lawyer dataset), where performance decreases when POS-patterns are added. Looking at the performance for the medium sized datasets (e.g. kitchen & housewares, sports & outdoor), the results are comparable to the 1 gram or 2 gram results. Again, the senpol dataset is used to compare to previous research by Pang and Lee (2002). The performance reported by Pang and Lee (2002) is 0.815, where performance here is a little lower: 0.7779. The difference in performance can be explained by the POS-tagger applied: Pang and Lee (2002) use Oliver Mason's Qtag program, where this research uses the Brill tagger. The combined 1 gram + 2 gram feature-set is the worst performing feature-set considered in the in-domain experiment.
The performance for the smaller datasets (tools & hardware, lawyer, laptop) is slightly lower than for the 1 gram feature-set, which is to be expected as the performance of the 2 gram feature-set for those datasets was slightly lower as well. For the medium and large sized datasets the performance is stable in comparison to the 1 gram, 2 gram and combined 1 gram + POS-pattern feature-sets. To relate the results, the senpol dataset is used again (Pang and Lee, 2002). The performance reported by Pang and Lee (2002) is 0.8060, where performance here is a little lower: 0.7819. The difference between the two reported accuracies is less than 3%, which is an acceptable difference for the general approach used in this research. Looking at the micro and macro F1 scores for the different feature-sets, the differences between the two F1 scores turned out to be quite small. This implies that little difference in the final result is observed whether a balanced or an unbalanced evaluation-set is used. From this it can be concluded that the classifiers perform quite stably over differently balanced evaluation-sets.

Feature analysis

To gain more insight into the behavior of the multinomial Naive Bayes classifier and the features on which classification is performed, Table 11 lists the top 10 features with the highest probability calculated by Naive Bayes for the kitchen & housewares dataset and the apparel dataset. For the kitchen & housewares dataset the 1 gram features are listed, for the apparel dataset the 2 gram features. Both the top 10 positive features and the top 10 negative features are shown.


Top 10 1 gram features, kitchen & housewares dataset

Negative               Positive
feature  probability   feature  probability
'the'    0.0547        'the'    0.0474
'i'      0.0290        'i'      0.0271
'it'     0.0264        'and'    0.0264
'and'    0.0237        'a'      0.0240
'to'     0.0227        'to'     0.0225
'a'      0.0215        'it'     0.0217
'of'     0.0149        'is'     0.0150
'this'   0.0132        'this'   0.0133
'is'     0.0129        'for'    0.0128
'in'     0.0101        'of'     0.0128

Top 10 2 gram features, apparel dataset

Negative                  Positive
feature     probability   feature     probability
'of the'    0.0026        'they are'  0.0019
'in the'    0.0020        'i have'    0.0017
'i was'     0.0017        'in the'    0.0014
'i have'    0.0015        'of the'    0.0014
'and the'   0.0014        'and the'   0.0013
'it is'     0.0014        'i am'      0.0013
'it was'    0.0013        'and i'     0.0013
'on the'    0.0013        'this is'   0.0012
'i am'      0.0012        'i was'     0.0012
'they are'  0.0011        'it is'     0.0011
Table 11: Top 10 positive and negative features for the kitchen & housewares dataset and the apparel dataset.
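Rankings such as those in Table 11 follow directly from how the multinomial model estimates per-class feature probabilities: relative frequency with add-one (Laplace) smoothing. A minimal sketch, where the toy corpus and whitespace tokenization are illustrative assumptions:

```python
# Per-class feature probabilities as in Table 11: relative frequency of
# each token within a class, with add-one (Laplace) smoothing. The corpus
# and whitespace tokenization are toy stand-ins.
from collections import Counter

def class_probabilities(docs, smoothing=1.0):
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    vocab = len(counts)
    total = sum(counts.values())
    return {w: (c + smoothing) / (total + smoothing * vocab)
            for w, c in counts.items()}

pos_docs = ["the pan is great", "i love the sturdy handle"]
probs = class_probabilities(pos_docs)
top = sorted(probs, key=probs.get, reverse=True)[:3]  # function words dominate
```

Because the estimate is driven purely by raw frequency, function words like "the" inevitably float to the top of such a ranking, which is exactly the pattern visible in Table 11.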

As can be seen in Table 11, the most frequent features among both positive and negative 1 gram features are features we would expect to be prevalent in the English language in general. Function words like “the” and “I” often occur in both positively and negatively annotated examples in the dataset. For the 2 gram features the same observation can be made: 2 grams like “of the” and “they are” are prevalent. The top 10 POS-pattern features are more informative than the top 10 features seen for the 1 gram and 2 gram features. These features are listed in Table 12 below.

Top 10 POS-pattern features from the restaurant dataset

Negative                      Positive
feature          probability  feature          probability
'n't know'       0.0023       'great place'    0.0037
'not worth'      0.0018       'great food'     0.0032
'never go'       0.0018       'n't wait'       0.0026
'n't have'       0.0016       'ever had'       0.0026
'never return'   0.0015       'very good'      0.0025
'not get'        0.0015       'n't be'         0.0023
'first time'     0.0013       'several times'  0.0019
'n't waste'      0.0011       'good food'      0.0019
'only thing'     0.0011       'n't know'       0.0018
'n't eat'        0.0011       'great service'  0.0018
Table 12: Top 10 positive and negative features for the restaurant dataset.

In contrast to the features assigned high probability by the 1 gram and 2 gram multinomial Naive Bayes classifiers, the features in Table 12 are intuitively more accurate in denoting positive and negative sentiment. The features “not worth” and “never go” are very likely to denote a negative signal in a restaurant review, and the features “great place” and “great food” are clear signals of positive sentiment. Although some of the features could be argued to be applicable in general (non-domain-specific features, e.g. “not get”, “very good”), other features seem only representative of the specific domain (domain-specific features), in this case the restaurant dataset. For example, “good food” is a feature probably used in sentences like “Overall a good time and good food”, which would clearly be a

sentence with positive sentiment, but only within the context of restaurant reviews (or other domains concerning food). These features are highly domain-specific, and misclassification can therefore be expected in a cross-domain setting when such features are used as proof for a positive sentence. From the observations made in the previous two tables, a possible problem arises when calculating generalizable features using co-occurrence (FCE). As can be seen from the tables, common terms (non-sentimental, non-domain-specific) are prevalent. When calculating generalizable features based on FCE, it can be expected that these common terms will achieve high rankings instead of the intended sentimental, non-domain-specific terms. Because the multinomial Naive Bayes classifier assigns probability to each feature based on the number of times the feature has been detected in the examples, it could be expected that the top 10 features would look like the examples observed. The number of times a feature has been detected can be seen as the amount of “proof” there is for that feature being used in a positive or negative context. Ideally this would lead to a considerable difference in “proof” for features with sentimental connotations (e.g. good or bad). Therefore, the behavior of the classifier can better be explained by looking at features for which different probabilities would be expected. In Table 13 an example is provided with the probabilities assigned to the features “good” versus “bad” and “very good” versus “very bad” for different datasets.

Dataset          feature      negative  positive
restaurant       good         0.0020    0.0039
stanford         bad          0.0018    0.0005
camera & photo   very good    0.00011   0.00035
baby             very bad     0.00002   0.00001
Table 13: Probabilities for the features “good” versus “bad” and “very good” versus “very bad” in a positive and negative context for different datasets.

Table 13 shows how the multinomial Naive Bayes classifier separates examples based on the difference in probabilities assigned to features with a sentimental connotation. In the first row the difference in probability for the feature “good” is shown for the restaurant dataset: the probability of “good” occurring in a positive context is roughly twice as high as the probability of it occurring in a negative context. In this way the probabilities contribute “proof” of a sequence of features being positive or negative. Another example is the feature “bad” from the stanford dataset, which has a considerably smaller probability of being used in a positive context than in a negative context. For the 2 gram feature-set in Table 13 the same observation can be made. The feature “very good” has a probability roughly three times as high of occurring in a positive context as in a negative context, and the feature “very bad” from the baby dataset has a probability of being used in a positive context that is only half that of its negative context. From this behavior of the multinomial Naive Bayes classifier we can conclude that the classifier uses the probability of occurrence within a sentimental context to assign probabilities in the Naive Bayes model. How much more prevalent a term is in one context than in the opposite context can thus be used to detect terms which are sentimental (within context). The proposed improvement for selecting generalizable features uses this principle (via TF-IDF) to select improved generalizable features.
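A minimal illustration of this principle: a term's skew between its two class-conditional probabilities can serve as a simple sentiment-relevance score, separating terms like "good" from function words like "the". The log-ratio score below only illustrates the idea and is not the thesis's exact TF-IDF formulation.

```python
# Illustration (not the thesis's exact TF-IDF formulation): score a term
# by how skewed its probability is between the positive and the negative
# class. Sentiment-bearing terms score high, function words score low.
import math

def sentiment_skew(p_pos, p_neg, eps=1e-9):
    """Absolute log-ratio of a term's class-conditional probabilities."""
    return abs(math.log((p_pos + eps) / (p_neg + eps)))

# "good" in the restaurant dataset (Table 13) vs. the near-symmetric
# probabilities of "the" in the kitchen & housewares dataset (Table 11)
skew_good = sentiment_skew(0.0039, 0.0020)   # clearly skewed
skew_the = sentiment_skew(0.0474, 0.0547)    # nearly symmetric
```

Any score of this shape ranks "good" well above "the", which is the behavior a generalizable-feature selector needs in order to avoid the common-term problem raised above.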


6.1.3 Subjectivity lexicon classifier

For the subjectivity lexicon classifier the accuracies range from a top performance of 0.7803 for the musical instruments dataset to 0.4963 for the senpol dataset. The observed standard deviation is 0.0676, which is almost as low as the standard deviation observed for the 1 gram feature-set. Although this classifier is the worst performing one, it is the only classifier which can be expected to perform consistently in a cross-domain setting, as no training-data is used: the features used by this classifier form a (semi-)manually composed list of terms which is not trained from any source data used during the experiments. The classifier performs comparably to the top performing classifiers in only one case (musical instruments) but has the lowest performance on many other datasets (music2, camera, video); this might be explained by the more domain-specific nature of the datasets where low performance is observed. Especially in the case of the camera and video datasets it can be expected that domain-specific features contribute more to the overall expressed sentiment in the text than the general sentiment-expressing features used in a subjectivity lexicon. In Table 14 the confusion matrices for the musical instruments and the apparel datasets are presented. From the confusion matrix for the musical instruments dataset it can be seen that, although there's an imbalance in the dataset, the classifier doesn't give an unbalanced outcome in the amounts of false positives and false negatives. This is of course caused by the fact that the subjectivity lexicon classifier doesn't use data to train, but uses a predefined list of terms. The amount of false positives is 31 against 49 falsely classified negative examples.
If we compare this to the performance on a balanced dataset like the apparel dataset, surprisingly enough there does seem to be a big positive bias, as 691 examples are wrongly classified as being positive. This is caused by the language used in the apparel review dataset: the language used to describe positive examples is probably not general enough to be captured by the subjectivity lexicon classifier.

Overview confusion matrices

musical instruments      Prediction
Actual       TN  16      FP  31
             FN  49      TP 234

apparel                  Prediction
Actual       TN 309      FP 691
             FN 146      TP 854

Table 14: overview of the confusion matrices for the musical instruments and apparel datasets.
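The subjectivity lexicon classifier itself can be sketched as a simple per-polarity hit count. The tiny word lists below are illustrative stand-ins for the actual predefined lexicon, and the tie-breaking rule towards positive is an assumption.

```python
# Minimal subjectivity-lexicon classifier: count lexicon hits per polarity
# and pick the majority. The tiny lexicons are illustrative stand-ins for
# the predefined list used in the thesis; ties default to positive here.
POSITIVE = {"great", "good", "excellent", "wonderful", "happy"}
NEGATIVE = {"bad", "terrible", "poor", "awful", "disappointing"}

def lexicon_classify(text):
    tokens = text.lower().split()
    pos_hits = sum(t in POSITIVE for t in tokens)
    neg_hits = sum(t in NEGATIVE for t in tokens)
    return "positive" if pos_hits >= neg_hits else "negative"

label = lexicon_classify("the food was excellent and the service great")
```

Because no parameters are estimated from any dataset, this classifier behaves identically in-domain and cross-domain, which is exactly the stability property discussed above.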

Comparison to previous work for the subjectivity lexicon classifier is somewhat limited, as no research is available covering the collection of datasets used for this research. The results can however be compared to Das and Chen (2001), who report a performance of 0.620 on financial news datasets with a subjectivity lexicon. Considering the average accuracy of 0.6094 over the 38 datasets, these results seem to correspond to the accuracies reported for this research. Although the subjectivity lexicon classifier has been outperformed by all feature-sets considered for the Naive Bayes classifier, the observed difference between the micro and macro F1 scores is smaller than for the Naive Bayes classifier. From this it can be concluded that the subjectivity lexicon classifier behaves more stably in the face of differences between balanced and unbalanced evaluation-sets. This can be explained by the lack of need for training-data, because of the use of a predefined list of sentimental terms. The intuition for using a subjectivity lexicon in combination with the adapted Naive Bayes classifier originates from this property: the consistent and data-independent performance of a predefined list of sentimental terms could

very well serve the adapted Naive Bayes classifier, which depends on qualitative sentimental terms.

6.2 Cross-domain baseline experiments

6.2.1 Naive Bayes classifier

In the second experiment conducted for this research, the cross-domain performance of the Naive Bayes classifier is explored using a naive cross-domain approach: for each (target) domain, classifications are made with all other individual in-domain trained classifiers (using all available data for those domains). In this section the results for the different feature-sets are presented, followed by a more detailed analysis of a sub-section of the domains for reference against the methods considered in the next set of experiments. Finally, tables are presented giving insight into domain-specific features based on the different datasets considered in this research. As could be expected, in a cross-domain setting the performance of the classifier decreases considerably for all considered feature-sets. For the 1 gram feature-set the performance decreases from an in-domain average of 0.7901 to a cross-domain accuracy of 0.6477, a decrease in accuracy of 0.1424; the same goes for the 2 gram feature-set. This result adds to the case that classifiers trained on one domain can only be used directly on another domain to a limited extent. The best performing domain in this sense is the restaurant dataset, with an accuracy score of 0.7624 and micro and macro F1 scores of 0.8416 and 0.8412. These results probably say something about the generality of the dataset: in a more general dataset (where more generalizable features are used to denote sentiment), the averaged classification from all available classifiers is likely to reach a higher accuracy than in a less general dataset. An illustrative example of this generality is added below:

"1","Ambiance is great - service wonderful, and food really excellent.
We were all very happy with the entire experience"

Examining the 1 gram features, the majority of the features found in this example seem to be domain-independent: “great”, “wonderful”, “excellent” and “happy” are all very likely to be domain-independent features signaling positive sentiment.
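One plausible reading of this naive cross-domain setup is sketched below: every in-domain classifier except the target's own labels the target documents, and a majority vote is taken. The classifier callables and the voting rule are illustrative assumptions, not the thesis's exact aggregation.

```python
# Sketch of the naive cross-domain baseline: every in-domain classifier
# except the target's own labels each target document, and the majority
# vote is taken. The classifier callables and voting rule are assumptions.
from collections import Counter

def naive_cross_domain(target_docs, classifiers_by_domain, target_domain):
    predictions = []
    for doc in target_docs:
        votes = Counter(
            clf(doc)
            for domain, clf in classifiers_by_domain.items()
            if domain != target_domain  # never use the target's own data
        )
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

Note that each source classifier brings along its own class priors, which is one reason the aggregated prediction degrades on targets whose label balance differs from the sources'.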
Feature-set              1 gram                    2 gram                    1 gram + POS              1 gram + 2 gram
Dataset size             ac     micro  macro      ac     micro  macro      ac     micro  macro      ac     micro  macro
small sized datasets     0.6478 0.6963 0.6916     0.6497 0.6995 0.6944     0.6283 0.6661 0.6611     0.6237 0.6597 0.6543
medium sized datasets    0.6563 0.7034 0.7022     0.6489 0.6930 0.6919     0.6349 0.6718 0.6706     0.6263 0.6547 0.6535
big sized datasets       0.5754 0.5774 0.5767     0.5463 0.5234 0.5229     0.5610 0.5415 0.5408     0.5279 0.4571 0.4567

Table 15: relation between the performance and the size of the dataset for the naive cross-domain approach.

As observed earlier, for both the 1 gram and the 2 gram feature-set the stanford dataset is problematic; the lowest overall accuracy is reported for this dataset. In general, considering Table 15, the smaller datasets (tools & hardware, laptop, lawyer) show performance around the average for this experiment. An exception is the tools & hardware set: for the 1 gram feature-set it was an average performing dataset, while for the 2 gram feature-set its accuracy rises to 0.7170, which is above average. This can be explained by over-fitting: probably a lot of data-specific

examples containing the right 2 gram features are captured. The same goes for the medium sized datasets: kitchen & housewares, sports & outdoors. The average accuracy is specifically lower for the bigger (stanford) dataset. This can be explained by the fact that, apart from differing in subject, this domain also differs in that the data doesn't consist of reviews but of tweets. As discussed in Chapter 4, tweets are significantly different from reviews in the average number of words used and the average number of sentences per document. Therefore, cross-domain performance on datasets of tweets can be expected to be sub-optimal. Another dataset where cross-domain performance is considerably below average for both the 1 gram and 2 gram feature-sets is the senpol dataset, with the lowest reported micro F1 and macro F1 scores: 0.5465 and 0.5447. For both feature-sets a considerable decrease in accuracy is observed for this dataset. This can also be explained by the differences in the provided data: the senpol dataset has a relatively large average number of sentences per document, which implies more complexity in the (domain-specific) features, which are harder to capture in a cross-domain setting. The average accuracy of the combined 1 gram + POS-pattern feature-set is slightly worse than that of the 1 gram and 2 gram feature-sets: 0.6273. The reported standard deviation is 0.0367, which is slightly better than for the 1 gram and 2 gram feature-sets, making this combination the most stable performing feature-set. For both feature combinations, the performance of the small sized datasets is around average, just as for the medium sized datasets; the bigger (twitter) dataset shows lower performance. The issues raised for the in-domain tasks with regard to the POS-pattern feature-set are even more relevant for the cross-domain task.
Combining all the different classifiers can lead to behavior where a lot of domain-specific features (feature combinations) skew the classification. For some individual classifiers this can lead to better performance (when, for example, there happens to be a good match between source and target domain), but in the majority of cases adding POS-pattern features to the classifier doesn't aid the classification accuracy. Compared to in-domain performance the decrease in accuracy is considerable: from 0.7557 to 0.6273. As concluded for the combined 1 gram + POS-pattern feature-set, using a combination of 1 gram and 2 gram features doesn't help improve cross-domain classification accuracies either. On average an accuracy of 0.6178 is reported, with average micro F1 and macro F1 scores of 0.6404 and 0.6382, which are the lowest reported numbers over all the considered feature-sets. Compared to in-domain performance, accuracy decreases from 0.7466 to 0.6178. In general, the observed differences between the reported micro and macro F1 scores are a little higher than for the in-domain experiments. From this it can be concluded that in a cross-domain setting the difference in performance between balanced and unbalanced evaluation-sets is more of an issue. This is probably caused by the different priors, which for the different domains all contribute to the overall classification; it can be expected that this issue will be resolved when performing a domain-adaptation. A comparison to other research can be made for this classifier, as Whitehead and Yaeger (2009) provide a cross-domain analysis on the multi-domain dataset covering a subset of the datasets used in this research. Whitehead and Yaeger (2009) perform classifications with SVM and report an accuracy of 0.61 using the same approach chosen in this thesis, the LOO approach. In this research an accuracy of 0.6383 is found over the same subset of datasets.
In this research, however, more datasets are taken into account than in Whitehead and Yaeger (2009). Another important difference is the classifier used: here multinomial Naive Bayes is used, where Whitehead and Yaeger use SVM. Despite these differences, the reported accuracies for the same datasets are comparable.


Analysis

In order to compare results consistently between different methods, Table 16 presents the classification results for a classifier trained on a single domain and evaluated on (a subset of) multiple domains. The classification results on different datasets are shown for the classifier trained on the kitchen & housewares dataset. The considered evaluation-sets are: dvd, sports & outdoor, health & personal care, apparel and electronics. With this subset of datasets it is possible to compare results consistently between different feature-sets and between different applied methods (for the remainder of this chapter). The six datasets are all the same in size (2000 documents) and have the same balance (1000 positive vs 1000 negative examples). The accuracies and the micro F1 and macro F1 scores for the different cross-domain classifications are shown in Table 16.
Kitchen & housewares cross-domain baseline

Feature-set       metric           dvd             sports &        health &        apparel         electronics     average
                                                   outdoor         personal care
1 gram            accuracy         0.6485          0.7680          0.7705          0.7415          0.7315          0.7320
                  micro/macro f1   0.5531/0.5489   0.7844/0.7844   0.7570/0.7559   0.7733/0.7722   0.7692/0.7679   0.7274/0.7259
2 gram            accuracy         0.6515          0.7560          0.7405          0.7270          0.7310          0.7212
                  micro/macro f1   0.6548/0.6525   0.7843/0.7836   0.7574/0.7569   0.7665/0.7662   0.7707/0.7700   0.7467/0.7458
1 gram + POS      accuracy         0.6455          0.7585          0.7445          0.7310          0.7125          0.7184
                  micro/macro f1   0.5381/0.5342   0.7652/0.7647   0.7499/0.7481   0.7561/0.7547   0.7591/0.7579   0.7137/0.7119
1 gram + 2 gram   accuracy         0.6660          0.7500          0.7480          0.6750          0.7445          0.7167
                  micro/macro f1   0.6447/0.6423   0.7674/0.7671   0.7200/0.7188   0.7402/0.7398   0.7655/0.7646   0.7276/0.7265

Table 16: a subsection of the individual classification results for the classifier trained on the kitchen & housewares dataset.

As can be seen in Table 16, the classification results from the classifier trained on the kitchen & housewares dataset show little difference between the feature-sets. The best performing feature-set in terms of accuracy is the 1 gram feature-set with an average accuracy of 0.7320; the highest reported micro and macro F1 scores are for the 2 gram feature-set, however. The lowest performance is observed for the combined 1 gram + POS-pattern feature-set. Looking at the different domains, the results vary from evaluation-set to evaluation-set. The dvd evaluation-set performs below average for all feature-sets. The sports & outdoor set and the health & personal care set perform consistently above average for all feature-sets, which says something about the generality of those domains (and about the possibility of using them in a cross-domain setting). The micro and macro F1 scores show very little difference between them, which is to be expected as the evaluated datasets are all perfectly balanced; as a result the evaluation-sets can be expected to be near-perfectly balanced as well, leaving little difference between micro and macro F1. It can be expected that the sports & outdoor and health & personal care domains will achieve the best performance when used for a domain-adaptation. For the following experiments conducting domain-adaptations, this table will be used as a reference to compare against. To take a detailed look at the domain-specific features this research is concerned with, Table 17 shows the probabilities assigned by the different domain-specific trained classifiers to the 1 gram features “surprise” and “easy”. Upon inspection of the data these two features seemed to be used in a different sentimental context in different domains, making them domain-specific features.


                            surprise               easy
Dataset                     negative   positive    negative   positive
apparel                     0.00013    0.00009     0.00015    0.00081
baby                        0.00004    0.00002     0.00074    0.00284
beauty                      0.00006    0.00004     0.00010    0.00066
books                       0.00004    0.00004     0.00025    0.00052
camera_&_photo              0.00004    0.00005     0.00053    0.00189
camera                      0.00005    0.00006     0.00090    0.00364
cell_phones_&_service       0.00005    0.00004     0.00053    0.00185
computer_&_video_games      0.00004    0.00006     0.00058    0.00075
dvd                         0.00007    0.00010     0.00009    0.00019
electronics                 0.00005    0.00005     0.00047    0.00158
grocery                     0.00013    0.00011     0.00020    0.00081
health_&_personal_care      0.00005    0.00005     0.00047    0.00178
kitchen_&_housewares        0.00009    0.00009     0.00031    0.00209
magazines                   0.00006    0.00004     0.00012    0.00046
music2                      0.00005    0.00005     0.00011    0.00017
music                       0.00009    0.00022     0.00022    0.00009
outdoor_living              0.00006    0.00005     0.00030    0.00289
senpol                      0.00014    0.00016     0.00014    0.00018
senscale                    0.00016    0.00017     0.00026    0.00047
sports_&_outdoors           0.00004    0.00004     0.00042    0.00209
toys_&_games                0.00007    0.00007     0.00050    0.00135
video                       0.00005    0.00008     0.00009    0.00022

Table 17: the probabilities assigned by different classifiers to the 1 gram features “surprise” and “easy”.

As can be seen in Table 17, the 1 gram feature “surprise” can have a different connotation in different domains. For some domains this feature has a positive connotation (e.g. camera & photo, computer & video games, dvd); in other domains it has a negative connotation (e.g. apparel, baby, beauty). A less obvious example is the 1 gram feature “easy”. In the majority of the domains the feature “easy” is observed in a positive connotation (apparel, baby, etc.), but in the case of the music dataset it has a more negative connotation. Ideally, when performing a domain-adaptation, this domain-dependency would be captured by the generalizable features. In the analysis of the following experiments it will be evaluated whether this domain-specificity is captured by the generalizable features.

6.2.2 Subjectivity lexicon classifier

When considering the subjectivity lexicon classifier, the same numbers hold as observed for the in-domain subjectivity lexicon experiments: the subjectivity lexicon classifier doesn't use any data from any of the domains, so the performance remains the same whether in-domain or cross-domain. The average acquired accuracy therefore remains 0.6092, as in the previous experiment, which is the lowest among the cross-domain experiments. Although the accuracy is quite low, the micro F1 and macro F1 scores are the highest observed for this experiment: 0.7065 and 0.7049. As observed for the in-domain experiment, the subjectivity lexicon classifier is outperformed by all considered feature-sets. In spite of this, the observed difference between its micro and macro F1 scores is smaller than for the Naive Bayes classifier, for which the cross-domain differences between micro and macro F1 are even bigger than in-domain. From this it can be concluded that more stability is achieved by the subjectivity

lexicon classifier in case of differences between balanced and imbalanced evaluation-sets. 6.3.1 Adapted Naive Bayes using FCE As a first attempt to perform a cross-domain adaptation an experiment is conducted using the adapted Naive Bayes classifier using FCE to select generalizable features. In this section the results for this classifier are presented. First, the results for the different considered feature-sets are presented. Next, a detailed analysis is provided giving insight in the used generalizable features, the results for a subsection of the data, and the influence of the lambda parameter and different numbers of generalizable features used. The results for the adapted Naive Bayes classifier using FCE to select generalizable features are quite disappointing in comparison to the results reported by Tan et al. (2009) and in comparison to the naive cross-domain baseline. In Tan's research micro and macro F1 scores are reported of 0.8262 and 0.7962 on average (over 6 Chinese datasets). The best reported micro and macro F1 scores in this research are 0.6428 and 0.6413 for the combined 1 gram + POS-pattern feature-set. This difference in performance can be explained by a number of reasons. The used domains can be an issue, as can be seen in the experiments performance measures can vary quite from domain to domain implying that certain domains are quite suitable to perform a domain-adaptation, other domains show results far under average making them very unsuitable. As expected the difference between micro and macro F1 score are smaller in comparison to the naive cross-domain approach showing that the priors are fitted to the target-domain when performing a domain-adaptation.
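Since the analysis throughout this chapter leans on the contrast between micro and macro averaged F1 under balanced and imbalanced class distributions, a small sketch may make the difference between the two averages concrete. This is an illustrative implementation (the function name and label set are assumptions, not code from this research):

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels=("positive", "negative")):
    """Compute micro- and macro-averaged F1 over the given labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(prec, rec):
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    # Macro F1: average the per-class F1 scores, weighting each class
    # equally, so a minority class counts as much as a majority class.
    per_class = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(f1(prec, rec))
    macro = sum(per_class) / len(labels)

    # Micro F1: pool all counts first, so frequent classes dominate.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    prec = TP / (TP + FP) if TP + FP else 0.0
    rec = TP / (TP + FN) if TP + FN else 0.0
    micro = f1(prec, rec)
    return micro, macro
```

On a skewed evaluation-set a classifier that always predicts the majority class obtains a high micro F1 but a low macro F1; the gap between the two scores therefore serves as a stability indicator in the analyses above.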
Dataset                 1 gram                   2 gram                   1 gram + POS             1 gram + 2 gram
                        ac     micro  macro     ac     micro  macro     ac     micro  macro     ac     micro  macro
small sized datasets    0.5054 0.4562 0.4540    0.5532 0.5599 0.5579    0.5882 0.6129 0.6095    0.4943 0.4421 0.4369
medium sized datasets   0.5004 0.4323 0.4301    0.6239 0.6611 0.6589    0.5661 0.6513 0.6503    0.5520 0.5133 0.5120
big sized datasets      0.5548 0.6065 0.6046    0.6371 0.6905 0.6883    0.5944 0.6723 0.6719    0.5671 0.6696 0.6691

Table 18: relation between the performance and the size of the dataset for ANB using FCE.

Sometimes the supplied data appears to be too small to perform a suitable domain-adaptation. For example, as can be seen in Table 18, in the observed results for the 1 gram feature-set the smaller sized and unbalanced datasets show considerably worse performance. For the medium and bigger sized datasets these performance issues are less prevalent. In general it seems that mainly the small sized datasets have a very negative influence on the overall performance of the ANB classifier using FCE. In contrast, the bigger sized datasets all seem to increase in performance in comparison to naive cross-domain classification. It can be concluded that larger unbalanced datasets in general benefit from ANB using FCE in comparison to the naive cross-domain approach. Of the 38 datasets used in this research, the majority are of medium size (around 2000 examples); for these medium sized datasets the performance varies from dataset to dataset. This can be explained by the overall generality of the datasets used: some domains are more suitable for domain-adaptation than others. Focusing on the performance of the bigger sized datasets, it can be concluded that supplying ANB using FCE with more data has a beneficial effect on performance. Next to the data, the generalizable features (which in this case originate from the data) can play an important role, which will be explored in the next section.

Analysis
The adapted Naive Bayes classifier using FCE to select generalizable features performs considerably worse than expected. In Tan et al. (2009), micro F1 and macro F1 scores of 0.8262 and 0.7926 are reported for the same method on Chinese datasets. Although relating these results is difficult, the micro and macro F1 scores observed for this experiment are considerably lower. As an example, to gain more insight into the behavior of the classifier, a subsection of the domain-adaptation results for the kitchen & housewares classifier is presented. As a representative subsection, five different domain-adaptations of the kitchen & housewares classifier are chosen to be displayed in Table 19. The results shown are the accuracies and the micro F1 and macro F1 scores for the different classifications of the domain-adapted classifiers. For this experiment 500 generalizable features are used, lambda is set at 0.6 and 1 EM iteration is performed.
Kitchen & housewares ANB (FCE)
Feature-set                  dvd            sports &       health &       electronics    apparel        average
                                            outdoor        personal care
1 gram           accuracy    0.4975         0.5450         0.5385         0.5005         0.5175         0.5198
                 f1          0.1083/0.1075  0.3793/0.3769  0.3514/0.3490  0.3791/0.3791  0.3889/0.3879  0.3214/0.3201
2 gram           accuracy    0.5445         0.7070         0.7115         0.7070         0.6870         0.6714
                 f1          0.4908/0.4903  0.7446/0.7436  0.7529/0.7517  0.7570/0.7540  0.7295/0.7292  0.6949/0.6938
1 gram + POS     accuracy    0.4520         0.4980         0.4910         0.4925         0.4885         0.4844
                 f1          0.2956/0.2950  0.4975/0.4969  0.5120/0.5115  0.5889/0.5880  0.4888/0.4866  0.4766/0.4756
1 gram + 2 gram  accuracy    0.4820         0.5925         0.5865         0.6355         0.5620         0.5717
                 f1          0.0700/0.0690  0.4437/0.4399  0.4229/0.4224  0.5530/0.5524  0.4089/0.4073  0.3797/0.3782

Table 19: a subsection of the domain-adapted classifiers evaluated on the kitchen & housewares dataset

The chosen subset in Table 19 is a decent representation of the general trend observed in the results presented in Chapter 5. The results for the combined 1 gram + POS-pattern feature-set are the lowest among the different feature-sets: 0.4844. The highest performing feature-set is the 2 gram feature-set, with an accuracy of 0.6714. Compared to the naive cross-domain approach the reported performances are quite disappointing. All feature-sets lose considerably in accuracy compared to naive cross-domain; the smallest difference is for the 2 gram feature-set, with a difference in accuracy of 0.0498, and the biggest difference is observed for the combined 1 gram + POS-pattern feature-set, with a difference in accuracy of 0.234. The results for the adapted Naive Bayes classifier using FCE seem to vary from domain to domain. As observed for the naive cross-domain baseline, the sports & outdoor and health & personal care datasets show themselves on average to be the most suitable for domain-adaptations. This is especially the case for the 2 gram feature-set, where performances come near the naive cross-domain performance. For none of the individual domain-adaptations does the adapted Naive Bayes classifier using FCE seem to perform better in comparison to the naive cross-domain approach. This could be explained by the generalizable features used, which will be examined in the next section.

Generalizable features
As described, the adapted Naive Bayes method uses generalizable features selected by FCE between domains in order to perform a domain-adaptation. To illustrate these generalizable features, Table 20 presents the top 50 most common generalizable 1 gram features for the different domain-adaptations of the kitchen & housewares dataset.


you, not, several, zip, wider, with, is, out, your, widening, was, in, nice, you're, widened, very, i, my, years, when, to, have, it, wrung, well, this, for, else, would't, weighty, the, and, control, work, weighs, that, a, but, wires, wasted, on, would, are, windstar, waste, of, so, all, will, warping

Table 20: top 50 most common generalizable 1 gram features used for the kitchen & housewares dataset

In Table 20 the top 50 generalizable features from the domain-adaptations for the kitchen & housewares dataset are shown for the 1 gram feature-set. It is noticeable that a lot of very common terms are present in the results. A lot of non-sentimental, domain-independent terms like "you", "on" and "of" are among the results, whereas intuitively one would expect sentimental, domain-independent terms like "good", "excellent", "poor" and "bad". When comparing these results to the terms reported by Tan et al. (2009) (see Table 1), the conclusion can be drawn that these results leave room for improvement. There are 7 terms which could be considered to be sentimental: "very", "not", "so", "nice", "wrung", "waste" and "wasted". In comparison to the 1 gram features, the 2 gram features from the domain-adaptation for the kitchen & housewares dataset show the same issues. To gain more insight into the relation between this performance and the generalizable features used for the 2 gram experiments, the top 50 generalizable features used for the kitchen & housewares domain-adaptations are presented in Table 21.
to the, in my, you toss, without a, weeks yes, would not, i would, you pay, with the, we had, that will, i have, yogurt after, with i, wash so, so this, i bought, yes it, with a, wash first, on the, for a, years it, will never, was looking, of these, because i, wouldnt turn, will keep, very well, my wife, and the, worth the, which point, very small, looking for, and it, worked for, which i, very pleased, it is, all over, without hot, what i, very carefully, is a, your juice, without any, well as, used the

Table 21: top 50 most common generalizable 2 gram features used for the kitchen & housewares dataset

Looking at the results for the 2 gram generalizable features, the features intuitively make more sense. Of the terms in Table 21, seven could be considered to be sentimental features: "without any", "very well", "without a", "very small", "very pleased", "the first" and "the best". Just as for Table 20, the observation can be made that a lot of non-sentimental general terms are present, like "to the", "in my" and "so this". In the work by Tan et al. (2009) the importance of the generalizable features is stressed in order to allow a domain transfer. A possible explanation of the disappointing results can be the quality of the generalizable features used. Therefore, the focus of this thesis lies in the attempt to explore methods of selecting better generalizable features, comparable to the features reported in Table 1.
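The direction pursued in this thesis is to prefer generalizable features that are not only shared between domains but also sentiment-bearing. As a deliberately simplified sketch of that idea, shared features could be filtered through a sentiment lexicon before ranking (the function name and the min-count scoring rule are illustrative assumptions, not the importance measure defined in Chapter 3):

```python
def sentiment_filtered_features(source_counts, target_counts, lexicon, n=500):
    """Rank features occurring in both domains, keeping only those that
    also appear in a subjectivity lexicon, by their combined frequency.
    The scoring rule (min of the two raw counts) is illustrative.

    source_counts / target_counts: dicts mapping feature -> raw count.
    lexicon: a collection of known sentiment-bearing terms.
    """
    shared = set(source_counts) & set(target_counts) & set(lexicon)
    ranked = sorted(shared,
                    key=lambda f: min(source_counts[f], target_counts[f]),
                    reverse=True)
    return ranked[:n]
```

A filter of this kind would exclude terms such as "the" and "of" outright, directly addressing the non-sentimental terms observed in Table 20.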


Influence of Lambda
To gain insight into the behavior of the lambda parameter, which controls the balance between the source-domain and the target-domain, an additional experiment is performed measuring the effect of different values of this parameter. Lower values for lambda put the emphasis on the source-domain data, higher values for lambda put the emphasis on the target-domain. It makes sense to evaluate different values for this parameter, as there is likely to be an optimal balance between the use of source-domain data and target-domain data in the adapted Naive Bayes classifier. An argument to put more weight on the source-domain data could be that, while performing a domain-adaptation, the source-domain data is the only data from which (human annotated) labels are used to train the classifier. The target-domain data is considered to be unlabeled, making the source-domain data the only reliable source of information; the target-domain data is expected to have a similar data-distribution, but this is an assumption rather than a certainty. However, the target-domain data is the data over which evaluation is performed, so an argument to put more weight towards the target-domain data could be that, assuming both source and target domain have similar distributions, the target-domain observations should be more important in fitting the Naive Bayes model. To properly evaluate these hypotheses, the experiment performed by Tan et al. (2009) is repeated, evaluating the lambda parameter for different values, starting at a lambda value of 0.05 up to a lambda value of 1.0. In order to evaluate the performance consistently, N_FCE is set to 500 and only 1 EM iteration is performed.
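The role of lambda can be made concrete with a sketch of the interpolation idea: each class-conditional word probability is estimated as a lambda-weighted mixture of source-domain and target-domain counts. Note that Tan et al. (2009) fold this weighting into the EM updates rather than using a single closed form, so the snippet below is a simplification and all names are illustrative:

```python
import math

def interpolated_log_prob(word, cls, src_counts, tgt_counts,
                          vocab_size, lam=0.6, alpha=1.0):
    """Lambda-interpolated Naive Bayes word likelihood.

    lam weights the (unlabeled, EM-estimated) target-domain counts,
    (1 - lam) weights the labeled source-domain counts.
    src_counts / tgt_counts: dicts mapping class -> {word: count}.
    alpha is the Laplace smoothing constant.
    """
    def smoothed(counts):
        total = sum(counts[cls].values())
        return (counts[cls].get(word, 0) + alpha) / (total + alpha * vocab_size)

    p = (1 - lam) * smoothed(src_counts) + lam * smoothed(tgt_counts)
    return math.log(p)
```

With lam = 0.0 the estimate reduces to a source-domain-only classifier; with lam = 1.0 only the (pseudo-labeled) target-domain counts are used, which corresponds to the endpoints of the graphs in Figure 1.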
Because fitting cross-domain classifiers is quite time-consuming, especially for all 38 datasets used in the full experiments of this research, the choice is made to use the same representative subset of datasets as before to evaluate the behavior of the lambda parameter. For each feature-set (1 gram, 2 gram, combined 1 gram + POS-pattern, combined 1 gram + 2 gram) a graph is plotted of the different values for the lambda parameter against the performance of the classifier. The performance of the classifier is expressed in three different measures: accuracy, micro F1 and macro F1 scores. In Figure 1 the different graphs are shown for the adapted Naive Bayes classifier using FCE.

[Figure 1: four line graphs, one per feature-set (1 gram, 2 gram, 1 gram + POS, 1 gram + 2 gram), each plotting accuracy, micro F1 and macro F1 of the ANB (FCE) classifier against lambda values from 0.05 to 1.0.]
Figure 1: Performance of the ANB classifier using FCE with different feature-sets versus different values for the lambda parameter.

Looking at the different graphs plotted in Figure 1, for all the feature-sets it is noticeable that the highest performances are observed at the start of the graph, at lambda values around 0.05-0.2. For the 2 gram feature-set the performance stays stable up until a lambda of 0.2. For all the feature-sets, lambda values greater than 0.2 show slowly declining performances. The decline in terms of micro and macro F1 scores is sharper in comparison to the decline in accuracy.

In contrast to the findings reported by Tan et al. (2009), where peaks are reported at higher values of lambda, in this research peak performances are observed for values of lambda around 0.05. This indicates that mainly the old-domain data is important for cross-domain adaptation on English datasets. The source-domain data is labeled data, which is important because we can be certain of its semantic orientation. The target-domain data holds the same amount of important information but (in the cross-domain setting of this experiment) is considered to be unlabeled. Intuitively it makes sense that putting the emphasis on the target-domain data (with higher values for lambda, around 0.8) results in an optimal fitting of the classifier. However, using FCE as a bridge between the two domains doesn't seem to aid the performance measures. The difference in observations can be explained by differences in the data or by the sub-optimal ability of FCE to select (enough) useful generalizable features. For the POS-pattern features, for instance, it is observed that sometimes, especially for smaller datasets, very few POS-patterns containing an adverb or adjective meeting the pattern requirements are found in the limited available training-data. Therefore, it can be the case that the performance curve is controlled more by the availability and quality of the features in the data than by the influence of lambda. As has been shown in Table 16 and Table 19, the differences between the domains in a cross-domain adaptation setting are quite big, so the selection of qualitative generalizable features is necessary. For the suggested improvements this experiment will be repeated, gaining insight into whether improving the generalizable features confirms the case made in the research by Tan et al. (2009) that assigning higher weight to unlabeled new-domain data aids classification performance.

Influence of number of used generalizable features
In the research by Tan et al. (2009) the claim is made that the generalizable features play an important role in the performance of the domain-adaptation. The generalizable features are used to provide a domain-adaptation from source-domain to target-domain and are the main suggestion for improvement in this thesis. To explore the importance of the generalizable features and of the number of generalizable features used, an additional experiment is set up. This experiment is based on the experiment performed by Tan et al. (2009) and is set up in order to gain insight into the behavior of the classifier with different numbers of used generalizable features. In this experiment the number of generalizable features is varied between 50 and 1200. To consistently measure the influence of the number of generalizable features, the lambda parameter is kept at 0.6 and EM is performed for 1 iteration. Just as in the previous experiment, the same subset of datasets is used. It is expected that the lower the number of generalizable features, the lower the classifier performs. It can also be expected that there is some maximum to the number of features used, as adding too many generalizable features would imply using more and more features which aren't general (but rather domain-specific). As in the previous experiment, performance is measured in terms of accuracy, micro F1 score and macro F1 score. In Figure 2 the performance of the adapted Naive Bayes classifier is plotted over an increasing number of generalizable features. Different feature-sets are evaluated: 1 gram, 2 gram, combined 1 gram + POS-pattern and combined 1 gram + 2 gram.
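FCE favors features that occur frequently in both domains and with a similar relative frequency in each, so that they can serve as a bridge. The snippet below is one plausible rendering of such a criterion, given purely for illustration; the exact FCE formula used in this research is the one defined in Chapter 3, and the function name and alpha constant are assumptions:

```python
import math

def fce_style_rank(src_freq, tgt_freq, n=500, alpha=1e-6):
    """Rank shared features: the score is high when a feature is frequent
    in BOTH domains (large product) and has a similar relative frequency
    in each (small absolute difference). alpha avoids division by zero.

    src_freq / tgt_freq: dicts mapping feature -> relative frequency.
    """
    shared = set(src_freq) & set(tgt_freq)

    def score(w):
        ps, pt = src_freq[w], tgt_freq[w]
        return math.log(ps * pt / (abs(ps - pt) + alpha))

    return sorted(shared, key=score, reverse=True)[:n]
```

A criterion of this shape also explains the observations in Table 20: highly frequent function words such as "the" and "of" score well because their relative frequency is almost identical across review domains, even though they carry no sentiment.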


[Figure 2: four line graphs, one per feature-set (1 gram, 2 gram, 1 gram + POS, 1 gram + 2 gram), each plotting accuracy, micro F1 and macro F1 of the ANB (FCE) classifier against the number of generalizable features used, from 50 to 1200.]

Figure 2: Performance of the ANB (FCE) classifier with different feature-sets versus different number of generalizable features used.

Looking at Figure 2, for the 1 gram and 2 gram feature-sets the performance in terms of accuracy is stable up until the number of features used exceeds 200. With more than 200 features the performance decreases slightly, except for the micro and macro F1 scores of the 1 gram feature-set, which increase in the 200-800 interval. The performance for the combined 1 gram + POS-pattern and combined 1 gram + 2 gram feature-sets shows a slow increase in performance measures as the number of used generalizable features increases; the increase in terms of micro F1 and macro F1 scores for the combined 1 gram + 2 gram feature-set is steeper. In Tan et al. (2009), where this experiment is performed on Chinese datasets, the performance for different numbers of features varies from 0.6 to 0.85 in terms of micro F1 and macro F1 scores, with performance measures peaking at around 600 used generalizable features. In this research the performance curves are similar in the sense that performance measures increase as the number of used generalizable features increases, except for the 2 gram feature-set. The results differ in the fact that performance in these experiments increases up to a point and then stays stable for all evaluated feature-sets. In the results reported by Tan et al. (2009) there is a clearly observed peak in the graph after which performance decreases again. In the experiments conducted in this research such a decrease is not observed; rather, the results seem to converge to a maximum. As for the lambda experiment, this can be explained by the method used to select generalizable features. The generalizable features used in this experiment contain a considerable number of features which are non-sentimental and which might skew the domain-adaptation.
Using more features does seem to increase performance for most of the feature-sets, but the convergence to a maximum performance that all feature-sets show implies that using more features also means using more non-sentimental features.

6.3.2 Adapted Naive Bayes using subjectivity lexicon
As a first improvement over FCE to select generalizable features, the use of a subjectivity lexicon is proposed. Instead of selecting generalizable features with FCE, in this experiment a subjectivity lexicon is used to serve as the generalizable features. In this section the results for the experiments with this approach are analyzed. First, the results for the different considered feature-sets are presented. Next, a detailed analysis is provided for a subsection of the considered datasets. As a final experiment the influence of the lambda parameter for this approach is analyzed. In this experiment the number of generalizable features used isn't considered because, in contrast to the generalizable features which are calculated as a ranked list, the subjectivity lexicon is a static list. This static list makes no distinction in how generalizable its features are. Because of this limitation it makes no sense to explore the behavior of selecting different numbers of features for the adapted Naive Bayes classifier using a subjectivity lexicon, as any selection made from the subjectivity lexicon would be random. When comparing the results of ANB using a subjectivity lexicon, it seems that performance for the 1 gram feature-set doesn't increase in comparison to the previous method. Accuracy for the 1 gram feature-set is 0.5015 in comparison to 0.5060 using FCE. Micro and macro F1 are almost identical; as the differences between micro and macro F1 are very small, the same observations hold as in the previous experiment. Looking at the results for the 2 gram feature-set, the subjectivity lexicon based generalizable features show an insignificant increase in performance. When analyzing the results for the combined 1 gram + POS-pattern feature-set, the performance of the subjectivity lexicon based generalizable features is considerably lower in comparison to the generalizable features selected by FCE. The big overall differences for this feature-set can be explained by the use of the POS-patterns. POS-patterns are often very data-specific extracted patterns (e.g. "great camera" or "bad restaurant") which have little probability of being present in the very general subjectivity lexicon. When there are very few features which co-occur in both the subjectivity lexicon and the data, no good match can be made to facilitate a domain-adaptation. Consequently, the fit of the model is likely to be quite poor. This argument is supported by the small differences in performance for the bigger stanford dataset. This dataset is quite big and contains a lot of different terms, so the probability of this data having a match with the subjectivity lexicon is the highest. Performance of the combined 1 gram and 2 gram feature-set shows a small difference in performance compared to ANB using FCE (0.0189). In general it can be concluded that for this experiment there is no improvement from using a subjectivity lexicon as generalizable features.
Dataset                 1 gram                   2 gram                   1 gram + POS             1 gram + 2 gram
                        ac     micro  macro     ac     micro  macro     ac     micro  macro     ac     micro  macro
small sized datasets    0.4907 0.4196 0.4174    0.5396 0.5189 0.5168    0.4870 0.5132 0.5083    0.3768 0.3257 0.3202
medium sized datasets   0.4990 0.4496 0.4472    0.6360 0.6685 0.6665    0.4685 0.5011 0.5001    0.5699 0.5446 0.5433
big sized datasets      0.5579 0.6236 0.6216    0.6419 0.6941 0.6919    0.5737 0.6293 0.6288    0.5705 0.6811 0.6806

Table 22: relation between the performance and the size of the dataset for ANB using a subjectivity lexicon.

However, looking at the individual datasets that do increase in performance in comparison to ANB using FCE, it is noticeable, as shown in Table 22, that the bigger datasets (stanford, senpol, senscale) do show increases in performance. The medium sized datasets seem to be more dependent on the generality of the domain, as observed for the experiment conducted with ANB using FCE. Consistent with the previous experiment, the smaller sized datasets seem to show the most decreases in performance. As observed for the previous experiment, the use of a subjectivity lexicon can aid performance compared to ANB using FCE in case enough (general) data can be supplied.

Analysis
As an illustrative example, a representative subsection of the domain-adaptations for the kitchen & housewares classifier is presented. As a representative subsection, again five domain-adaptations are chosen to be displayed in Table 23. The results shown are the accuracies and the micro F1 and macro F1 scores for the different classifications by the domain-adapted classifier. The experiment is set up using the same parameters as before.
Kitchen & housewares ANB (SL)
Feature-set                  dvd            sports &       health &       electronics    apparel        average
                                            outdoor        personal care
1 gram           accuracy    0.4910         0.5335         0.5180         0.5425         0.5415         0.5253
                 f1          0.2448/0.2442  0.4356/0.4338  0.4428/0.4420  0.5454/0.5430  0.5285/0.5272  0.4394/0.4380
2 gram           accuracy    0.5495         0.7115         0.7120         0.7065         0.6880         0.6735
                 f1          0.5019/0.5018  0.7495/0.7488  0.7538/0.7526  0.7575/0.7544  0.7313/0.7310  0.6988/0.6977
1 gram + POS     accuracy    0.4560         0.4735         0.4870         0.4710         0.4680         0.4711
                 f1          0.2823/0.2819  0.4408/0.4400  0.5034/0.5025  0.5657/0.5651  0.4400/0.4386  0.4464/0.4456
1 gram + 2 gram  accuracy    0.4895         0.6130         0.6345         0.6775         0.6090         0.6047
                 f1          0.1442/0.1429  0.5343/0.5320  0.5743/0.5720  0.6481/0.6463  0.5660/0.5656  0.4933/0.4917

Table 23: a subsection of the domain-adapted classifiers evaluated on the kitchen & housewares dataset

The chosen subset in Table 23 is a decent representation of the general trend observed in the results presented in Chapter 5. When comparing these results to ANB using FCE, there is a small improvement for all the feature-sets except for the 1 gram + POS-pattern feature-set. The 1 gram feature-set shows an increase in accuracy of 0.0055, the 2 gram feature-set of 0.0021, and the combined 1 gram and 2 gram feature-set improves the most, with 0.0330. The 1 gram + POS-pattern feature-set decreases in accuracy from 0.4844 to 0.4711. In comparison to the baseline naive cross-domain approach, all performance measures are still considerably lower. As for FCE, the 2 gram feature-set has the smallest difference in performance (0.0477); the 1 gram + POS-pattern feature-set differs the most, with a difference of 0.2474. In terms of micro and macro F1 scores, the 1 gram and the combined 1 gram and 2 gram feature-sets show more considerable increases compared to FCE. The 2 gram and combined 1 gram + POS-pattern feature-sets show differences similar to those observed for the accuracies. The increase in performance for the 1 gram, 2 gram and combined 1 gram and 2 gram feature-sets shows that a small performance increase can be acquired by using a subjectivity lexicon instead of selecting generalizable features by FCE. The decrease in performance for the combined 1 gram + POS-pattern feature-set can be explained by the fact that POS-pattern features are often very domain-specific features, which are not present in the subjectivity lexicon. The results for the adapted Naive Bayes classifier using a subjectivity lexicon vary from domain to domain. As observed in the previous experiments, the sports & outdoor and health & personal care datasets show themselves on average to be the most suitable for domain-adaptations, especially for the 2 gram feature-set, where performance comes near the naive cross-domain performance.
For all individual domain-adaptations, ANB using a subjectivity lexicon doesn't seem to perform better in comparison to the naive cross-domain approach. From this experiment it can be concluded that using a subjectivity lexicon instead of FCE does seem to increase classification accuracies a little, but not as much as expected. This can be explained by the use of the adapted Naive Bayes classifier, which is specifically designed to use both source and target-domain data; using generalizable features that do not originate from the data can therefore limit the potential of this method. As the generalizable features originate from a subjectivity lexicon, there is no guarantee that there will be a proper match between the used features and the domain-data. Another limitation can be the used subjectivity lexicon itself: as discussed in section 3.1.2 (Chapter 3), it is quite large and can therefore contain features which are not domain-independent enough in this context to serve as generalizable features.
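The mismatch argument above can be quantified with a simple coverage statistic: the fraction of a feature vocabulary that also occurs in the subjectivity lexicon. This is an illustrative diagnostic (not a measure computed in this research) that would be expected to be near zero for POS-pattern vocabularies and substantially higher for the 1 gram vocabularies of large datasets:

```python
def lexicon_coverage(feature_vocab, lexicon):
    """Fraction of the feature vocabulary that appears in the lexicon,
    a rough proxy for how well the lexicon can bridge into a domain."""
    vocab = set(feature_vocab)
    if not vocab:
        return 0.0
    return len(vocab & set(lexicon)) / len(vocab)
```

For instance, a multi-word pattern such as "great camera" will rarely be found in a lexicon of single subjective terms, driving the coverage (and with it the quality of the fit) down.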

Influence of Lambda
As for the previous experiment, it makes sense to explore the behavior of the lambda parameter with regard to ANB using a subjectivity lexicon. As the features differ from the previous experiment, it can be expected that the performance curves for this classifier are a little different as well, which will have influence on the optimal balance between source and target-domain data. To explore this optimal balance, the same datasets, parameter settings and feature-sets are used as in the previous experiment. In Figure 3 the different graphs are shown for the adapted Naive Bayes classifier using a subjectivity lexicon.

[Figure 3: four line graphs, one per feature-set (1 gram, 2 gram, 1 gram + POS, 1 gram + 2 gram), each plotting accuracy, micro F1 and macro F1 of the ANB (SL) classifier against lambda values from 0.05 to 1.0.]

Figure 3: Performance of the ANB classifier using a subjectivity lexicon with different feature-sets versus different values of the lambda parameter.

Considering the graphs presented in Figure 3, it is noticeable that the same curves are observed for ANB using a subjectivity lexicon as for ANB using FCE. For the 1 gram feature-set the main difference is that the decrease in micro and macro F1 score is less steep compared to ANB using FCE. The curve for the 2 gram feature-set is almost identical to the previous experiment, consistent with the observed results for the 2 gram feature-set in Table 12. The curve for the combined 1 gram + POS-pattern feature-set shows on average lower performance measures in comparison to ANB using FCE but has the same shape. The combined 1 gram + 2 gram feature-set shows on average higher performance and shows little difference between the accuracy line and the micro and macro F1 lines, indicating more stability. When using a subjectivity lexicon instead of automatically generated generalizable features, all graphs show a decrease in accuracy when lambda increases to 1.0. In that case no domain-adaptation is performed and only the source-data is used. It seems that using a mixture of data does aid performance but, contrary to Tan et al. (2009), the emphasis lies on the old-domain data. The analysis of the POS-pattern feature-set shows that the varying results can be explained by the fact that there is no proper match between the POS-patterns observed in the data (which are sometimes quite few, especially when the training-data is small) and the generalizable features used. The generalizable features in this experiment are the subjectivity lexicon, which holds very general terms. These terms have little relation with the often very domain-specific POS-pattern features. Using EM to perform a domain-adaptation between those two can therefore be expected to show little improvement.
Comparing the graphs in Figure 3 to the graphs in Figure 1 (influence of lambda on ANB using FCE), apart from the described differences in performance the curves look very similar. This would imply that the setting used for lambda does not make much of a difference between these two methods. As observed for FCE, a possible explanation can be the quality of the generalizable features. Where in the experiment considering ANB with FCE the quality of the generalizable features prevented a successful domain-adaptation, the same conclusion can be drawn for the subjectivity lexicon. It can be concluded that the subjectivity lexicon is an insufficient representation of generalizable features for an arbitrary target-domain.

6.3.3 Adapted Naive Bayes using improved generalizable features
The final experiment conducted in this research uses the adapted Naive Bayes classifier with improved generalizable features. The selection of the generalizable features for this experiment includes an importance measure which has been introduced in Chapter 3. The results for the experiment with this classifier are analyzed in this section. First, the results for all the different considered feature-sets are presented, after which a more detailed analysis is provided for the subsection of the used datasets, together with details about the influence of the lambda parameter and the use of different numbers of generalizable features. Comparing the results observed for the 1 gram feature-set for ANB using improved generalizable features, there is a slight, insignificant increase in performance in comparison to ANB using FCE. Micro and macro F1 are almost identical; as the differences between micro and macro F1 are very small, the same observations hold as in the previous experiment.
The increase in performance for the 2 gram feature-set is more pronounced at 0.0036, showing a slight improvement from using improved generalizable features. As observed for ANB using a subjectivity lexicon, the combined 1 gram + POS-pattern feature-set does not improve its performance measures with improved generalizable features. The big overall differences for this feature-set can be explained by the lack of representative POS-patterns in the data. As discussed before, sometimes very few POS-patterns are present in the data, making them unsuitable in those cases for performing domain-adaptations. Performance of the combined 1 gram + 2 gram feature-set shows a small difference compared to ANB using FCE (0.0064). In general it can be concluded that for this experiment there is no real improvement from using improved generalizable features. Furthermore, the high standard deviation indicates that the stability has decreased and the quality of the domain-adaptation is more dependent on the supplied data.
Dataset                  1 gram                    2 gram                    1 gram + POS              1 gram + 2 gram
                         ac     micro  macro      ac     micro  macro      ac     micro  macro      ac     micro  macro
small sized datasets     0.5024 0.4458 0.4436     0.5579 0.5636 0.5616     0.5026 0.5245 0.5196     0.4986 0.4488 0.4436
medium sized datasets    0.5022 0.4429 0.4408     0.6278 0.6670 0.6648     0.4829 0.5208 0.5197     0.5599 0.5281 0.5267
big sized datasets       0.5528 0.6061 0.6041     0.6346 0.6886 0.6864     0.5822 0.6164 0.6158     0.5678 0.6701 0.6696
Table 24: relation between the performance and the size of the dataset for ANB using improved generalizable features.

However, in comparison to ANB using a subjectivity lexicon, all feature-sets show slight increases in performance. As observed for ANB using a subjectivity lexicon, the bigger datasets (stanford, senpol, senscale) do show increases in performance, yielding the best results observed over all the methods, as can be seen in Table 24. The medium sized datasets seem to be more dependent on the generality of the domain, as observed in the earlier experiments. Consistent with the previous experiment, the smaller sized

datasets seem to cause most decreases in performance. As observed for the previous experiment, the use of improved generalizable features can aid performance when enough (general) data can be supplied.

Analysis

As an illustrative example, a representative subsection of the domain-adaptations for the kitchen & housewares classifier is presented. As a representative subsection, again five domain-adaptations are chosen for display in Table 25. The results shown are the accuracies and the micro F1 and macro F1 scores for the different classifications by the domain-adapted classifier. The experiment is set up using the same parameters as before.
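The domain-adaptation evaluated in these experiments relies on EM over unlabeled target-domain documents. The following sketch shows one such iteration under simplifying assumptions (multinomial Naive Bayes, Laplace smoothing, no lambda mixing or generalizable-feature weighting); it illustrates the mechanism rather than reproducing the thesis's actual implementation.

```python
import math
from collections import defaultdict

def log_posteriors(doc, log_prior, log_like, unk=math.log(1e-6)):
    """Unnormalized log P(c|doc) under a multinomial Naive Bayes model."""
    return {c: log_prior[c] + sum(log_like[c].get(w, unk) for w in doc)
            for c in log_prior}

def em_step(target_docs, log_prior, log_like, vocab):
    """One EM iteration: soft-label the unlabeled target documents with
    the current model (E-step), then re-estimate word likelihoods from
    the resulting fractional counts (M-step, Laplace-smoothed)."""
    classes = list(log_prior)
    counts = {c: defaultdict(float) for c in classes}
    totals = {c: 0.0 for c in classes}
    for doc in target_docs:
        lp = log_posteriors(doc, log_prior, log_like)
        m = max(lp.values())
        z = sum(math.exp(v - m) for v in lp.values())
        for c in classes:
            resp = math.exp(lp[c] - m) / z   # responsibility of class c for doc
            for w in doc:
                counts[c][w] += resp
                totals[c] += resp
    return {c: {w: math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
                for w in vocab} for c in classes}

# Toy model: "great" leans positive, "awful" leans negative.
vocab = {"great", "awful"}
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_like = {"pos": {"great": math.log(0.7), "awful": math.log(0.3)},
            "neg": {"great": math.log(0.3), "awful": math.log(0.7)}}
new_like = em_step([["great"], ["awful"]], log_prior, log_like, vocab)
```

Re-estimating from the target documents sharpens the class-conditional likelihoods toward the target-domain word usage, which is the behavior the single EM iteration in these experiments is meant to exploit.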
Kitchen & housewares ANB (IMP)

Feature-set              dvd            electronics    sports &       health &       apparel        average
                                                       outdoor        personal care
1 gram         accuracy  0.4975         0.5420         0.5335         0.5000         0.5240         0.5194
               f1        0.1083/0.1075  0.3567/0.3550  0.3321/0.3290  0.3582/0.3585  0.4138/0.4129  0.3138/0.3126
2 gram         accuracy  0.5480         0.7070         0.7110         0.7060         0.6835         0.6711
               f1        0.4967/0.4963  0.7437/0.7427  0.7513/0.7503  0.7566/0.7536  0.7254/0.7250  0.6947/0.6936
1 gram + POS   accuracy  0.4615         0.4795         0.4825         0.4650         0.4820         0.4741
               f1        0.2628/0.2624  0.4290/0.4284  0.4717/0.4708  0.5307/0.5298  0.4264/0.4257  0.4241/0.4234
1 gram +       accuracy  0.4820         0.5820         0.5835         0.6360         0.5665         0.5700
2 gram         f1        0.0700/0.0691  0.4121/0.4087  0.4063/0.4054  0.5387/0.5377  0.4185/0.4163  0.3691/0.3674
Table 25: a subsection of the domain-adapted classifiers evaluated on the kitchen & housewares dataset
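Tables like the one above report accuracy alongside micro and macro F1. The sketch below spells out the standard definitions used throughout this chapter (this is the conventional computation, not code from the thesis): micro F1 pools counts over all classes, macro F1 averages per-class F1 scores, so the two diverge mainly on imbalanced data.

```python
def micro_macro_f1(y_true, y_pred, classes):
    """Return (micro F1, macro F1) for single-label classification."""
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p was wrong
            fn[t] += 1   # true class t was missed
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# Toy illustration with one misclassified positive example.
micro, macro = micro_macro_f1(["pos", "pos", "neg", "neg"],
                              ["pos", "neg", "neg", "neg"],
                              ["pos", "neg"])
```

On the roughly balanced binary review datasets used here the two measures stay close, which matches the small micro/macro gaps reported in the tables.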

The chosen subset in Table 25 is a decent representation of the general trend observed in the results presented in Chapter 5. When comparing these results to ANB using FCE, the results for all the feature-sets are almost identical. The performances for the 1 gram feature-set, 2 gram feature-set and combined 1 gram + 2 gram feature-set are less than 0.002 lower. For the micro and macro F1 scores the same differences are observed. The performance of the 1 gram + POS-pattern feature-set is a little over 0.01 lower in comparison to FCE. In comparison to ANB using a subjectivity lexicon, which performed slightly better than FCE, the results are consequently also lower; however, the decrease in performance for the combined 1 gram + POS-pattern feature-set is somewhat smaller, at 0.0103. The results for ANB using improved generalizable features vary from domain to domain. As observed in the FCE experiments, the sports & outdoor and health & personal care datasets show on average to be the most suitable for domain-adaptations, especially for the 2 gram feature-set, where performance comes near naive cross-domain performance. For none of the individual domain-adaptations does ANB using improved generalizable features seem to perform better than the naive cross-domain approach or than ANB using FCE. Just as for the previous two experiments, it can be concluded that the attempt to perform a successful domain-adaptation has failed. Just as for the previous experiments, the cause can be found in the used generalizable features, which will be further explored in the next two sections.

Generalizable features

To illustrate the generalizable features generated by the improved method, in Table 26 the top 50 most

common generalizable 1 gram features for the different domain-adaptations of the kitchen & housewares dataset are presented.
very, several, wrung, wicks, warping, out, recommend, would't, white, want, good, nice, worked, whereby, walmar, even, make, work, welded, waiste, years, it's, won't, weighty, w, well, great, wires, weighs, voided, use, control, windstar, week, versatility, up, back, wider, wasted, usually, time, zippers, widening, waste, using, think, zip, widened, washings, used
Table 26: Top 50 most common generalizable 1 gram features used for the kitchen & housewares dataset

In Table 26 the calculated improved generalizable features are shown for the domain-adaptations for the kitchen & housewares dataset. Comparing these features to those observed in Table 20 (FCE), the features for the improved method contain an increased number of sentimental terms. For the 1 gram features, sentimental domain-independent terms like “good”, “well”, “recommend”, “nice”, “great”, “wrung”, “wasted”, “waste”, “voided” and “versatility” are present among the resulting terms, which is more than among the features observed in Table 20. Although the terms reported in Table 26 seem to be an improvement over the results reported in Table 20, plenty of common non-sentimental terms are still present. Examples of non-sentimental domain-independent terms in Table 26 are “years”, “use”, “up” and “time”. To gain more insight into the relation between this performance and the generalizable features used for the 2 gram feature-set, the top 50 generalizable features used for the kitchen & housewares domain-adaptations are presented in Table 27.
want to, don't waste, yes it, without any, very small, use to, bought this, years it, without a, very pleased, use it, all over, wrong with, will never, very carefully, to use, a very, wouldnt turn, which point, used the, my wife, a lot, would recommend, well as, used it, looking for, a huge, worth the, way to, used for, it went, a good, worked great, was told, use while, i think, a bit, worked for, was looking, use this, i bought, you won't, won't be, very well, use in, easy to, you pay, without hot, very very, use a
Table 27: Top 50 most common generalizable 2 gram features used for the kitchen & housewares dataset
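Candidate generalizable features such as those in Tables 26 and 27 are ranked by how consistently they are used in both domains. The sketch below captures that intuition in a simplified form; the scoring function here is an assumption for illustration, not the exact FCE formula of Tan et al. (2009) nor the Chapter 3 importance measure: a feature scores high when it is frequent in both domains and has a similar relative frequency in each.

```python
import math
from collections import Counter

def generalizable_scores(source_docs, target_docs, alpha=1e-4):
    """Score features occurring in both domains: reward a high product of
    the two relative frequencies, penalize a large frequency difference."""
    src = Counter(w for d in source_docs for w in d)
    tgt = Counter(w for d in target_docs for w in d)
    n_src, n_tgt = sum(src.values()), sum(tgt.values())
    scores = {}
    for w in set(src) & set(tgt):
        ps, pt = src[w] / n_src, tgt[w] / n_tgt
        scores[w] = math.log(ps * pt / (abs(ps - pt) + alpha))
    return scores

def top_k(scores, k):
    """The k highest-scoring candidate generalizable features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy corpora: "great" and "use" occur in both domains, the
# domain-specific words "blender" and "camera" in only one.
src_docs = [["great", "great", "blender"], ["great", "use"]]
tgt_docs = [["great", "great", "use"], ["great", "camera"]]
ranked = top_k(generalizable_scores(src_docs, tgt_docs), 2)
```

Note that such frequency-based scores have no notion of sentiment, which is consistent with the observation above that non-sentimental terms like “use to” and “i think” keep surfacing among the selected features.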

Looking at the reported 2 gram features in Table 27, the same observation can be made. Sentimental domain-independent examples are: “easy to”, “don't waste”, “a lot”, “a huge”, “a good”, “wrong with”, “would recommend”, “worked great”, “will never”, “very well”, “very small” and “very pleased”. Observed non-sentimental domain-independent terms are (among others): “use to”, “use it” and “i think”. Compared to Table 21, more sentimental terms seem to be present in this set as well. Although the quality of the selected generalizable features does seem to be an improvement over the features selected by FCE, the results are still not comparable to the results presented in Table 1. The previous experiments have shown that in order to provide for a successful domain-adaptation the quality of

the used generalizable features is essential.

Influence of Lambda

As for the previous two experiments, the influence of the lambda parameter on ANB using improved generalizable features is explored. As the features differ from the previous two experiments, the performance curves for this classifier can be expected to differ as well. The same datasets, parameter settings and feature-sets are used as in the previous experiment. In Figure 4 the different graphs are shown for the adapted Naive Bayes classifier using improved generalizable features.

[Figure 4 consists of four plots: "Performance of 1 gram ANB (IMP) vs Lambda", "Performance of 2 gram ANB (IMP) vs Lambda", "Performance of 1 gram + POS ANB (IMP) vs Lambda" and "Performance of 1 gram + 2 gram ANB (IMP) vs Lambda", each plotting accuracy, micro F1 and macro F1 against lambda values from 0.05 to 1.]

Figure 4: Performance of the ANB classifier using improved generalizable features with different feature-sets versus different values of the lambda parameter.
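The role of lambda probed in these curves can be made concrete with a small sketch. This is a simplified illustration of the interpolation idea on hypothetical toy data (Laplace smoothing assumed), not the exact ANB estimation used in the thesis: word probabilities are a lambda-weighted mixture of source- and target-domain counts, and lambda = 1.0 reduces to a source-only model.

```python
from collections import Counter

def mixed_word_probs(source_docs, target_docs, lam, vocab):
    """Lambda-weighted mixture of Laplace-smoothed word probabilities
    from the source and target domains; lam = 1.0 -> source data only."""
    src = Counter(w for doc in source_docs for w in doc)
    tgt = Counter(w for doc in target_docs for w in doc)
    n_src, n_tgt, v = sum(src.values()), sum(tgt.values()), len(vocab)
    return {w: lam * (src[w] + 1) / (n_src + v)
               + (1 - lam) * (tgt[w] + 1) / (n_tgt + v)
            for w in vocab}

# Toy example: "use" is frequent only in the target domain.
vocab = {"great", "waste", "use"}
src_docs = [["great", "use"], ["waste"]]
tgt_docs = [["use", "use"], ["great"]]
p_source_only = mixed_word_probs(src_docs, tgt_docs, 1.0, vocab)
p_mixed = mixed_word_probs(src_docs, tgt_docs, 0.6, vocab)
```

Lowering lambda pulls the estimate for “use” toward its higher target-domain frequency; the graphs in Figure 4 measure how much such shifts actually help classification.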

When comparing the results of the lambda experiments for the improved generalizable features to ANB using FCE, there seems to be very little difference in the curves of the feature-sets. The main difference lies in the small differences in the performance measures, most noticeable for the 1 gram + POS-pattern feature-set. As observed for the previous two experiments with ANB, a possible explanation for the unsuccessful domain-adaptation is the quality of the generalizable features. Where in the experiment considering ANB with FCE the quality of the generalizable features prohibited a successful domain-adaptation, the same conclusion has to be drawn for the improved generalizable features. Even with more sentimental features among the used generalizable features, it has to be concluded that there are still insufficient high-quality generalizable features to provide a successful domain-adaptation to the different target-domains.

Influence of number of features

As concluded for ANB using FCE, the generalizable features play an important role in the performance of the different classifiers. To explore the influence of the generalizable features and of the number of generalizable features used, the experiment varying the number of features is repeated for ANB using improved generalizable features. In this experiment the number of generalizable features is varied between 50 and 1000. To consistently measure the influence of the number of generalizable features, the lambda parameter is kept at 0.6 and EM is performed for 1 iteration. Because different features are used, it is expected that the curves for this experiment differ from the results for ANB using FCE and that there will be another optimal number of generalizable features. Performance is measured in terms of accuracy, micro F1 score and macro F1 score.
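The experimental loop just described can be sketched as below. Here train_anb and evaluate are hypothetical stand-ins for the thesis's actual training and evaluation code; only the sweep structure (fixed lambda of 0.6, one EM iteration, the feature counts plotted in the figure) reflects the text.

```python
def sweep_num_features(train_anb, evaluate, test_docs, test_labels,
                       feature_counts=(50, 100, 200, 400, 600, 800, 1000, 1200),
                       lam=0.6, em_iterations=1):
    """Train one domain-adapted model per number of generalizable
    features and collect (accuracy, micro F1, macro F1) per setting."""
    results = {}
    for k in feature_counts:
        model = train_anb(num_generalizable=k, lam=lam,
                          em_iterations=em_iterations)
        results[k] = evaluate(model, test_docs, test_labels)
    return results

# Dummy stand-ins so the sweep structure can be exercised on its own.
dummy_train = lambda num_generalizable, lam, em_iterations: num_generalizable
dummy_eval = lambda model, docs, labels: (0.5, 0.5, 0.5)
results = sweep_num_features(dummy_train, dummy_eval, [], [])
```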
In Figure 5 the performance for ANB using improved generalizable features is plotted against an increasing number of generalizable features; the evaluated feature-sets are 1 gram, 2 gram, combined 1 gram + POS-pattern and combined 1 gram + 2 gram.


[Figure 5 consists of four plots: "Performance of 1 gram ANB (IMP) vs No. Features", "Performance of 2 gram ANB (IMP) vs No. Features", "Performance of 1 gram + POS ANB (IMP) vs No. Features" and "Performance of 1 gram + 2 gram ANB (IMP) vs No. Features", each plotting accuracy, micro F1 and macro F1 against 50 to 1200 generalizable features.]

Figure 5: Performance of the ANB classifier using improved generalizable features with different feature-sets versus different numbers of generalizable features used.

Comparing the results for the 1 gram feature-set to ANB using FCE, the accuracy curve is similarly stable, but the main difference between the curves lies in the micro and macro F1 scores, which for the improved features show a less steep increase in performance from 200 to 800 used features. Both graphs converge around 1200 features to the same final result. The graph for the 2 gram feature-set is practically identical to the one observed for FCE, except for the expected slightly higher performance measures for the improved features. The main difference for the 1 gram + POS-pattern feature-set is that the FCE graph remains stable as the number of used features increases, whereas for the improved features the performance measures slightly increase from 600 to 800 features but remain stable for the rest of the plotted graph. For the combined 1 gram + 2 gram feature-set the same difference is observed as for the 1 gram feature-set: the increase in micro F1 and macro F1 in the interval between 200 and 800 features is steeper for the FCE features. In Tan et al. (2009), where this experiment is performed on Chinese datasets, the performance for different numbers of generalizable features varies from 0.6 to 0.85 in terms of micro and macro F1 scores. In this experiment the observed micro and macro F1 scores do not exceed 0.70. The curve reported in Tan et al. (2009) shows an increase as the number of generalizable features increases, up to a maximum of 0.8, after which a decrease is shown. In this experiment the curves for the different feature-sets look different. For the 1 gram and combined 1 gram + POS-pattern feature-sets the number of generalizable features used does not seem to make much difference for the performance (in terms of accuracy).
In terms of micro F1 and macro F1 the performance seems to increase for the 1 gram and combined 1 gram + 2 gram feature-sets up until the use of 1000 features, after which performance stabilizes (in contrast to the results reported by Tan et al. (2009), where performance decreases again after a peak is reached). As concluded for ANB using FCE, the generalizable features used in this experiment still contain a considerable number of non-sentimental features, which skew the domain-adaptation. Using more features does seem to increase performance for most of the feature-sets, but the convergence to a maximum performance that all feature-sets show implies that using more features introduces more non-sentimental features as well.

Differences with FCE

As has been shown in the previous sections, the differences between using FCE to select generalizable

features and the suggested improvements are quite small. In this section the differences between the two methods will be explored in more detail. In the next table the differences in probabilities between the two methods are presented for a couple of domain-adaptations. The differences shown are the features with the largest difference in probability between ANB using FCE and ANB using improved generalizable features for domain-transfers of the kitchen & housewares dataset. As an example, the probability differences are presented between the kitchen & housewares dataset and both a domain suitable for domain-adaptation and a domain less suitable for domain-adaptation. As the experiments in the previous sections have shown, the dvd dataset is a domain less suitable for domain-adaptation, as results for all the different methods have shown below-average performances. The sports & outdoors dataset has shown above-average performances, so this domain has been chosen as a representative example of a domain suitable for domain-adaptation. In Table 28 the differences in probabilities for both 1 gram and 2 gram features are shown.
Differences between FCE and IMP

dvd                          sports & outdoors
feature       difference     feature       difference
selle         0.0011         nut           0.0038
figuring      0.0011         delicate      0.0030
whateve       0.0005         containing    0.0023
wafer         0.0005         endless       0.0015
soundly       0.0005         drastically   0.0015
quality it    0.0013         pre drilled   0.0016
mine to       0.0013         could buy     0.0016
would put     0.0005         company is    0.0016
was fine      0.0005         working on    0.0008
now just      0.0005         then to       0.0008
Table 28: differences in probabilities between FCE and IMP for domain-adaptations from the kitchen & housewares dataset.

As can be seen in Table 28, there are differences between the resulting probabilities after performing a domain-adaptation with FCE or with improved generalizable features. However, the differences are not very big. These small differences offer a possible explanation for the nearly identical results of the two methods discussed in the previous section. As classification is performed at sentence level, a difference in a single term is unlikely to change the overall classification of the complete sentence, resulting in little difference in the classification of an entire text. From Table 28 it can also be inferred that the detected differences between the resulting features make more sense for the domain more suitable for domain-adaptation. In the reported features for the sports & outdoors dataset, more sentimental generalizable features are present compared to the dvd dataset. In the sports & outdoors set, features are present like “nut”, “delicate”, “drastically” and “could buy”. In the column for the dvd dataset the total number of features denoting sentimental generalizable features is lower, with “soundly” and “quality it”. In Table 29 the probabilities assigned to the earlier observed domain-specific features “surprise” and “easy” are reported. From these results the differences between the methods can be explored and an explanation can be found for the unsuccessful application of the different domain-adaptations.
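Comparisons like Table 28 amount to ranking features by the absolute difference between the probabilities the two adapted models assign them. A minimal sketch (the input dictionaries below are hypothetical feature-to-probability maps, not values from the experiments):

```python
def largest_prob_differences(probs_fce, probs_imp, k=3):
    """Return the k features whose probability differs most between
    two models, as (feature, absolute difference) pairs."""
    shared = set(probs_fce) & set(probs_imp)
    diffs = {w: abs(probs_fce[w] - probs_imp[w]) for w in shared}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical probabilities for three features shared by both models.
fce = {"nut": 0.0050, "delicate": 0.0041, "use": 0.0020}
imp = {"nut": 0.0012, "delicate": 0.0011, "use": 0.0019}
top = largest_prob_differences(fce, imp)
```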


probability “surprise”                           probability “easy”
dataset                  neg        pos          dataset          neg        pos
apparel2camera_&_photo   0.0001343  0.0000647    apparel2music    0.0000850  0.0000462
camera_&_photo2apparel   0.0000423  0.0000418    music2apparel    0.0000537  0.0000485
Table 29: probabilities assigned to two features by domain-adaptation using ANB with improved generalizable features.
Table 29 shows how the domain-adaptation works in terms of assigning probabilities to domain-dependent features. When adapting the apparel domain to the camera & photo domain, the probability for the feature “surprise” in a negative context is bigger than the positive probability. When adapting the other way around, from the camera & photo domain to the apparel domain, the negative context again receives the higher probability, but the margin between the two is far smaller. For the less obvious example “easy”, domain-adaptations are shown between the apparel and music datasets. In this case it can be seen that although the assigned probabilities differ, the highest probability in both cases is assigned to the negative context. As in the case of the apparel and camera & photo datasets, the domain-adaptation assigns different probabilities to the feature, but the differences in probability are very small. This can explain the relatively low classification results, as the small differences between the probabilities probably do not dictate the final decision when classification is performed at sentence level.
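The argument above, that tiny per-feature probability differences rarely flip a decision, follows from how Naive Bayes classifies a sentence: log probabilities are summed over all words, so one word's small margin is usually dominated by the rest. A toy sketch with hypothetical probabilities in the spirit of Table 29:

```python
import math

def classify_sentence(words, log_prior, log_like, unk=math.log(1e-8)):
    """Pick the class with the highest summed log probability."""
    scores = {c: log_prior[c] + sum(log_like[c].get(w, unk) for w in words)
              for c in log_prior}
    return max(scores, key=scores.get)

log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
# "easy" slightly favors neg (as in Table 29); "fits" strongly favors pos.
log_like = {"pos": {"easy": math.log(4.62e-5), "fits": math.log(1e-2)},
            "neg": {"easy": math.log(5.37e-5), "fits": math.log(2e-3)}}
label = classify_sentence(["easy", "fits"], log_prior, log_like)
```

Despite “easy” favoring the negative class, the sentence comes out positive because the other word's much larger margin dominates the sum, which is why the narrow probability gaps in Table 29 seldom decide a classification.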


Chapter 7 Conclusions and future work
In this thesis three different experiments have given insight into the cross-domain behavior of the Naive Bayes classifier. The first experiment has shown that it is possible to set up a general approach in which the 38 different datasets are used consistently. The 38 datasets are first preprocessed, then trained on and evaluated using the Naive Bayes classifier, resulting in performance consistent with previous research. For the different evaluated feature-sets, accuracy scores all above 0.74 and micro F1 and macro F1 scores above 0.78 are reported. As an additional classifier the subjectivity lexicon classifier was evaluated, showing, as expected, lower performance than the Naive Bayes classifier, consistent with findings in previous research. From these experiments it can be concluded that the general setup used in this research is valid and can be used to set up cross-domain classification experiments as well as serve as a baseline for performance. In order to evaluate the performance of the same general setup in a cross-domain setting, a naive cross-domain experiment was set up evaluating cross-domain performance following the leave-one-out approach (LOO). The accuracy results for this experiment ranged from 0.6178 to 0.6477, a considerable decrease in performance compared to the in-domain performance of the same classifier with the same feature-sets. The micro and macro F1 scores ranged from 0.6382 to 0.6916. As far as comparison to previous research was possible, results were consistent. From this experiment it can be concluded that the general setup used in this research is valid in a naive cross-domain setting and can be used to explore more elaborate cross-domain approaches. Further it can be concluded that, in order to perform cross-domain sentiment analysis, a more profound method is necessary to prevent this considerable decrease in performance.
The suggested improvements in this research attempt to perform domain-adaptations between source and target domain in order to properly fit a probabilistic model capable of highly accurate cross-domain sentiment classification. In the third set of experiments the research questions stated in this thesis are answered. As a first experiment, the approach suggested by Tan et al. (2009) was attempted on the 38 English datasets used in this research. The approach uses an adapted form of the Naive Bayes classifier, using EM to optimize a probabilistic model with generalizable features as a bridge between source and target-domain. The performance observed in this experiment ranged widely, in terms of accuracy from 0.5060 to 0.6063 for the different evaluated feature-sets. The micro and macro F1 scores ranged from 0.4502 to 0.6413. From these experiments it can be concluded that using the adapted Naive Bayes classifier with FCE does not improve classification results overall. ANB using FCE shows lower performance measures, especially compared to the accuracies reported by Tan et al. (2009) on Chinese datasets. As the different experiments have shown, the generalizable features leave room for improvement, as the generalizable features calculated by FCE contain many common non-sentimental terms. In contrast to the full experiments, individual datasets can show an increase in performance compared to naive cross-domain classification. The datasets that do show improvement are the bigger datasets, having more data available, and the imbalanced datasets. The majority of the used

datasets in these experiments are of a medium size (2000 examples), and these prove to be very dependent on the domain they originate from. The bigger datasets prove to be less dependent on the domain. Future research can focus on using bigger datasets to explore whether the trend observed in this research holds in an experiment with larger datasets. The first suggested improvement in selecting generalizable features is the use of a subjectivity lexicon instead of generalizable features acquired from the data. The subjectivity lexicon had been used in the first experiment as a classifier, showing stable performance from in-domain to cross-domain and stable performance in terms of micro F1 and macro F1 scores. As a first improvement of the generalizable features, the lexicon is used directly as the set of generalizable features. The results reported for this experiment range from 0.4817 to 0.6111 in terms of accuracy. The micro and macro F1 scores range from 0.4531 to 0.6312. From the experiments conducted with the subjectivity lexicon as generalizable features it can be concluded that performance increases very slightly in comparison to ANB using FCE for all feature-sets except the combined 1 gram + POS-pattern feature-set. The explanation found was the POS-patterns used, which in general are very domain-specific; this domain specificity is not present in the subjectivity lexicon. As for ANB using FCE, it can be concluded that using a subjectivity lexicon as generalizable features does not help increase classification performance in comparison to the naive cross-domain baseline. Also consistent with ANB using FCE, the datasets that do show improvement are the bigger datasets, having more data available. The bigger datasets prove to be less dependent on the domain. Future research can focus on using bigger datasets to explore whether the trend observed in this research holds in experiments with larger datasets.
Another possible improvement is the use of a smaller subjectivity lexicon. As described, the subjectivity lexicon is quite big, which could mean that domain-dependent, non-sentimental features are present in the lexicon as well. Narrowing down the subjectivity lexicon could lead to better results. In the final set of experiments an improvement of the selection of generalizable features is evaluated, using an importance measure in addition to FCE to select generalizable features. The features selected in this experiment do seem to show improvements over solely using FCE and over using a subjectivity lexicon as generalizable features. In comparison to ANB using a subjectivity lexicon, all feature-sets increase in performance. In comparison to ANB using FCE, all feature-sets improve with the exception of the POS-pattern feature-set. Looking at the generated generalizable features, fewer non-sentimental terms are observed and more features containing sentiment are reported. The results reported for this experiment range from 0.4959 to 0.6099 in terms of accuracy. The micro and macro F1 scores range from 0.4544 to 0.6415. As the increases over the two other methods using ANB are very small, the conclusion has to be drawn that using improved generalizable features does not help increase classification performance in comparison to the naive cross-domain baseline. However, the improved generalizable features do show a slight improvement over the ones selected by FCE. This leads to the conclusion that performance increases can be achieved by optimizing the generalizable features used to initiate the Naive Bayes model. Future research can focus on acquiring better generalizable features to facilitate the domain-transfer. Consistent with ANB using FCE, the datasets that do show improvement are the bigger datasets, having more data available.
The bigger datasets prove to be less dependent on the domain, leading to a better representation of generalizable features. Future research can also focus on using bigger datasets to explore whether the trend observed in this research holds in an experiment with larger datasets.


References
Airoldi, E.M., Bai, X., and Padman, R. (2006). Markov blankets and meta-heuristic search: Sentiment extraction from unstructured text. Lecture Notes in Computer Science, 3932:167–187. Aue, A. and Gamon, M. (2005). Customizing Sentiment Classifiers to New Domains: a Case Study. Submitted to RANLP-05, the International Conference on Recent Advances in Natural Language Processing, Borovets, BG, 2005 Balasubramanian, R., Cohen, W., Pierce, D. and Redlawsk, D.P. (2011). What pushes their buttons? predicting polarity from the content of political blog posts. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 12–19, Portland, Oregon, 23 June 2011. Benamara, F., Cesarano, C., Picariello, A., Reforgiato, D., and Subrahmanian, V. (2007). Sentiment analysis: Adjectives and adverbs are better than adjectives alone. In Proceedings of the International Conference in Weblogs and Social Media (ICWSM). Bethard, S., Yu, H., Thornton, A., Hatzivassiloglou, V., and Jurafsky, D. (2004). Automatic Extraction of Opinion Propositions and their Holders. In AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications. Blitzer, J., Dredze, M., Pereira, F. (2007). Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Association of Computational Linguistics (ACL), 2007 Bollegala D., Weir, D. and Carroll, J. (2011). Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 132–141, Portland, Oregon, June 1924, 2011. Brill, E. (1994). Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 722-727, Menlo Park, CA: AAAI Press. Choi, Y., Cardie, C., Riloff, E., and Patwardhan, S. (2005). 
Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns. In Proceedings of HLT/EMNLP-05. Church, K.W. and Hanks, P. (1989). Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Conference of the Association of Computational Linguists, pages 76– 83. Chklovski, T. (2006). Deriving Quantitative Overviews of Free Text Assessments on the Web. In Proceedings of 2006 International Conference on Intelligent User Interfaces (IUI06), Sydney, Australia. Das, S. and Chen, M. (2001). Yahoo! For Amazon: Extracting market sentiment from stock message boards. In Proceedings of the 8th Asia Pacific Finance Association Annual Conference (APFA 2001). 85

Dave, K., Lawrence, S., and Pennock, D.M. (2003). Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. In Proceedings of the International World Wide Web Conference, Budapest, Hungary.
Denecke, K. (2009). Are SentiWordNet scores suited for multi-domain sentiment classification? In Proceedings of the Fourth International Conference on Digital Information Management (2009), IEEE, pages 1–6.
Engström, C. (2004). Topic dependence in sentiment classification. Unpublished M.Sc. thesis, University of Cambridge.
Esuli, A. and Sebastiani, F. (2005). Determining the semantic orientation of terms through gloss classification. In Proceedings of CIKM-05, 14th ACM International Conference on Information and Knowledge Management, Bremen, DE, pages 617–624.
Gamon, M. (2004). Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of the International Conference on Computational Linguistics (COLING).
Ghosh, S. and Koch, M. Unsupervised Sentiment Classification Across Domains using adjective co-occurrence for polarity analysis. Unpublished.
Go, A., Bhayani, R., and Huang, L. (2009). Twitter Sentiment Classification using Distant Supervision. Technical report, Stanford University, pages 30–38.
He, Y., Lin, C., and Alani, H. (2011). Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification. In Proceedings of ACL 2011, pages 123–131.
Hatzivassiloglou, V. and McKeown, K. (1997). Predicting the Semantic Orientation of Adjectives. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), pages 174–181.
Hatzivassiloglou, V. and Wiebe, J. (2000). Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the International Conference on Computational Linguistics (COLING).
Hearst, M.A. (1992). Direction-based text interpretation as an information access refinement. In Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, Mahwah, NJ: Lawrence Erlbaum Associates.
Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Seattle, Washington, USA.
Hu, M. and Liu, B. (2005). Mining and summarizing customer reviews. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing.
Jones, K.S. (1972). A statistical interpretation of term specificity and its application in retrieval. In Journal of Documentation, 28:11–21.
Kim, S. and Hovy, E. (2004). Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics.
Kim, S. and Hovy, E. (2006). Identifying and Analyzing Judgment Opinions. In Proceedings of HLT/NAACL-2006, New York City, NY.
Landauer, T.K. and Dumais, S.T. (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. In Psychological Review.
Liu, B. (2006). Web Data Mining. Chapter: Opinion Mining. Springer.
Liu, K. and Zhao, J. (2009). Cross-domain sentiment classification using a two-stage method. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, China, pages 1717–1720.
Matsumoto, S., Takamura, H., and Okumura, M. (2005). Sentiment classification using word sub-sequences and dependency sub-trees. In Proceedings of PAKDD'05, the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining.
McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML Workshop on Learning for Text Categorization.
Melville, P., Gryc, W., and Lawrence, R.D. (2009). Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD 2009).
Nigam, K., Lafferty, J., and McCallum, A. (1999). Using maximum entropy for text classification. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67.
Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of EMNLP 2002.
Pang, B. and Lee, L. (2004). A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proceedings of ACL 2004.
Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL 2005, pages 115–124.
Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. In Foundations and Trends in Information Retrieval, 2(1-2):1–135.
Popescu, A. and Etzioni, O. (2005). Extracting Product Features and Opinions from Reviews. In Proceedings of HLT-EMNLP 2005.
Prinz, J. (2004). Gut Reactions: A Perceptual Theory of Emotion. Oxford University Press.

Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop.
Riloff, E., Wiebe, J., and Wilson, T. (2003). Learning Subjective Nouns Using Extraction Pattern Bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-03), ACL SIGNLL, pages 25–32.
Riloff, E., Wiebe, J., and Phillips, W. (2005). Exploiting subjectivity classification to improve information extraction. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI-2005).
Rothfels, J. and Tibshirani, J. (2010). Unsupervised sentiment classification of English movie reviews using automatic selection of positive and negative sentiment items.
Snyder, B. and Barzilay, R. (2007). Multiple Aspect Ranking using the Good Grief Algorithm. In Proceedings of NAACL 2007.
Spertus, E. (1997). Smokey: Automatic recognition of hostile messages. In Proceedings of the Conference on Innovative Applications of Artificial Intelligence, pages 1058–1065. Menlo Park, CA: AAAI Press.
Tan, S., Wu, G., Tang, H., and Cheng, X. (2007). A novel scheme for domain-transfer problem in the context of sentiment analysis. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07), New York, NY, USA.
Tan, S., Cheng, X., Wang, Y., and Xu, H. (2009). Adapting Naive Bayes to Domain Adaptation for Sentiment Analysis. In ECIR 2009, pages 337–349.
Tatemura, J. (2000). Virtual reviewers for collaborative exploration of movie reviews. In Proceedings of the 5th International Conference on Intelligent User Interfaces, pages 272–275.
Terveen, L., Hill, W., Amento, B., McDonald, D., and Creter, J. (1997). PHOAKS: A system for sharing recommendations. In Communications of the ACM, 40(3):59–62.
Thomas, M., Pang, B., and Lee, L. (2006). Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).
Turney, P.D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL-02, Philadelphia, Pennsylvania, pages 417–424.
Turney, P.D. and Littman, M.L. (2003). Measuring praise and criticism: Inference of semantic orientation from association. In ACM Transactions on Information Systems (TOIS), 21(4):315–346.
Whitehead, M. and Yaeger, L. (2009). Building a General Purpose Cross-Domain Sentiment Mining Model. In Proceedings of CSIE (4) 2009, pages 472–476.


Wiebe, J., Bruce, R., and O'Hara, T. (1999). Development and use of a gold standard data set for subjectivity classifications. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 246–253.
Wiebe, J.M., Wilson, T., Bruce, R., Bell, M., and Martin, M. (2004). Learning subjective language. In Computational Linguistics, 30:277–308.
Wilson, T., Wiebe, J., and Hwa, R. (2004). Just how mad are you? Finding strong and weak opinion clauses. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-2004).
Wilson, T., Wiebe, J., and Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proceedings of HLT/EMNLP 2005, Vancouver, Canada.
Wilson, T., Wiebe, J., and Hoffmann, P. (2009). Recognizing Contextual Polarity: An exploration of features for phrase-level sentiment analysis. In Computational Linguistics, 35(3).
Wu, Q., Tan, S., and Cheng, X. (2009a). Graph ranking for sentiment transfer. In ACL-IJCNLP 2009, pages 317–32.
Wu, Q., Tan, S., Zhai, H., Zhang, G., Duan, M., and Cheng, X. (2009b). SentiRank: Cross-Domain Graph Ranking for Sentiment Classification. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pages 309–314, September 15–18, 2009.
Wu, Q., Tan, S., Duan, M., and Cheng, X. (2010). A Two-Stage Algorithm for Domain Adaptation with Application to Sentiment Transfer Problems. In AIRS 2010, pages 443–453.
Yang, Y. and Liu, X. (1999). A Re-examination of Text Categorization Methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, US.
Yessenalina, A., Choi, Y., and Cardie, C. (2010). Automatically generating annotator rationales to improve sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers.
Yu, H. and Hatzivassiloglou, V. (2003). Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
