www.visibletechnologies.com - 888.852.0320 (US) +44 20 7887 6202 (UK)

MEASURING SOCIAL SENTIMENT: ASSESSING AND SCORING OPINION IN SOCIAL MEDIA

Opinion Mining

Most information processing techniques, including search engines, assume that data is factual. This assumption has become less accurate due to the explosive growth of social media. There is now a large body of data that includes opinion, yet the standard tools have no way to assess that opinion in a meaningful way. As a result, recent and emerging research focuses on techniques to evaluate opinion in user-generated content. Opinion mining is a term frequently used to describe these efforts to find valuable information in the vast quantity of user-generated content. The mining metaphor is apt: opinion mining techniques need to be able to locate both trends in social sentiment and specific examples—just as miners need to identify both a seam and specific pieces of ore to extract.

What is Sentiment Analysis?

Opinion mining is a broad topic; this paper focuses on sentiment analysis, which is the aspect of opinion mining that has received the most attention. Sentiment analysis has broad application and encompasses work in classifying subjectivity, polarity, tonality, appraisal extraction, emotion mining, affective computing, review mining, etc. The goal of sentiment analysis is to determine the attitude, opinion, emotional state, or intended emotional communication of a speaker or writer. Other work in information retrieval includes relevance (is this text relevant to my query?) and subjectivity (does it contain an opinion?). We focus here on two aspects of sentiment analysis:

1. Quantifying sentiment for an entity or population over time.
2. Retrieving examples or summaries of sentiment along those same dimensions.

For a broader introduction and survey of sentiment analysis, see [Liu09] and [PangLee08].

Evaluating Sentiment Analysis

Measuring the performance of a sentiment analysis solution is more complicated than one might expect, for several reasons:

• Sentiment is subjective. One could argue that almost all text contains some amount of sentiment. In a substantial minority of cases, human evaluators are uncertain or disagree about the sentiment contained in text. Sentiment evaluation relies heavily on context.

• There are degrees of sentiment. It is not simply the presence or absence of sentiment, but a matter of gradation. Early work in review mining (movie reviews, book reviews) assumes subjectivity and measures only polarity (Positive vs. Negative). There is also no standard for qualitative labels: even if several people agree on a qualitative label (say Very Positive), there is no quantitative measure of what that means. Polarity obviously changes in different contexts as well ("I love Coke but hate Pepsi" is Positive for Coke but Negative for Pepsi).

• Sentiment can also be measured at different granularities. Given the context of a query or subject, sentiment can be measured at the document, paragraph, sentence, phrase, or pattern level.

In the sections that follow, we'll discuss three different families of evaluations and see how they align to the ways people use sentiment in social media analytics. In any analytics evaluation it's crucial to measure what you care about from a business objective.

1. Document-Level Evaluation

One of the most common ways to measure sentiment performance in social media analytics is accuracy at the document level. For people who are auditing a social media monitoring solution, it may also be the first idea that comes to mind: read a representative set of documents one by one and mark your agreement. This approach generally requires agreeing on a consistent set of labels or classes and assigning documents to those classes. It is popular to treat this as a three-class problem (for example Positive, Negative, or Neutral). There are many statistical measures of correctness given an assignment of labels in a multi-class problem (error, accuracy, kappa, f-measure, etc.). Some of this comes from early work on sentiment analysis in the social sciences (looking at kappa and other agreement statistics among annotators). It is also an obvious metric for those analysts building text classification models (minimizing document misclassification error is the same as maximizing document-level accuracy). To the layman, this family of metrics may seem sufficient and broadly applicable. When evaluating a sentiment solution according to its ability to classify documents, however, it's important to keep a few other things in mind:

1. The number of classes matters. If all classes are equally likely, then a random guess will get you 1/n accuracy (where n is the number of classes). It isn't meaningful to compare a two-class polarity classifier (where random guessing will perform at 50% accuracy) to a three-class sentiment classifier (where random guessing will perform at 33% accuracy) to a four-class sentiment classifier (like the Visible solution, where random guessing will perform at 25% accuracy).

2. The distribution of the class labels matters. In a situation where the true distribution is 10% Positive, 5% Negative, and 85% Neutral, random guessing with equal class probabilities gets you 33% accuracy, guessing according to the true distribution gets you 73.5% accuracy (.1² + .05² + .85² = .735), and always guessing Neutral gets you 85% accuracy (which sounds great but is completely useless).

3. The population of documents matters. For example, classifying tweets is much easier than classifying blog posts: the difficulty of informal and idiosyncratic language is vastly outweighed by the constraint of a single, short thought or comment. In our experience, accuracy on Twitter tends to be about ten percentage points higher than on social media content overall.

Unfortunately, in practice, document-level accuracy itself rarely matters in terms of the business problem. Along those lines, if you think about the way sentiment is used in social media analytics, the primary use of sentiment is quantifying it over a population (tracking sentiment over time, comparing to competitors, comparing products, measuring the impact of advertising campaigns, looking for spikes, etc.). But does document-level accuracy capture that? It is true that if you could classify documents perfectly, or had a completely unbiased model, then you could sum up a perfect quantification for any population. In practice, however, all models have some bias. The consequence is that a model that does better at document classification may do worse at quantifying sentiment for a population (and vice versa).
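To make the baseline arithmetic concrete, here is a short sketch of the three guessing strategies on the skewed 10%/5%/85% distribution discussed above (the class labels and split come from that example):

```python
# Expected accuracy of three naive baselines on a skewed
# three-class problem (10% Positive, 5% Negative, 85% Neutral).
true_dist = {"Positive": 0.10, "Negative": 0.05, "Neutral": 0.85}

# 1. Uniform random guessing: 1/n for n classes.
uniform = 1 / len(true_dist)

# 2. Guessing according to the true distribution:
#    the sum of squared class probabilities.
matched = sum(p * p for p in true_dist.values())

# 3. Always guessing the majority class (Neutral).
majority = max(true_dist.values())

print(f"uniform:  {uniform:.3f}")   # 0.333
print(f"matched:  {matched:.3f}")   # 0.735
print(f"majority: {majority:.3f}")  # 0.850
```

The majority-class baseline illustrates why a raw accuracy figure is meaningless without the class distribution: 85% "accuracy" here requires no model at all.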

2. Aggregate-Level Evaluation

The primary use of sentiment analysis in social media is to track the attitudes about a topic or the opinions of a population. Since this is what we care about most, it makes sense to measure how well sentiment solutions do on this problem in particular. The most obvious metric to look at for quantification or estimation is the error, or distance from the true solution. Let's look at an example where the true distribution for a population is 11% Positive, 8% Negative, and 81% Neutral, and our sentiment solution estimates the distribution to be 13% Positive, 10% Negative, and 77% Neutral. The L1 distance is 2% + 2% + 4% = 8% from the true distribution. The L2, or Euclidean, distance is sqrt(2%² + 2%² + 4%²) = 4.9% from the true distribution.

While it may be tempting to try to turn these numbers into an "accuracy" measure, such attempts convolute the real performance of the solution. You might be tempted to use 1 minus the estimation error as an accuracy measure; however, the estimation error is not constrained to 0-1, and you can end up with negative accuracy by doing so (for example, 100%/0%/0% vs. 0%/100%/0% gives an L1 distance of 200% and an L2 distance of 141%).

As an example of some of the confusing accuracy claims out there, one social media monitoring vendor reports accuracy as the percentage of the time their estimation is closer to the true distribution than a random guess (distribution) would be. While this might sound reasonable at first, a little analysis shows that this metric isn't very useful. Due to the well-known statistical properties of high-dimensional geometry (the "curse of dimensionality"), random points tend to be far from each other. Our three-class example above (13%/10%/77%) is closer to the true distribution (11%/8%/81%) than 99% of random guesses. Is this 99% accuracy?

Here are a few recommendations for evaluating a sentiment solution according to its ability to quantify an aggregate-level distribution:

1. Directly assess real versus estimated distributions to draw conclusions. If you must summarize to a single metric, concentrate on transparent metrics like error or distance, or use statistically grounded measures like a chi-squared test if you are comfortable with them.

2. When comparing claims, make sure they are based on the same number of class labels, the same topics (true distributions), and the same type of data (Twitter vs. blogs vs. forums, for example).

3. Deconstruct the problem to get a better idea of performance. Rather than looking at all classes at once, measure how well the solution does on relevance, subjectivity (Neutral vs. Non-Neutral), and polarity (Positive, Negative, Mixed), assessing each one separately.
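The distance example above (11%/8%/81% truth vs. a 13%/10%/77% estimate) can be reproduced in a few lines. The Monte Carlo check of the "closer than a random guess" claim is a sketch under the assumption that random guesses are drawn uniformly from the probability simplex; the sampling method is our illustration, not any vendor's procedure:

```python
import math
import random

true_dist = [0.11, 0.08, 0.81]  # Positive, Negative, Neutral
estimate = [0.13, 0.10, 0.77]

# L1 distance: sum of absolute errors -> 0.08 (8%).
l1 = sum(abs(t - e) for t, e in zip(true_dist, estimate))

# L2 (Euclidean) distance: sqrt of summed squared errors -> ~0.049 (4.9%).
l2 = math.dist(true_dist, estimate)

print(f"L1 = {l1:.3f}, L2 = {l2:.3f}")

# Check the "closer than a random guess" claim by sampling points
# uniformly from the probability simplex (uniform-spacings method).
def random_distribution():
    cuts = sorted([0.0, random.random(), random.random(), 1.0])
    return [hi - lo for lo, hi in zip(cuts, cuts[1:])]

trials = 20_000
worse = sum(
    math.dist(true_dist, random_distribution()) > l2 for _ in range(trials)
)
# Roughly 99% of random guesses land farther away: almost any
# sensible estimate "beats a random guess", so the metric says little.
print(f"random guesses farther away: {worse / trials:.1%}")
```

Running the simulation makes the curse-of-dimensionality point tangible: beating a random distribution is a very low bar.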

3. Information Retrieval Evaluation

Another important use case in social media is retrieving representative sentimented documents. For example, you might want to read examples of negative posts. The nice thing about searching for examples is that posts don't have to be labeled with discrete labels like Positive, Negative, or Neutral; instead, a degree of belonging can be used. An individual post doesn't have to be labeled with a single class but can be a mixture of various sentiment dimensions (for example, Very Positive and Somewhat Negative). Results can be rank ordered and presented from most to least Positive. There are many rank-ordering metrics that can be applied to this problem, but precision is a simple and useful metric.

Precision is a measure of the solution's success in retrieving posts that are relevant to the label you specify. For example, if you asked for Negative posts and the solution presents 25 posts, 20 of which you agree are negative, the precision of the solution is 20/25 = 80%. For information retrieval tasks, it is often useful to trade off recall for precision (if there were 50 other Negative posts that were not presented, the recall of the solution is only 20/70 = 29%). This is because there are typically many results available, and the user experience is best when the first few pages of the results are highly precise. An intelligent solution will present you the best results on the first few pages; this does not necessarily generalize to document-level accuracy overall.

If you are evaluating a sentiment solution according to its ability to retrieve appropriate examples, here are a few recommendations:

1. Look at each of the class precisions directly (Positive precision, Negative precision, etc.). Avoid micro- or macro-averaging precision numbers, which can be misleading.

2. As with other measures, the number of classes matters (precision on a two-class problem is not comparable to precision on a four-class problem). The distribution of the classes also matters: a skewed distribution like 10%/5%/85% will give lower precisions for the smaller classes.

3. When comparing claims, make sure they are based on the same number of class labels, the same topics (true distributions), and the same type of data (Twitter vs. blogs vs. forums, etc.).

4. If the degree or magnitude of the mistake is important to your application, that can be built into the evaluation metrics.

Comparing Automation to Humans

For most data, automated sentiment technology cannot yet reach the quality of a smart, well trained, and careful human annotator. In practice, however, the tradeoff between human and automated sentiment analysis is a false one. Both are necessary and work together in partnership. Humans can't compete with the speed and consistency of an automated solution, but are needed for judgment, insight, and interpretation of those results. Annotating documents for sentiment is a thankless, tedious, and usually low-paying job where it's easy to produce low-quality work. Automated techniques are tireless, fast (they can score long posts in milliseconds), consistent (they don't make random errors), and can be improved over time. Automated solutions can, in certain real-world scenarios, give comparable results to humans.

Even when the same number of mistakes are made by humans and automated techniques, customers sometimes have a hard time with the types of mistakes the automation makes. One interesting observation is that automated techniques sometimes make errors in an inhuman way. Mistakes made by humans tend (to another human) to be less blatant or more understandable. Mistakes made by automation can tend (to a human) to seem "obviously" wrong.
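The precision and recall arithmetic from the Negative-post retrieval example above can be checked in a couple of lines (the counts 25, 20, and 50 come directly from that example):

```python
# Negative-post retrieval example: the solution returns 25 posts,
# 20 of which are genuinely Negative; 50 more Negative posts exist
# in the collection but were not returned.
returned = 25
relevant_returned = 20
relevant_missed = 50

# Precision: fraction of returned posts that are relevant -> 20/25 = 80%.
precision = relevant_returned / returned

# Recall: fraction of all relevant posts that were returned -> 20/70 = 29%.
recall = relevant_returned / (relevant_returned + relevant_missed)

print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

The asymmetry is the point: a retrieval-oriented solution can keep precision high on the first pages while leaving recall low, which is usually the right trade for browsing examples.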

Business Applications of Sentiment

If you have looked at social media monitoring platforms to help you better understand what consumers are saying about your brand on the social web, sentiment has probably come up on more than one occasion. In this section, we look at what sentiment means from a business perspective, how it can be used by a business looking at social media, and factors to keep in mind when evaluating it.

A sentiment score can be extremely useful in evaluating a large data set of social brand mentions. Sentiment scores give users a straightforward way to segment, filter, and search content based on positive or negative commentary, allowing them to isolate the themes or issues driving that sentiment. Sentiment also allows for dynamic and illustrative reporting of trends and market reactions. Each of these uses can provide great insight into social data and can help propel a brand forward. It is important, however, to have realistic expectations of what is possible and an understanding of the limitations that exist.

The Nuances of Sentiment

One of the challenges of understanding and applying a sentiment analysis solution in a business setting is that sentiment is not a one-dimensional result with a universally agreed-upon set of criteria. Subjectivity, variances in interpretation, and the context of what is being expressed create challenges for both humans and machines. This is particularly true of social media content, and it means that evaluating the performance and accuracy of any solution is complex.

For example, most solutions tend to divide sentiment into three classes—Positive, Negative, or Neutral (no sentiment expressed). This is a perfectly acceptable approach to classifying sentiment. Keep in mind, however, that human analysts agree on sentiment scoring on average only 80 percent of the time. Removing the automated functionality and relying on human scorers wouldn't solve the problem: according to Lexalytics, human scoring, even with strict judging by individuals who do their best to take context, intention, humor, and sarcasm into consideration, yields at best 80 percent agreement. Given that this is the highest benchmark of accuracy, a sentiment score will be wrong about 1 out of every 5 times.

So where does that leave businesses that depend on the accuracy of analytics for critical business decisions, or for situations like product recalls or early detection of negative incidents? Significantly better off than they were before. Sentiment analysis is still valuable, particularly when used alongside other indicators such as volume change and frequency analysis. It is important to remember, however, that while powerful and accurate scientific methods can be applied to analyzing what people are expressing online, understanding the context, subject matter, humor, sarcasm, and wit used in human communications can require as much art as science. Keeping this in mind will prepare you to understand both the power and limitations of sentiment analysis.

Evaluating Sentiment Accuracy across Solutions

Let's talk about practical ways of evaluating solutions you may be considering. Start with the obvious and compare accuracy rates, but be sure to keep an apples-to-apples perspective when doing so. As demonstrated above, accuracy is a measure that can be calculated, but it is also a figure that can be misrepresented or tailored to support a given scenario. Keep in mind that the accuracy figures of a three-sentiment classification solution are inherently different from those of a solution with a four-sentiment classification that includes a Mixed sentiment score. The first has a one in three chance of randomly getting it right, while the other has a one in four chance, simply because there are more choices.

In addition to the accuracy figures, consider the vendor's explanations about how accuracy is calculated, how they address and improve the system's accuracy and learning over time, and what types of quality checks and improvements they are making. Ask about the type and amount of data they used to create, train, and refine their system.

It is also important to look beyond the percentages and random sampling and evaluate how the sentiment measure fits into the overall analytics solution. Does the platform allow you to sort, filter, and search for content based on sentiment? Can you drill down into results, change and manipulate your criteria on the fly, and create an unlimited number of queries to answer all of your questions?

Consider the number of sentiment classes. Does the solution provide enough for your business purposes? For example, is it adequate to have Positive, Negative, and Neutral categories, or would identifying Mixed posts be beneficial? (Mixed indicates both positive and negative sentiment within a single post.) A Mixed sentiment score, for example, would enable you to identify when consumers are on the fence about making a purchase decision—perhaps the price is right, but a concern about replacement-part availability is what is keeping someone from buying. So it is important to look behind the numbers and make sure the total solution is suited to your organization's needs.
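One way to put accuracy figures from solutions with different numbers of classes on a more comparable footing is to correct each for its chance baseline, in the spirit of the kappa-style agreement statistics mentioned earlier. This is a hedged illustration only, not any vendor's published method, and the 60% figures are hypothetical:

```python
def chance_corrected(accuracy, num_classes):
    """Rescale raw accuracy so that random guessing among equally
    likely classes scores 0.0 and a perfect classifier scores 1.0."""
    chance = 1.0 / num_classes
    return (accuracy - chance) / (1.0 - chance)

# Two hypothetical solutions, both reporting 60% raw accuracy:
three_class = chance_corrected(0.60, 3)  # 0.40 above chance
four_class = chance_corrected(0.60, 4)   # ~0.47 above chance
print(f"3-class skill: {three_class:.2f}, 4-class skill: {four_class:.2f}")
```

On this scale the four-class solution shows more skill for the same raw accuracy, which is exactly the apples-to-apples caveat: identical percentages do not mean identical performance.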

Conclusions

In summary, keep the following questions in mind when evaluating social sentiment solutions and choosing the platform that is right for your business:

• What are my social media goals and how does sentiment fit into the equation?
• What type of reporting do I want to do and how does sentiment help me do that?
  o At the industry level, brand level, or for very specific issues?
  o Can I segment, filter, search, and sort by sentiment in the platform?
  o How refined do I want my sentiment—are Positive and Negative enough, or do Mixed and Neutral sentiment scores matter?
• Is my subject matter more contextual, or does a simple keyword/phrase match identify relevant content? The more contextual the subject matter is, the more time you or an analyst will spend reading content for relevance to the subject matter, and sentiment will be potentially less meaningful.
• How does the vendor define accuracy for sentiment, and based on what criteria? Is this the same for all vendors I'm considering?
• How is sentiment applied? At the phrase, sentence, paragraph, or document level? Does it depend on the context of the query? Does this make sense for the type of subject matter and depth we want to understand?
• Was the sentiment system built for social media? Is it based on the unique communication styles and forms of social media data?
• Does the platform allow me to override or otherwise earmark a sentiment score I disagree with or want to report differently?

To learn more about Visible's approach to social media sentiment assessment, read our white paper The Visible Approach to Assessing Social Media Sentiment, found in the V-IQ Lounge at http://www.visibletechnologies.com/resources/

References

[Liu09] Liu, B. Sentiment analysis and subjectivity. Handbook of Natural Language Processing, Second Edition, 2009.

[PangLee08] Pang, B. and Lee, L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.
