Opinion Mining
• Opinion mining is interested in the opinion a particular piece of discourse expresses
– Opinions are subjective statements reflecting people’s sentiments or perceptions of entities or events
• There are various problems associated with opinion mining
– Identify whether a piece of text is opinionated or not (factual news vs. editorial)
– Identify the entity expressing the opinion
– Identify the polarity and degree of the opinion (in favour vs.
against)
– Identify the theme of the opinion (opinion about what?)
Application
• Combine information extraction from company Web sites with OM findings
– Given a review, find the company's web pages and extract factual information from them, including products and services
– Associate the opinion with the extracted information
• Use information extraction to identify positive/negative phrases and the “object” of the opinion (see the sketch after the examples below)
– Positive: correctly packed bulb, a totally free service, a very
efficient management…
– Negative: the same disappointing experience, unscrupulous
double glazing sales, do not buy a sofa from DFS Poole or DFS
anywhere, the utter inefficiency…
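As a rough illustration of this phrase-spotting step, here is a minimal Python sketch using a hand-made gazetteer of the phrases above; the actual application uses information extraction rules in GATE rather than a simple lookup, and the phrase lists and the find_opinion_phrases helper are purely illustrative.

```python
# Illustrative gazetteer-based spotting of positive/negative opinion phrases.
# NOT the application's GATE-based information extraction; the lists are toy examples.
POSITIVE_PHRASES = ["correctly packed bulb", "a totally free service", "a very efficient management"]
NEGATIVE_PHRASES = ["the same disappointing experience", "unscrupulous double glazing sales", "the utter inefficiency"]

def find_opinion_phrases(text):
    """Return (phrase, polarity) pairs found in the text."""
    text_lower = text.lower()
    hits = [(p, "positive") for p in POSITIVE_PHRASES if p in text_lower]
    hits += [(p, "negative") for p in NEGATIVE_PHRASES if p in text_lower]
    return hits

print(find_opinion_phrases("They offered a totally free service, but the utter inefficiency ruined it."))
# [('a totally free service', 'positive'), ('the utter inefficiency', 'negative')]
```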
(Screenshots: example reviews annotated with opinions, and with positive and negative opinions highlighted)
OM as text classification
• Because we have access to documents which already have an associated class, we see OM as a classification problem – we consider our data “opinionated”
• We are interested in:
– differentiating between positive and negative opinions
• “customer service is diabolical”
• “I have always been impressed with this company”
– recognising fine grained evaluative texts (1-star to 5-star classification)
• “one of the easiest companies to order with” (5-stars)
• “STAY AWAY FROM THIS SUPPLIER!!!” (1-star)
• We use a supervised learning approach (Support Vector Machines) based on linguistic features; the system decides which features are most valuable for classification
• We use precision, recall, and F-score to assess classification accuracy
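As a small illustration of how these measures are computed (assuming scikit-learn; the labels and predictions below are invented, not project results):

```python
# Toy precision/recall/F-score computation for a binary opinion classifier.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["pos", "neg", "neg", "pos", "neg", "neg"]   # gold labels (invented)
y_pred = ["pos", "neg", "pos", "pos", "neg", "neg"]   # classifier output (invented)

precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="pos")
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```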
Corpus
• We have a customisable crawling process to collect all texts from Web fora
• 92 texts from a Web Consumer forum
– Each text contains a review about a particular company/service/product
and a thumbs up/down – texts are short (one/two paragraphs)
– 67% negative and 33% positive
• 600 texts from another Web forum containing reviews on companies or products
– Each text is short and is associated with a 1- to 5-star review
– * ~ 8%; ** ~ 2%; *** ~ 3%; **** ~ 20%; ***** ~ 67%
• Each document is analysed to separate the commentary/review from the rest of the document and to associate a class with each review
• After this, the documents are processed with GATE processing resources:
– tokenisation; sentence identification; part-of-speech tagging; morphological analysis; named entity recognition; and sentence parsing
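The slides do not show the GATE pipeline itself; as an illustrative stand-in only, the same preprocessing stages can be sketched in Python with spaCy (the system described here uses GATE processing resources, not spaCy):

```python
# Stand-in sketch of the preprocessing stages (tokens, sentences, POS, lemmas, NER).
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenisation, sentence splitting, tagging, lemmatisation, NER, parsing
doc = nlp("I have always been impressed with this company. Customer service from DFS was diabolical.")

for sent in doc.sents:                               # sentence identification
    for tok in sent:
        print(tok.text, tok.lemma_, tok.pos_,        # token, morphological root, part of speech
              "UPPER" if tok.is_upper else "lower")  # crude orthography feature
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities
```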
SVMs for OM
• Support Vector Machines (SVMs) are algorithms that perform very well for classification and have also been used in information extraction
• Learning in an SVM is treated as a binary classification problem, and a multiclass problem is transformed into a set of n binary classification problems
• Given a set of training examples, each represented as a vector in a feature space, the SVM tries to find a hyperplane which separates positive from negative instances
• Given a new instance, the SVM identifies on which side of the hyperplane the instance lies and produces the classification accordingly
• The distance from the hyperplane to the positive and negative instances is the margin; we use the SVM with uneven margins implementation available in GATE
• In order to use them, we need to specify how instances are represented and
decide on a number of parameters usually adjusted experimentally over
training data
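A minimal scikit-learn sketch of this set-up: instances become feature vectors and a multiclass problem is decomposed into binary classifiers. Class weighting is used here only as a rough analogue of the uneven-margins SVM available in GATE, and the toy texts and labels are invented:

```python
# Sketch: multiclass opinion classification as a set of binary SVMs over feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["one of the easiest companies to order with",
         "STAY AWAY FROM THIS SUPPLIER!!!",
         "average experience, nothing special"]
labels = ["5-star", "1-star", "3-star"]              # invented toy data

clf = make_pipeline(
    CountVectorizer(),                                        # instances represented as feature vectors
    OneVsRestClassifier(LinearSVC(class_weight="balanced")),  # one binary problem per class
)
clf.fit(texts, labels)

print(clf.decision_function(["very easy to order"]))  # which side of each hyperplane, and how far
print(clf.predict(["very easy to order"]))
```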
Bag-of-words binary classification
• We decided to start by investigating a very simple approach – a word-based or bag-of-words approach (which usually works very well in text classification), where each word is described by:
– the original word
– the root or lemma of the word (for “running” we use “run”)
– the part-of-speech category of the word (determiner, noun, verb, etc.)
– the orthography of the word (all uppercase, lowercase, etc.)
• Each sentence/text is represented as a vector of features and values
– we tried different combinations of features (different n-grams)
– 10-fold cross-validation experiments were run over the corpus with binary classifications (up/down)
– the combination of root and orthography (unigram) provides the best classifier
• around 80% F-score
– using higher-order n-grams decreases the performance of the classifier
– using more features does not necessarily improve performance
– an uninformed classifier (always predicting the majority class) would achieve 67% accuracy
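A sketch of the root-plus-orthography unigram representation with 10-fold cross-validation, assuming scikit-learn and spaCy are available; load_forum_corpus is a hypothetical loader standing in for the 92-text corpus and its up/down labels:

```python
# Sketch: bag-of-words classification using lemma (root) + orthography unigram features.
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def root_orth_tokens(text):
    """Turn a text into unigram features combining each token's lemma and orthography."""
    return [f"{tok.lemma_.lower()}|{'UPPER' if tok.is_upper else 'lower'}" for tok in nlp(text)]

texts, labels = load_forum_corpus()  # hypothetical loader: reviews plus thumbs up/down labels

pipeline = make_pipeline(CountVectorizer(analyzer=root_orth_tokens), LinearSVC())
scores = cross_val_score(pipeline, texts, labels,
                         cv=StratifiedKFold(n_splits=10), scoring="f1_macro")
print(f"mean F-score over 10 folds: {scores.mean():.2f}")
```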
Bag-of-words fine-grained classification
• The same learning system was used to produce the 5-star classification over the fine-grained dataset
• The same feature combinations were studied:
– 74% overall classification accuracy using the word root only
– other combinations degrade performance
– 1* classification accuracy = 80%; 5* classification accuracy = 75%
– 2* = 2%; 3* = 3%; 4* = 19%
– 2*, 3*, and 4* are difficult to classify because they either share vocabulary with the extreme cases or are vague
Sentiment-based Classifier
• Engineered features based on “linguistic” and sentiment information associated with words
• Linguistic features
– word-based features are restricted to adjectives and adverbs and their bigram combinations
– “good”, “bad”, “rather”, “quite”, “not”, etc.
• Sentiment information
– WordNet lexical database where words appear with their senses and synonyms
• chair = the furniture
• chair, professorship = the position
• chair, president, chairman, … = the officer
• chair, electric chair, … = execution instrument
– SentiWordNet adds sentiment information to WordNet and has been used in
opinion mining and sentiment analysis
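For illustration, the WordNet senses of “chair” and their SentiWordNet scores can be inspected through NLTK's interfaces to both resources (this is only a look at the resources, not the system's own code):

```python
# Inspect WordNet senses and SentiWordNet scores for "chair".
# Requires: nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn

for synset in wn.synsets("chair"):
    senti = swn.senti_synset(synset.name())
    print(synset.name(), "-", synset.definition())
    print("   pos =", senti.pos_score(), " neg =", senti.neg_score(), " obj =", senti.obj_score())
```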
Sentiment-based classifier
• SentiWordNet (cont.)
– each word sense has three numerical scores (between 0 and 1): obj, pos, neg (obj + pos + neg = 1)
Sentiment-based classifier
• Features deduced from SentiWordNet
– word analysis:
• countP(w): the number of senses of w for which pos(w) > neg(w)
• countN(w): the number of senses of w for which neg(w) > pos(w)
• countF(w): the total number of entries (senses) of w in SentiWordNet
– sentence analysis:
• sentiP: number of positive words in the sentence
– a word is positive if countP(w) > ½ countF(w)
• sentiN: number of negative words in the sentence
– a word is negative if countN(w) > ½ countF(w)
• senti: pos (sentiP > sentiN), neg (sentiN > sentiP), or neutral otherwise (sentence-level feature)
– text analysis:
• count_pos: number of pos sentences in the text
• count_neg: number of neg sentences in the text
• count_neutral: number of neutral sentences in the text
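A minimal Python sketch of these SentiWordNet-derived features, using NLTK's SentiWordNet interface; this is one interpretation of the definitions above, not the original GATE implementation:

```python
# Sketch of the SentiWordNet-derived word, sentence, and text features described above.
# Requires: nltk.download('punkt'); nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
from nltk.tokenize import sent_tokenize, word_tokenize

def word_counts(word):
    """countP, countN, countF for a word, over all its SentiWordNet senses."""
    senses = list(swn.senti_synsets(word))
    countF = len(senses)
    countP = sum(1 for s in senses if s.pos_score() > s.neg_score())
    countN = sum(1 for s in senses if s.neg_score() > s.pos_score())
    return countP, countN, countF

def sentence_senti(sentence):
    """senti label for a sentence: pos, neg, or neutral."""
    sentiP = sentiN = 0
    for w in word_tokenize(sentence.lower()):
        countP, countN, countF = word_counts(w)
        if countP > countF / 2:       # word counted as positive
            sentiP += 1
        elif countN > countF / 2:     # word counted as negative
            sentiN += 1
    if sentiP > sentiN:
        return "pos"
    if sentiN > sentiP:
        return "neg"
    return "neutral"

def text_features(text):
    """count_pos, count_neg, count_neutral over the sentences of a text."""
    labels = [sentence_senti(s) for s in sent_tokenize(text)]
    return {"count_pos": labels.count("pos"),
            "count_neg": labels.count("neg"),
            "count_neutral": labels.count("neutral")}

print(text_features("I have always been impressed with this company. Customer service is diabolical."))
```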
Sentiment-based Classifier
• Each text is represented as a vector of features and values
– combining the linguistic features (adjectives, adverbs, and their
combinations) and the senti, count_pos, count_neg,
count_neutral features
– 10-fold cross-validation experiments were run over the corpus with binary classifications (up/down)
• overall F-score 76%
– 10-fold cross-validation over the fine-grained corpus
• overall F-score 72%
• 1* = 58%, 2* = 24%, 3* = 20%, 4* = 19%, 5* = 83% (a better job on the less extreme categories than the bag-of-words classifier)