Opinion Mining
• Opinion mining is interested in the opinion a particular piece of discourse expresses
– Opinions are subjective statements reflecting people’s sentiments or perceptions of entities or events
• There are various problems associated with opinion mining
– Identify whether a piece of text is opinionated or not (factual news vs. editorial)
– Identify the entity expressing the opinion
– Identify the polarity and degree of the opinion (in favour vs.
against)
– Identify the theme of the opinion (opinion about what?)
Application
• Combine information extraction from company Web sites with OM findings
– Given a review, find the company's web pages and extract factual information from them, including products and services
– Associate the opinion with the extracted information
• Use information extraction to identify positive/negative phrases and the “object” of the opinion (see the sketch after the examples below)
– Positive: correctly packed bulb, a totally free service, a very
efficient management…
– Negative: the same disappointing experience, unscrupulous
double glazing sales, do not buy a sofa from DFS Poole or DFS
anywhere, the utter inefficiency…
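As a rough illustration of this phrase-spotting step, here is a minimal Python sketch using a hand-made gazetteer of the phrases above; the actual application uses information extraction rules in GATE rather than a simple lookup, and the phrase lists and the find_opinion_phrases helper are purely illustrative.

```python
# Illustrative gazetteer-based spotting of positive/negative opinion phrases.
# NOT the application's GATE-based information extraction; the lists are toy examples.
POSITIVE_PHRASES = ["correctly packed bulb", "a totally free service", "a very efficient management"]
NEGATIVE_PHRASES = ["the same disappointing experience", "unscrupulous double glazing sales", "the utter inefficiency"]

def find_opinion_phrases(text):
    """Return (phrase, polarity) pairs found in the text."""
    text_lower = text.lower()
    hits = [(p, "positive") for p in POSITIVE_PHRASES if p in text_lower]
    hits += [(p, "negative") for p in NEGATIVE_PHRASES if p in text_lower]
    return hits

print(find_opinion_phrases("They offered a totally free service, but the utter inefficiency ruined it."))
# [('a totally free service', 'positive'), ('the utter inefficiency', 'negative')]
```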
(Screenshots: example reviews annotated with opinions, and with positive and negative opinions highlighted)
OM as text classification
• Because we have access to documents which already have an associated class, we see OM as a classification problem – we consider our data “opinionated”
• We are interested in:
– differentiating between positive and negative opinions
• “customer service is diabolical”
• “I have always been impressed with this company”
– recognising fine grained evaluative texts (1-star to 5-star classification)
• “one of the easiest companies to order with” (5-stars)
• “STAY AWAY FROM THIS SUPPLIER!!!” (1-star)
• We use a supervised learning approach (Support Vector Machines) based on linguistic features; the system decides which features are most valuable for classification
• We use precision, recall, and F-score to assess classification accuracy
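As a small illustration of how these measures are computed (assuming scikit-learn; the labels and predictions below are invented, not project results):

```python
# Toy precision/recall/F-score computation for a binary opinion classifier.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["pos", "neg", "neg", "pos", "neg", "neg"]   # gold labels (invented)
y_pred = ["pos", "neg", "pos", "pos", "neg", "neg"]   # classifier output (invented)

precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="pos")
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```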
Corpus
• We have a customisable crawling process to collect all texts from Web fora
• 92 texts from a Web Consumer forum
– Each text contains a review about a particular company/service/product
and a thumbs up/down – texts are short (one/two paragraphs)
– 67% negative and 33% positive
• 600 texts from another Web forum containing reviews on companies or products
– Each text is short and is associated with a 1- to 5-star review
– * ~ 8%; ** ~ 2%; *** ~ 3%; **** ~ 20%; ***** ~ 67%
• Each document is analysed to separate the commentary/review from the rest of the document and to associate a class with each review
• After this, the documents are processed with GATE processing resources:
– tokenisation; sentence identification; part-of-speech tagging; morphological analysis; named entity recognition; and sentence parsing
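The slides do not show the GATE pipeline itself; as an illustrative stand-in only, the same preprocessing stages can be sketched in Python with spaCy (the system described here uses GATE processing resources, not spaCy):

```python
# Stand-in sketch of the preprocessing stages (tokens, sentences, POS, lemmas, NER).
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenisation, sentence splitting, tagging, lemmatisation, NER, parsing
doc = nlp("I have always been impressed with this company. Customer service from DFS was diabolical.")

for sent in doc.sents:                               # sentence identification
    for tok in sent:
        print(tok.text, tok.lemma_, tok.pos_,        # token, morphological root, part of speech
              "UPPER" if tok.is_upper else "lower")  # crude orthography feature
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities
```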
SVMs for OM
• Support Vector Machines (SVMs) are algorithms that perform very well for classification and have also been used in information extraction
• Learning in an SVM is treated as a binary classification problem, and a multiclass problem is transformed into a set of n binary classification problems
• Given a set of training examples, each represented as a vector in a feature space, the SVM tries to find a hyperplane which separates positive from negative instances
• Given a new instance, the SVM identifies on which side of the hyperplane the instance lies and produces the classification accordingly
• The distance from the hyperplane to the positive and negative instances is the margin; we use the SVM with uneven margins implementation available in GATE
• In order to use them, we need to specify how instances are represented and
decide on a number of parameters usually adjusted experimentally over
training data
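A minimal scikit-learn sketch of this set-up: instances become feature vectors and a multiclass problem is decomposed into binary classifiers. Class weighting is used here only as a rough analogue of the uneven-margins SVM available in GATE, and the toy texts and labels are invented:

```python
# Sketch: multiclass opinion classification as a set of binary SVMs over feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["one of the easiest companies to order with",
         "STAY AWAY FROM THIS SUPPLIER!!!",
         "average experience, nothing special"]
labels = ["5-star", "1-star", "3-star"]              # invented toy data

clf = make_pipeline(
    CountVectorizer(),                                        # instances represented as feature vectors
    OneVsRestClassifier(LinearSVC(class_weight="balanced")),  # one binary problem per class
)
clf.fit(texts, labels)

print(clf.decision_function(["very easy to order"]))  # which side of each hyperplane, and how far
print(clf.predict(["very easy to order"]))
```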
Bag-of-words binary classification
• We decided to start by investigating a very simple approach – a word-based or bag-of-words approach (which usually works very well in text classification), where each word is described by:
– the original word
– the root or lemma of the word (for “running” we use “run”)
– the part-of-speech category of the word (determiner, noun, verb, etc.)
– the orthography of the word (all uppercase, lowercase, etc.)
• Each sentence/text is represented as a vector of features and values
– we tried different combinations of features (different n-grams)
– 10-fold cross-validation experiments were run over the corpus with binary classifications (up/down)
– the combination of root and orthography (unigram) provides the best classifier
• around 80% F-score
– using higher-order n-grams decreases the performance of the classifier
– using more features does not necessarily improve performance
– an uninformed classifier (always predicting the majority class) would achieve 67% accuracy
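A sketch of the root-plus-orthography unigram representation with 10-fold cross-validation, assuming scikit-learn and spaCy are available; load_forum_corpus is a hypothetical loader standing in for the 92-text corpus and its up/down labels:

```python
# Sketch: bag-of-words classification using lemma (root) + orthography unigram features.
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def root_orth_tokens(text):
    """Turn a text into unigram features combining each token's lemma and orthography."""
    return [f"{tok.lemma_.lower()}|{'UPPER' if tok.is_upper else 'lower'}" for tok in nlp(text)]

texts, labels = load_forum_corpus()  # hypothetical loader: reviews plus thumbs up/down labels

pipeline = make_pipeline(CountVectorizer(analyzer=root_orth_tokens), LinearSVC())
scores = cross_val_score(pipeline, texts, labels,
                         cv=StratifiedKFold(n_splits=10), scoring="f1_macro")
print(f"mean F-score over 10 folds: {scores.mean():.2f}")
```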
Bag-of-words fine-grained classification
• The same learning system was used to produce the 5-star classification over the fine-grained dataset
• The same feature combinations were studied:
– 74% overall classification accuracy using the word root only
– other combinations degrade performance
– 1* classification accuracy = 80%; 5* classification accuracy = 75%
– 2* = 2%; 3* = 3%; 4* = 19%
– 2*, 3*, and 4* are difficult to classify because they either share vocabulary with the extreme cases or are vague
Sentiment-based Classifier
• Engineered features based on “linguistic” and sentiment information associated with words
• Linguistic features
– word-based features are restricted to adjectives and adverbs and their bigram combinations
– “good”, “bad”, “rather”, “quite”, “not”, etc.
• Sentiment information
– WordNet lexical database where words appear with their senses and synonyms
• chair = the furniture
• chair, professorship = the position
• chair, president, chairman, … = the officer
• chair, electric chair, … = execution instrument
– SentiWordNet adds sentiment information to WordNet and has been used in
opinion mining and sentiment analysis
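For illustration, the WordNet senses of “chair” and their SentiWordNet scores can be inspected through NLTK's interfaces to both resources (this is only a look at the resources, not the system's own code):

```python
# Inspect WordNet senses and SentiWordNet scores for "chair".
# Requires: nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn

for synset in wn.synsets("chair"):
    senti = swn.senti_synset(synset.name())
    print(synset.name(), "-", synset.definition())
    print("   pos =", senti.pos_score(), " neg =", senti.neg_score(), " obj =", senti.obj_score())
```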
Sentiment-based classifier
• SentiWordNet (cont.)
– each word sense has three numerical scores (between 0 and 1): obj, pos, neg (obj + pos + neg = 1)
Sentiment-based classifier
• Features deduced from SentiWordNet
– word analysis:
• countP(w): the number of senses of w for which pos(w) > neg(w)
• countN(w): the number of senses of w for which neg(w) > pos(w)
• countF(w): the total number of entries (senses) of w in SentiWordNet
– sentence analysis:
• sentiP: number of positive words in the sentence
– a word is positive if countP(w) > ½ countF(w)
• sentiN: number of negative words in the sentence
– a word is negative if countN(w) > ½ countF(w)
• senti: pos (sentiP > sentiN), neg (sentiN > sentiP), or neutral otherwise (sentence-level feature)
– text analysis:
• count_pos: number of pos sentences in the text
• count_neg: number of neg sentences in the text
• count_neutral: number of neutral sentences in the text
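A minimal Python sketch of these SentiWordNet-derived features, using NLTK's SentiWordNet interface; this is one interpretation of the definitions above, not the original GATE implementation:

```python
# Sketch of the SentiWordNet-derived word, sentence, and text features described above.
# Requires: nltk.download('punkt'); nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
from nltk.tokenize import sent_tokenize, word_tokenize

def word_counts(word):
    """countP, countN, countF for a word, over all its SentiWordNet senses."""
    senses = list(swn.senti_synsets(word))
    countF = len(senses)
    countP = sum(1 for s in senses if s.pos_score() > s.neg_score())
    countN = sum(1 for s in senses if s.neg_score() > s.pos_score())
    return countP, countN, countF

def sentence_senti(sentence):
    """senti label for a sentence: pos, neg, or neutral."""
    sentiP = sentiN = 0
    for w in word_tokenize(sentence.lower()):
        countP, countN, countF = word_counts(w)
        if countP > countF / 2:       # word counted as positive
            sentiP += 1
        elif countN > countF / 2:     # word counted as negative
            sentiN += 1
    if sentiP > sentiN:
        return "pos"
    if sentiN > sentiP:
        return "neg"
    return "neutral"

def text_features(text):
    """count_pos, count_neg, count_neutral over the sentences of a text."""
    labels = [sentence_senti(s) for s in sent_tokenize(text)]
    return {"count_pos": labels.count("pos"),
            "count_neg": labels.count("neg"),
            "count_neutral": labels.count("neutral")}

print(text_features("I have always been impressed with this company. Customer service is diabolical."))
```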
Sentiment-based Classifier
• Each text is represented as a vector of features and values
– combining the linguistic features (adjectives, adverbs, and their
combinations) and the senti, count_pos, count_neg,
count_neutral features
– 10-fold cross-validation experiments were run over the corpus with binary classifications (up/down)
• overall F-score 76%
– 10-fold cross-validation over the fine-grained corpus
• overall F-score 72%
• 1* = 58%, 2* = 24%, 3* = 20%, 4* = 19%, 5* = 83% (a better job on the less extreme categories than the bag-of-words classifier)