
NLP

Introduction to NLP

511.
Classification
Classification
• Assigning documents or sentences to predefined
categories
– topics, languages, users …
• Input:
– a document (or sentence) d
– a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C
Variants of Problem Formulation
• Binary categorization: only two categories
– Retrieval: {relevant-doc, irrelevant-doc}
– Spam filtering: {spam, non-spam}
– Opinion: {positive, negative}
• K-category categorization: more than two categories
– Topic categorization: {sports, science, travel, business,…}
– Word sense disambiguation:{bar1, bar2, bar3, …}
• Hierarchical vs. flat
• Overlapping (soft) vs non-overlapping (hard)
Hierarchical Classification

Image from Schütze & Krisnawati


Borges’s Classification
• The list divides all animals into 14 categories:
• Those that belong to the emperor
• Embalmed ones
• Those that are trained
• Sucking pigs
• Mermaids (or Sirens)
• Fabulous ones
• Stray dogs
• Those that are included in this classification
• Those that tremble as if they were mad
• Innumerable ones
• Those drawn with a very fine camel hair brush
• Et cetera
• Those that have just broken the flower vase
• Those that, at a distance, resemble flies

Celestial Emporium of Benevolent Knowledge


Components of a Classification System
Mathematical Formulation
Spam Recognition
Return-Path: <*****@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <*****@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: *****
Subject: Gooday

DEAR SIR

FUNDS FOR INVESTMENTS

THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS


CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL


COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A
RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL
TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE
MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A
SAFE FOREIGN ACCOUNT
SpamAssassin
• http://spamassassin.apache.org/
• http://wiki.apache.org/spamassassin/HowScoresAreAssigned
• http://spamassassin.apache.org/tests_3_3_x.html
• Example features:
– body Incorporates a tracking ID number
– body HTML and text parts are different
– header Date: is 3 to 6 hours before Received: date
– body HTML font size is huge
– header Attempt to obfuscate words in Subject:
– header Subject =~ /^urgent(?:[\s\W]*(dollar) | .{1,40}(?:alert| response|
assistance| proposal| reply| warning| noti(?:ce| fication)| greeting| matter))/i
Features for Classification
• Vector-based
– Words: “cat”, “dog”, “great”, “horrible”, etc.
– Meta features: document length, author name, etc.
– Each document (or sentence) is represented as a
vector in an n-dimensional space
• Similar documents appear nearby in the vector space
– (more later)
Classification in NLP
• Part of speech tagging
• Sentiment analysis
• Word sense disambiguation
• Parsing
• Optical character recognition
• Spelling correction
• Named entity classification
Introduction to NLP

Vector Space Classification


Vector Space Classification

[Figure: documents from topic1 and topic2 plotted in a two-dimensional feature space with axes x1 and x2.]
Decision surfaces

[Figure: a decision surface separating topic1 from topic2 in the (x1, x2) space.]
Decision trees

[Figure: decision-tree boundaries separating topic1 from topic2 in the (x1, x2) space.]
Classification Using Centroids
• Centroid
– the point most representative of a class
• Compute centroid by finding vector average of known
class members
• Decision boundary is a line that is equidistant from the two centroids
• A new document on one side of the boundary goes in one class; a new document on the other side goes in the other
Linear boundary

[Figure: a linear boundary separating topic1 from topic2 in the (x1, x2) space.]
Classification Using Centroids

[Figure: centroid1 and centroid2 for topic1 and topic2 in the (x1, x2) space, with the decision boundary equidistant from the two centroids.]
Introduction to NLP

Linear Classifiers
Decision Boundary

[Figures: a decision boundary separating two classes, with an example point x and the weight vector w.]
Linear Separators
• Two-dimensional line:
  – w1x1 + w2x2 = b is the linear separator
  – w1x1 + w2x2 > b for the positive class
• In n-dimensional spaces:
  – wᵀx = Σ_{i=1..n} wi·xi = w1x1 + w2x2 + … + wnxn
• One can also fold the bias into the weights by adding a constant feature (e.g., x0 = 1 with weight w0 = b)
• w is the weight vector
• x is the feature vector
Example
• Linear decision rule: wᵀx > b
• Bias b = 0 (in this example)
• Feature weights:
    A: 0.6   B: 0.5   C: 0.5   D: 0.4   E: 0.4   F: 0.3
    G: −0.7  H: −0.5  I: −0.3  J: −0.2  K: −0.2  L: −0.2
• Sentence is “A D E H”
• Its score will be 0.6·1 + 0.4·1 + 0.4·1 + (−0.5)·1 = 0.9 > 0
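A minimal sketch of this scoring computation in Python (the weights are the ones listed above; the function name and the binary feature encoding are illustrative):

```python
# Minimal sketch of linear scoring for the example above; b = 0.
weights = {"A": 0.6, "B": 0.5, "C": 0.5, "D": 0.4, "E": 0.4, "F": 0.3,
           "G": -0.7, "H": -0.5, "I": -0.3, "J": -0.2, "K": -0.2, "L": -0.2}
b = 0.0

def score(sentence_features, weights):
    """Compute w^T x for binary features; positive class if score > b."""
    return sum(weights[f] for f in sentence_features)

s = score(["A", "D", "E", "H"], weights)
print(round(s, 2), "positive" if s > b else "negative")   # 0.9 positive
```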
How to Find the Linear Boundary?
• Find the linear boundary = find w
• Many methods
  – E.g., Perceptron
• Problem:
  – There is an infinite number of linear boundaries if the two classes are linearly separable!
  – Maximum margin: Support Vector Machines (SVM)

[Figure: two linearly separable classes (+ and −) in the (x1, x2) space.]
General Training Idea
• Go through the training data
• Predict the class y (1 or -1)

• If the prediction is wrong, update θ
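A minimal sketch of this mistake-driven loop, instantiated with the perceptron update previewed on the following slide (the toy data, learning rate, and number of epochs are assumptions):

```python
import numpy as np

# Mistake-driven training: predict, and update theta only when wrong.
# data is a list of (feature_vector, label) pairs with labels in {+1, -1}.
def train(data, n_features, epochs=10, lr=1.0):
    theta = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if theta.dot(x) > 0 else -1   # predict the class
            if y_hat != y:                          # wrong -> update theta
                theta += lr * y * x
    return theta

# toy usage: two 2-d points, one per class
data = [(np.array([1.0, 2.0]), 1), (np.array([-1.0, -1.5]), -1)]
print(train(data, n_features=2))
```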


Perceptron Algorithm (preview)
[Example: Chris Bishop]
Interpolation using Features
• Linear interpolation for language modeling
– Estimating the trigram probability P(c|ab) using P_MLE(c|a), P_MLE(c), etc.
– Weights λ1, λ2, etc.
• We may want to consider other features
– E.g., POS tags of previous words, heads, word endings, etc.
• General idea
– Compute the conditional probability P(y|x)
– P(y|x) = sum of weights*features
• Label Y for a given history x in X
Logistic Regression (preview)

[discriminative method] [Greg Durrett]


Naïve Bayes (preview)

• Multinomial Naïve Bayes is a linear model


x= [1, x1, x2 …]
w = [log P(y), log P(w1|y), log P(w2|y) …]
Example from Karl Stratos
NLP
Introduction to NLP

513.
Naïve Bayes
Generative vs. Discriminative Models
Generative:
• Learn a model of the joint probability p(d, c)
• Use Bayes’ Rule to calculate p(c|d)
• Build a model of each class; given an example, return the model most likely to have generated that example
• Examples:
  – Naïve Bayes
  – Gaussian Discriminant Analysis
  – HMM
  – Probabilistic CFG

Discriminative:
• Model the posterior probability p(c|d) directly
• Class is a function of the document vector
• Find the exact function that minimizes classification errors on the training data
• Examples:
  – Logistic regression
  – Neural Networks (NNs)
  – Support Vector Machines (SVMs)
  – Decision Trees
Assumptions of Discriminative
Classifiers
• Data examples (documents) are represented as vectors of features (words, phrases, n-grams, etc.)
• We look for a function that maps each vector to a class
• This function can be found by minimizing the errors on
the training data (plus other various criteria)
• Different classifiers vary on what the function looks like,
and how they find the function
Discriminative vs. Generative
• Discriminative classifiers are generally more effective,
since they directly optimize the classification accuracy. But
– They are all sensitive to the choice of features, and so far these
features are extracted heuristically
– Also, overfitting can happen if data is sparse
• Generative classifiers are the “opposite”
– They directly model text, an unnecessarily harder problem than
classification
– They can easily exploit unlabeled data
Naïve Bayes Intuition

• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
– Bag of words
Bag of Words
Remember Bayes’ Rule?
Bayes’ Rule:  P(c|d) = P(d|c)·P(c) / P(d)
• d is the document (represented as a list of features of a
document, x1, …, xn)
• c is a class (e.g., “not spam”)
Naïve Bayes Classifier

cMAP  argmax P(c | d )


MAP is “maximum a
posteriori” = most likely
cC class

P ( d | c ) P (c )
 argmax Bayes Rule

cC P(d )
 argmax P (d | c) P (c) Dropping the
denominator
cC
Naïve Bayes Classifier

cMAP  argmax P (d | c) P(c)


cC
Document d
 argmax P ( x1 , x2 ,  , xn | c) P (c) represented as
features x1..xn
cC

But where will we get these probabilitites?


Naïve Bayesian Classifiers

• Naïve Bayesian classifier:

  P(d ∈ C | F1, F2, …, Fk) = P(F1, F2, …, Fk | d ∈ C)·P(d ∈ C) / P(F1, F2, …, Fk)

• Assuming statistical independence:

  P(d ∈ C | F1, F2, …, Fk) = ∏_{j=1..k} P(Fj | d ∈ C) · P(d ∈ C) / ∏_{j=1..k} P(Fj)

• Features = words (or phrases) typically


Training Naïve Bayes
Learning the Parameters

• First attempt: maximum likelihood estimates
  – use the frequencies in the data

  P̂(cj) = doccount(C = cj) / N_doc

  P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
Parameter Estimation

  P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
             = fraction of times word wi appears among all words in documents of topic cj

• Create a mega-document for topic j by concatenating all docs in this topic
  – Use frequency of w in the mega-document
Problem with Maximum Likelihood

• What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?

  P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence!

  c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Laplace Smoothing
  P̂(wi | c) = count(wi, c) / Σ_{w∈V} count(w, c)

With add-one (Laplace) smoothing:

  P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
            = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
Multinomial Naïve Bayes
Independence Assumptions
  P(x1, x2, …, xn | c)

• Bag of Words assumption
  – Assume position doesn’t matter
• Conditional Independence
  – Assume the feature probabilities P(xi|c) are independent given the class c

  P(x1, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)
[Jurafsky and Martin]
Multinomial Naïve Bayes

cMAP  argmax P( x1 , x2 ,  , xn | c) P (c)


cC

c NB  argmax P(c j ) P ( x | c)
cC x X

This is why it’s naïve!


[Jurafsky and Martin]
Multinomial NB - Learning

• From the training corpus, extract the Vocabulary
• Calculate the P(cj) terms
  – For each cj in C do
      docsj ← all docs with class = cj
      P(cj) = |docsj| / |total # of documents|
• Calculate the P(wk | cj) terms
  – Textj ← single doc containing all docsj
  – For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      P(wk | cj) = (nk + α) / (n + α·|Vocabulary|), where n = total # of word tokens in Textj
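A minimal Python sketch of this training procedure with add-α smoothing (tokenization, the data format, and the function names are assumptions, not part of the slides):

```python
import math
from collections import Counter, defaultdict

# Multinomial NB training with add-alpha (Laplace) smoothing, following the
# pseudocode above; docs is a list of (list_of_tokens, class_label) pairs.
def train_nb(docs, alpha=1.0):
    vocab = {w for tokens, _ in docs for w in tokens}
    classes = {c for _, c in docs}
    log_prior, log_likelihood = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for tokens, y in docs if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for tokens in class_docs for w in tokens)  # "mega-document"
        total = sum(counts.values())
        for w in vocab:
            log_likelihood[c][w] = math.log((counts[w] + alpha) /
                                            (total + alpha * len(vocab)))
    return log_prior, log_likelihood, vocab

def classify_nb(tokens, log_prior, log_likelihood, vocab):
    # Sum of log probabilities (see the underflow-prevention slide later on)
    scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```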
Multinomial NB Classifier for Text

  positions ← all word positions in the test document

  c_NB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
Generative Model for Multinomial NB

[Figure: class c = China generates the words X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds.]


Example
• Features = {I, hate, love, this, book}
• Training:
  – "I hate this book" (class 1)
  – "Love this book" (class 2)
• Prior: P(Y) = [1/2, 1/2]
• Count matrix M (rows = classes, columns = features):
    M = [1 1 0 1 1]
        [0 0 1 1 1]
• What is P(Y|X)?
• Testing: "hate book"
• Different conditions:
  – α = 0 (no smoothing):
      P(X|Y) = [1/4  1/4  0    1/4  1/4]
               [0    0    1/3  1/3  1/3]
      P(Y|X) ∝ [1/2 × 1/4 × 1/4, 1/2 × 0 × 1/3] → [1, 0]
  – α = 1 (smoothing):
      M = [2 2 1 2 2]
          [1 1 2 2 2]
      P(X|Y) = [2/9  2/9  1/9  2/9  2/9]
               [1/8  1/8  2/8  2/8  2/8]
      P(Y|X) ∝ [1/2 × 2/9 × 2/9, 1/2 × 1/8 × 2/8] → [0.613, 0.387]
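A small Python check that reproduces the numbers in this example (the array layout mirrors the count matrix M above):

```python
import numpy as np

# Vocabulary {I, hate, love, this, book}, two training documents, test "hate book".
vocab = ["I", "hate", "love", "this", "book"]
M = np.array([[1, 1, 0, 1, 1],      # counts for class 1: "I hate this book"
              [0, 0, 1, 1, 1]])     # counts for class 2: "Love this book"
prior = np.array([0.5, 0.5])
test = ["hate", "book"]

for alpha in (0.0, 1.0):
    smoothed = M + alpha
    p_x_given_y = smoothed / smoothed.sum(axis=1, keepdims=True)
    idx = [vocab.index(w) for w in test]
    scores = prior * p_x_given_y[:, idx].prod(axis=1)
    print(alpha, scores / scores.sum())
# alpha=0 -> [1. 0.]; alpha=1 -> approximately [0.613 0.387]
```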
Another Example

[Greg Durrett]
Summary (1)

[Greg Durrett]
Summary (2)

[Greg Durrett]
Not So Naïve after all!
• Very fast, low storage requirements
• Robust to irrelevant features
• Irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features
– Decision trees suffer from fragmentation in such cases – especially if little data
• Optimal if the independence assumptions hold
– If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good, dependable baseline for text classification
– But other classifiers give better accuracy
[Jurafsky and Martin]
Naïve Bayes as a Linear Model

• Multinomial Naïve Bayes is a linear model


x= [1, x1, x2 …]
w = [log P(y), log P(w1|y), log P(w2|y) …]
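To see why, a brief derivation (assuming xi is the count of word wi in the document):

  log [ P(y) · ∏_i P(wi|y)^xi ] = log P(y) + Σ_i xi · log P(wi|y) = w · x

so choosing the class with the highest Naïve Bayes score is the same as choosing the class with the highest linear score w · x.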
Summary of NB (1/3)
Summary of NB (2/3)
Summary of NB (3/3)
NB is a generative model
NLP
Introduction to NLP

541.
Text Classification
Who wrote which Federalist papers?

• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

[Images: James Madison, Alexander Hamilton]


Male or female author?
1. By 1925 present-day Vietnam was divided into three parts
under French colonial rule. The southern region embracing
Saigon and the Mekong delta was the colony of Cochin-
China; the central area with its imperial capital at Hue was
the protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary
felicity of her own name. She found it hard to trust herself to
the mercy of fate, which had managed over the years to
convert her greatest shame into one of her greatest assets…
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3,
pp. 321–346
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and
some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the
boxing scenes.
What is the subject of this article?
MEDLINE Article → MeSH Subject Category Hierarchy (which category?)
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Introduction to NLP

Feature Selection
Feature selection: The χ² test
• For a term t:

              I_t = 0   I_t = 1
  C = 0        k00       k01
  C = 1        k10       k11

• C = class, I_t = indicator feature for term t
• Testing for independence: P(C=0, I_t=0) should be equal to P(C=0)·P(I_t=0)

  P(C=0) = (k00 + k01)/n
  P(C=1) = 1 − P(C=0) = (k10 + k11)/n
  P(I_t=0) = (k00 + k10)/n
  P(I_t=1) = 1 − P(I_t=0) = (k01 + k11)/n
Feature Selection: The χ² test

  χ² = n·(k11·k00 − k10·k01)² / ((k11 + k10)·(k01 + k00)·(k11 + k01)·(k10 + k00))

• High values of χ² indicate lower belief in independence.
• In practice, compute χ² for all words and pick the top k among them.
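A minimal sketch of this computation in Python (the four counts in the usage line are illustrative):

```python
# Chi-squared score from a 2x2 term/class contingency table, using the formula above;
# kXY = count of documents with C = X and I_t = Y.
def chi_squared(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den if den else 0.0

# toy usage: a term that co-occurs strongly with class 1
print(chi_squared(k00=80, k01=20, k10=10, k11=90))
```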
Mutual Information
• Measures amount of information
– X = word; Y = class
  MI(X, Y) = Σ_x Σ_y P(x, y) · log [ P(x, y) / (P(x)·P(y)) ]
– if the distribution is the same as the background distribution, then MI=0
• Documents are assumed to be generated
according to the multinomial model
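A minimal sketch of the MI computation above for a single word/class pair (the joint probabilities are illustrative):

```python
import math

# p_xy[x][y] is the joint probability P(x, y) for x, y in {0, 1}.
def mutual_information(p_xy):
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    mi = 0.0
    for x in range(2):
        for y in range(2):
            if p_xy[x][y] > 0:
                mi += p_xy[x][y] * math.log(p_xy[x][y] / (p_x[x] * p_y[y]))
    return mi

print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))     # > 0: word and class are dependent
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0: same as the background distribution
```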
Introduction to NLP

Evaluation of Classification
Accuracy

• Percentage of items correctly classified
• Not a great metric for imbalanced data sets. Why?
Contingency Table
A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

  F = 1 / (α·(1/P) + (1−α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)

• The harmonic mean is a very conservative average
• People usually use the balanced F1 measure
  – i.e., with β = 1 (that is, α = ½): F1 = 2PR / (P + R)
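A minimal sketch of computing P, R, and F1 from binary counts (tp, fp, fn in the usage line are illustrative):

```python
# Precision, recall, and F1 from a binary contingency table.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf1(tp=40, fp=10, fn=20))   # (0.8, 0.666..., 0.727...)
```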
More than two classes
[Sec. 15.2.4]

Micro-averaging and Macro-averaging
• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: Compute performance for each class, then average.
• Microaveraging: Collect decisions for all classes, compute the contingency table, evaluate.
Micro-averaging and Macro-averaging
Classic Reuters-21578 Data Set
• Most (over)used data set, 21,578 docs (each 90 types, 200 tokens)
• 9603 training, 3299 test articles (ModApte/Lewis split)
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
• Average document (with at least one category) has 1.24 classes
• Only about 10 out of 118 categories are large
Common categories (#train, #test):
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
Confusion Matrix
• For each pair of classes <c1,c2> how many documents
from c1 were incorrectly assigned to c2?
– c3,2: 90 wheat documents incorrectly assigned to poultry
Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK                 95              1                 13               0                  1                  0
True poultry             0              1                  0               0                  0                  0
True wheat              10             90                  0               1                  0                  0
True coffee              0              0                  0              34                  3                  7
True interest            –              1                  2              13                 26                  5
True trade               0              0                  2              14                  5                 10
Pitfall: Overfitting

What happens when your model learns your training data a little too well?
Development Test Sets and Cross-validation

[Figure: the data is split into a Training set, a Development Test Set, and a Test Set; under cross-validation the train/dev split rotates over the data while the Test Set is held out.]

• Metric: P/R/F1 or Accuracy
• Unseen test set
  – avoid overfitting (‘tuning to the test set’)
  – more conservative estimate of performance
  – Cross-validation over multiple splits
• Handle sampling errors from different datasets
  – Pool results over each split
  – Compute pooled dev set performance
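A minimal sketch of the cross-validation procedure described above (train_and_eval is a hypothetical placeholder for training a classifier and returning a dev-set metric):

```python
import random

# k-fold cross-validation with an averaged (pooled) dev metric.
def cross_validate(data, train_and_eval, k=5, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        dev = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(train_and_eval(train, dev))
    return sum(scores) / k

# toy usage with a dummy evaluator that just reports the dev-fold size fraction
data = list(range(20))
print(cross_validate(data, train_and_eval=lambda tr, dev: len(dev) / (len(tr) + len(dev))))
```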
Cross-validation
Underflow Prevention: log space
• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y)
– Better to sum logs of probabilities instead of multiplying probabilities.
• Class with highest un-normalized log probability score is still most
probable.

  c_NB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]

• Model is now just max of sum of weights
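A small illustration of the underflow problem and the log-space fix (the probabilities are illustrative):

```python
import math

# Multiplying many small probabilities underflows to 0.0; summing their logs does not.
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0 due to floating-point underflow
print(log_sum)   # about -1151.3: still usable for argmax comparisons
```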


NLP
Introduction to NLP

516.
Logistic Regression
Combining Features
• Linear interpolation for language modeling
– Estimating the trigram probability P(c|ab)?
– features: P_MLE(c|a), P_MLE(c), etc.
– Weights: λ1, λ2, etc.
• We may want to consider other features
– E.g., POS tags of previous words, heads, word endings, etc.
• General idea
– Compute the conditional probability P(y|x)
– P(y|x) = sum of weights*features
• Label Y for a given history x in X
Logistic Regression
• Similar to Naïve Bayes (but discriminative!)
– Log-linear model
– Features don’t have to be independent
• Examples of features
– Anything of use
– Linguistic and non-linguistic
– Count of “good”
– Count of “not good”
– Sentence length
• Convex optimization problem
– It can be solved using gradient descent
Feature Templates

• For each word $w


– $w_count
– $w_ends_in_er
– $w_is_negated
– $w_is_all_caps
Logistic Regression
• An example of a discriminative classifier
• Input:
– Training example pairs (x⃗, y), where x⃗ is the feature vector and y is the label
• Goal:
– Build a model that predicts the probability of the label
• Output:
– Set of weights 𝑤 that maximizes likelihood of correct labels on
training examples
Loglinear Models
• Naive Bayes, HMM, and Logistic Regression can all be
expressed in a similar way:
  G(y, θ) = θᵀ·f(x, y)
  p(y | x, θ) ∝ exp G(y, θ)  ⟺  log p(y | x) = C + G(y, θ)
• Decision rule:
  argmax_{y*} G(y*, θ)
Mapping x to a 1-D coordinate

[image from Greg Shakhnarovich]


Logistic Regression

[discriminative method] [Greg Durrett]


Sigmoid Function
Including a Nonlinearity
• Compute the feature vector x
• Multiply with weight vector w

  z = Σ_i wi·xi

• Compute the logistic function (sigmoid)

  f(z) = 1 / (1 + e^(−z))
Examples
• Example 1
    x = (2, 1, 1, 1)
    w = (1, −1, −2, 3)
    z = 2 − 1 − 2 + 3 = 2
    f(z) = 1 / (1 + e^(−2))
• Example 2
    x = (2, 1, 0, 1)
    w = (0, 0, −3, 0)
    z = 0
    f(z) = 1 / (1 + e^0) = 1/2
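A minimal sketch reproducing the two examples above:

```python
import math

# Dot product followed by the sigmoid, as on the slides above.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

print(predict((1, -1, -2, 3), (2, 1, 1, 1)))   # z = 2  -> ~0.88
print(predict((0, 0, -3, 0), (2, 1, 0, 1)))    # z = 0  -> 0.5
```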
Non-linear Activation Functions
Training and Testing LR
Classification using LR
Example
Example
Per-class Probabilities
Period Disambiguation
Cross-Entropy Loss (5.9)
Two examples (.69 vs .31)
Gradient Descent
Bivariate Gradient
Using the Gradient
Updating using the Gradient
Gradient for Logistic Regression
Stochastic Gradient Descent
Example (Cont’d)
Example (Cont’d)
Regularization
L2 Regularization
L1 Regularization
Constraining the Weights (1/2)
Constraining the Weights (2/2)
Multinomial LR

• More than two classes


Using Softmax
Sigmoid vs. Softmax
Features in Multinomial LR
Loss Function for Multinomial LR
Gradient of the Loss Function for LR
Deriving the Gradient Equation
Deriving the Gradient Equation (1/2)
Deriving the Gradient Equation (2/2)
Summary
Summary of Learning Methods
Non-linear Learning Methods
Softmax
• Recall

  p(y | x, θ) ∝ exp(θᵀ f(x, y))

  p(y | x, θ) = exp(θᵀ f(x, y)) / Σ_{y′=1..m} exp(θᵀ f(x, y′))

• Is this a valid probability distribution?
• Yes, using softmax

  p(y | x, θ) = softmax_y(θᵀ f(x, y))
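A minimal sketch of the softmax above over per-class scores θᵀf(x, y) (the scores are illustrative; the max subtraction is a standard numerical-stability trick not shown on the slide):

```python
import numpy as np

def softmax(scores):
    z = np.asarray(scores, dtype=float)
    z = z - z.max()               # stability: does not change the result
    e = np.exp(z)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))   # sums to 1; the largest score gets the largest probability
```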
Stochastic Gradient Descent (SGD)

[Example from Karl Stratos]


SGD for Linear Regression

[Example from Karl Stratos]


Logistic Regression - Summary

[Example from Greg Durrett]


NLP
Introduction to NLP

515.
Perceptrons
The Perceptron
• A simple but very important
(discriminative) classifier
• Model of a neuron
– Input excitations
– If excitation > inhibition, send an
electrical signal out the axon
• Earliest neural network
– invented in 1957 by Frank
Rosenblatt at Cornell
Perceptron Idea
[Figure: inputs x0, x1, x2 are multiplied by weights w0, w1, w2 and summed (Σ); if the sum exceeds the threshold, the unit outputs 1 (yes), otherwise −1 (no).]

Question: Where can we get these weights?

So we can rewrite the weighted sum x0·w0 + x1·w1 + x2·w2 (the Σ above)

as a dot product of two vectors

  w · x
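The same rewrite in code (the weights, inputs, and threshold are illustrative):

```python
import numpy as np

# The perceptron's weighted sum is just a dot product.
w = np.array([0.5, -1.0, 2.0])   # w0, w1, w2
x = np.array([1.0, 3.0, 2.0])    # x0, x1, x2
threshold = 0.0

z = np.dot(w, x)                 # w . x == w0*x0 + w1*x1 + w2*x2
output = 1 if z > threshold else -1
print(z, output)                 # 1.5  1
```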
Gradient Descent

https://en.wikipedia.org/wiki/Gradient_descent
Updating Parameters
• Basic approach
  – We want to increase the probability of the entire data set

    w_MLE = argmax_w Σ_i log P(yi | xi; w)

  – Gradient descent
  – Take the derivative of the log likelihood with respect to the parameters
  – Make a little change to the parameters in the direction the derivative tells you is uphill:

    w := w + α · ∂L(w)/∂w

• α here is the learning rate
  – how much do you want to change each time?
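A minimal sketch of this update rule applied to binary logistic regression (the per-example gradient (y − σ(w·x))·x, the toy data, and the learning rate are assumptions beyond what the slide shows):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-example gradient steps on the log likelihood, labels y in {0, 1}.
def sgd_logistic(data, n_features, alpha=0.1, epochs=100):
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in data:
            grad = (y - sigmoid(w.dot(x))) * x   # d/dw log P(y | x; w)
            w = w + alpha * grad                 # move "uphill" in log likelihood
    return w

data = [(np.array([1.0, 2.0]), 1), (np.array([1.0, -1.0]), 0)]  # (features, label)
print(sgd_logistic(data, n_features=2))
```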
Non-linearities: sigmoid, tanh
Notes about perceptrons
• Bag of words is not enough

• If the data is separable, the perceptron will separate it.


• Automatic differentiation (codegen)
NLP
Introduction to NLP

381.
Sentiment Analysis
Reviews of 1Q84 by Haruki Murakami
• “1Q84 is a tremendous feat and a triumph . . . A must-read for anyone who
wants to come to terms with contemporary Japanese culture.”
—Lindsay Howell, Baltimore Examiner
• “Perhaps one of the most important works of science fiction of the year . . .
1Q84 does not disappoint . . . [It] envelops the reader in a shifting world of
strange cults and peculiar characters that is surreal and entrancing.”
—Matt Staggs, Suvudu.com
• “Ambitious, sprawling and thoroughly stunning . . . Orwellian dystopia, sci-fi,
the modern world (terrorism, drugs, apathy, pop novels)—all blend in this
dreamlike, strange and wholly unforgettable epic.”
—Kirkus Reviews (starred review)
Sentiment about Companies
Product Reviews
Introduction
• Many posts, blogs
• Expressing personal opinions
• Research questions
– Subjectivity analysis
– Polarity analysis (positive/negative, number of stars)
– Viewpoint analysis (Chelsea vs. Manchester United, republican vs.
democrat)
• Sentiment target
– entity
– aspect
Introduction
• Level of granularity
– Document
– Sentence
– Attribute
• Opinion words
– Base
– Comparative (better, slower)
• Negation analysis
– Just counting negative words is not enough
Reviews of 1Q84 by Haruki Murakami
• “1Q84 is a tremendous feat and a triumph . . . A must-read for anyone who
wants to come to terms with contemporary Japanese culture.”
—Lindsay Howell, Baltimore Examiner
• “Perhaps one of the most important works of science fiction of the year . . .
1Q84 does not disappoint . . . [It] envelops the reader in a shifting world of
strange cults and peculiar characters that is surreal and entrancing.”
—Matt Staggs, Suvudu.com
• “Ambitious, sprawling and thoroughly stunning . . . Orwellian dystopia, sci-fi,
the modern world (terrorism, drugs, apathy, pop novels)—all blend in this
dreamlike, strange and wholly unforgettable epic.”
—Kirkus Reviews (starred review)
Observations
• “raise”
– can be negative or positive depending on the object
• “very”
– enhances the polarity of the adjective or adverb
• “advanced”, “progressive”
• http://www.theysay.io/
– claim to have 65,000 manual patterns
SA as a Classification Problem
• Set of features
– Words
• Presence is more important than frequency
– Punctuation
– Phrases
– Syntax
• A lot of training data is available
– E.g., movie review sentences and stars
• Techniques
– MaxEnt, SVM, Naive Bayes, RNN
Results

https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md
Results
Difficult Problems

• Subtlety
• Concession
• Manipulation
• Sarcasm and irony
NLP
Introduction to NLP

382.
Sentiment Lexicons
Sentiment Lexicons
• SentiWordNet
– http://sentiwordnet.isti.cnr.it/
• General Inquirer
– 2,000 positive words and 2,000 negative words
– http://www.wjh.harvard.edu/~inquirer/
• LIWC
– http://liwc.wpengine.com/
• Bing Liu’s opinion dataset
– http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
• MPQA subjectivity lexicon
– http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
General Inquirer
• Annotations
– Strong Power Weak Submit Active Passive Pleasur Pain Feel Arousal EMOT Virtue Vice Ovrst Undrst
Academ Doctrin Econ@ Exch ECON Exprsv Legal Milit Polit@ POLIT Relig Role COLL Work Ritual
SocRel Race Kin@ MALE Female Nonadlt HU ANI PLACE Social Region Route Aquatic Land Sky
Object Tool Food Vehicle BldgPt ComnObj NatObj BodyPt ComForm COM Say Need Goal Try
Means Persist Complet Fail NatrPro Begin Vary Increas Decreas Finish Stay Rise Exert Fetch Travel
Fall Think Know Causal Ought Perceiv Compare Eval@ EVAL Solve Abs@ ABS Quality Quan NUMB
ORD CARD FREQ DIST Time@ TIME Space POS DIM Rel COLOR Self Our You Name Yes No
Negate Intrj IAV DAV SV IPadj IndAdj PowGain PowLoss PowEnds PowAren PowCon PowCoop
PowAuPt PowPt PowDoct PowAuth PowOth PowTot RcEthic RcRelig RcGain RcLoss RcEnds RcTot
RspGain RspLoss RspOth RspTot AffGain AffLoss AffPt AffOth AffTot WltPt WltTran WltOth WltTot
WlbGain WlbLoss WlbPhys WlbPsyc WlbPt WlbTot EnlGain EnlLoss EnlEnds EnlPt EnlOth EnlTot
SklAsth SklPt SklOth SklTot TrnGain TrnLoss TranLw MeansLw EndsLw ArenaLw PtLw Nation
Anomie NegAff PosAff SureLw If NotLw TimeSpc
• http://www.webuse.umd.edu:9090/tags
– Positive: able, accolade, accuracy, adept, adequate…
– Negative: addiction, adversity, adultery, affliction, aggressive…
Dictionary-based Methods

• Start from known seeds


– e.g., happy, angry
• Expand using WordNet
– synonyms
– hypernyms
• Random-walk based methods
– words with known polarity as absorbing boundary
Automatic Extraction of Sentiment
Words
• Semi-supervised method
• Look for pairs of adjectives that appear together in a
conjunction

Vasileios Hatzivassiloglou and Kathleen R. McKeown, ACL 1997


Molistic

NACLO problem (2007)


brastic

11 cluvious

molistic
3
1 danty

slatty 4

5 cloovy

blitty
10
struffy

weasy 8
9
strungy
6

sloshful 8
pleasure to watch
7

frumsy
PMI (Turney)
• PMI = pointwise mutual information
• Check how often a given unlabeled word appears with a known positive
word (“excellent”)
• Same for a known negative word (“poor”)

  PMI(word1, word2) = log2 [ hits(word1 NEAR word2) / (hits(word1)·hits(word2)) ]
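A minimal sketch of Turney-style polarity scoring using the PMI formula above (the hit counts are illustrative stand-ins for search-engine hit counts; the polarity score PMI(word, "excellent") − PMI(word, "poor") follows the description above):

```python
import math

def pmi(hits_near, hits_w1, hits_w2):
    return math.log2(hits_near / (hits_w1 * hits_w2))

def polarity(word_hits, near_excellent, near_poor, excellent_hits, poor_hits):
    # Positive score: the word patterns with "excellent" more than with "poor".
    return (pmi(near_excellent, word_hits, excellent_hits) -
            pmi(near_poor, word_hits, poor_hits))

# toy usage: a word that co-occurs more often with "excellent" than with "poor"
print(polarity(word_hits=1_000, near_excellent=120, near_poor=30,
               excellent_hits=50_000, poor_hits=40_000))   # > 0, i.e., positive polarity
```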