
Text Classification and Naive Bayes
The Task of Text Classification
Is this spam?
Who wrote which Federalist papers?
1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
Authorship of 12 of the letters in dispute
1963: solved by Mosteller and Wallace using
Bayesian methods

[Portraits: James Madison and Alexander Hamilton]


What is the subject of this medical article?
MEDLINE Article → MeSH Subject Category Hierarchy
◦ Antagonists and Inhibitors
◦ Blood Supply
◦ Chemistry
◦ Drug Therapy
◦ Embryology
◦ Epidemiology

Positive or negative movie review?

+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...

Why sentiment analysis?

Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment
Scherer Typology of Affective States
Emotion: brief organically synchronized … evaluation of a major event
◦ angry, sad, joyful, fearful, ashamed, proud, elated
Mood: diffuse non-caused low-intensity long-duration change in subjective feeling
◦ cheerful, gloomy, irritable, listless, depressed, buoyant
Interpersonal stances: affective stance toward another person in a specific interaction
◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous
Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
◦ liking, loving, hating, valuing, desiring
Personality traits: stable personality dispositions and typical behavior tendencies
◦ nervous, anxious, reckless, morose, hostile, jealous
Basic Sentiment Classification

Sentiment analysis is the detection of attitudes
Simple task we focus on in this chapter:
◦ Is the attitude of this text positive or negative?
We return to affect classification in later chapters
Summary: Text Classification

Sentiment analysis
Spam detection
Authorship identification
Language Identification
Assigning subject categories, topics, or genres

Text Classification: definition

Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c ∈ C


Classification Methods: Hand-coded rules

Rules based on combinations of words or other features
◦ spam: black-list-address OR (“dollars” AND “you have been selected”)
Accuracy can be high
◦ If rules carefully refined by expert
But building and maintaining these rules is expensive
Classification Methods:
Supervised Machine Learning
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
◦ A training set of m hand-labeled documents (d1,c1), ..., (dm,cm)
Output:
◦ a learned classifier γ: d → c

Classification Methods:
Supervised Machine Learning
Any kind of classifier
◦ Naïve Bayes
◦ Logistic regression
◦ Neural networks
◦ k-Nearest Neighbors
◦ …
Text Classification and Naive Bayes
The Task of Text Classification
Text Classification and Naive Bayes
The Naive Bayes Classifier
Naive Bayes Intuition

Simple ("naive") classification method based on


Bayes rule
Relies on very simple representation of document
◦ Bag of words
The Bag of Words Representation

Example review: "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

Bag-of-words counts:
it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, ...
The bag of words representation

γ( seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ... ) = c
Bayes’ Rule Applied to Documents and Classes

For a document d and a class c:

P(c | d) = P(d | c) P(c) / P(d)
Naive Bayes Classifier (I)

cMAP = argmax_{c∈C} P(c | d)                      MAP is "maximum a posteriori" = most likely class
     = argmax_{c∈C} P(d | c) P(c) / P(d)          Bayes rule
     = argmax_{c∈C} P(d | c) P(c)                 Dropping the denominator
Naive Bayes Classifier (II)
"Likelihood" "Prior"

cMAP = argmax P(d | c)P(c)


c∈C
Document d
represented as
= argmax P(x1, x2 ,…, xn | c)P(c) features
c∈C x1..xn
Naïve Bayes Classifier (IV)

cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

P(x1, x2, …, xn | c): O(|X|^n • |C|) parameters; could only be estimated if a very, very large number of training examples was available.
P(c): How often does this class occur? We can just count the relative frequencies in a corpus.
Multinomial Naive Bayes Independence
Assumptions
P(x1, x2, …, xn | c)

Bag of Words assumption: assume position doesn't matter
Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c:

P(x1, …, xn | c) = P(x1 | c) • P(x2 | c) • P(x3 | c) • ... • P(xn | c)


Multinomial Naive Bayes Classifier

cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

cNB = argmax_{c∈C} P(cj) ∏_{x∈X} P(x | c)
Applying Multinomial Naive Bayes Classifiers
to Text Classification

positions ← all word positions in test document

cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
Problems with multiplying lots of probs
There's a problem with this:

cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)

Multiplying lots of probabilities can result in floating-point underflow!
.0006 * .0007 * .0009 * .01 * .5 * .000008 ...
Idea: Use logs, because log(ab) = log(a) + log(b)
We'll sum logs of probabilities instead of multiplying probabilities!

We actually do everything in log space
Instead of this:  cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
This:             cNB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]

Notes:
1) Taking log doesn't change the ranking of classes!
The class with highest probability also has highest log probability!
2) It's a linear model:
Just a max of a sum of weights: a linear function of the inputs
So naive Bayes is a linear classifier
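A minimal sketch of this log-space decision rule, assuming log_prior and log_likelihood have already been estimated from training counts (the function and variable names here are illustrative, not from the slides):

    import math

    def predict(doc_tokens, classes, log_prior, log_likelihood):
        """Return the class maximizing log P(c) + sum_i log P(x_i | c).

        log_prior[c] is log P(c); log_likelihood[c][w] is log P(w | c).
        Tokens missing from the vocabulary are skipped (see "Unknown words" below).
        """
        best_class, best_score = None, -math.inf
        for c in classes:
            score = log_prior[c]
            for w in doc_tokens:
                if w in log_likelihood[c]:          # ignore unknown words
                    score += log_likelihood[c][w]
            if score > best_score:
                best_class, best_score = c, score
        return best_class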
Text Classification and Naive Bayes
The Naive Bayes Classifier
Text Classification and Naive Bayes
Naive Bayes: Learning
Sec.13.3

Learning the Multinomial Naive Bayes Model

First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data

P̂(cj) = N_cj / N_total

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
Parameter estimation

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
            = fraction of times word wi appears among all words in documents of topic cj

Create mega-document for topic j by concatenating all docs in this topic
◦ Use frequency of w in mega-document
Sec.13.3

Problem with Maximum Likelihood

What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?

P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0

Zero probabilities cannot be conditioned away, no matter the other evidence!

cMAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Laplace (add-1) smoothing for Naïve Bayes

P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)

           = (count(wi, c) + 1) / ( (Σ_{w∈V} count(w, c)) + |V| )
Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary

• Calculate P(cj) terms
◦ For each cj in C do
    docsj ← all docs with class = cj
    P(cj) ← |docsj| / |total # documents|

• Calculate P(wk | cj) terms
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
    P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
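A compact sketch of this training procedure with add-α smoothing (α = 1 gives add-1 smoothing); the function and variable names are illustrative, not from the slides:

    import math
    from collections import Counter

    def train_multinomial_nb(docs, labels, alpha=1.0):
        """docs: list of token lists; labels: list of class labels.
        Returns log_prior[c] and log_likelihood[c][w] with add-alpha smoothing."""
        classes = set(labels)
        vocab = {w for doc in docs for w in doc}
        log_prior, log_likelihood = {}, {}
        for c in classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(w for d in class_docs for w in d)   # the "mega-document" for class c
            total = sum(counts.values())
            denom = total + alpha * len(vocab)
            log_likelihood[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
        return log_prior, log_likelihood, vocab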
Unknown words
What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocabulary?
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all!
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is
not generally helpful!
Stop words
Some systems ignore stop words
◦ Stop words: very frequent words like the and a.
◦ Sort the vocabulary by word frequency in training set
◦ Call the top 10 or 50 words the stopword list.
◦ Remove all stop words from both training and test sets
◦ As if they were never there!

But removing stop words doesn't usually help
• So in practice most NB algorithms use all words and don't use stopword lists
Text Classification and Naive Bayes
Naive Bayes: Learning
Text Classification and Naive Bayes
Sentiment and Binary Naive Bayes
Let's do a worked sentiment example!

We'll use a sentiment analysis domain with the two classes positive (+) and negative (-), and take the following miniature training and test documents from actual movie reviews.

Cat       Documents
Training
  -       just plain boring
  -       entirely predictable and lacks energy
  -       no surprises and very few laughs
  +       very powerful
  +       the most fun film of the summer
Test
  ?       predictable with no fun
A worked sentiment example with add-1 smoothing

Cat       Documents
Training
  -       just plain boring
  -       entirely predictable and lacks energy
  -       no surprises and very few laughs
  +       very powerful
  +       the most fun film of the summer
Test
  ?       predictable with no fun

1. Prior from training:
P̂(c) = N_c / N_doc
P(-) = 3/5    P(+) = 2/5

2. Drop "with"
(the word with doesn't occur in the training set, so we drop it completely; as mentioned above, we don't use unknown word models for naive Bayes)

3. Likelihoods from training:
P̂(wi | c) = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)

P("predictable" | -) = (1 + 1) / (14 + 20)      P("predictable" | +) = (0 + 1) / (9 + 20)
P("no" | -)          = (1 + 1) / (14 + 20)      P("no" | +)          = (0 + 1) / (9 + 20)
P("fun" | -)         = (0 + 1) / (14 + 20)      P("fun" | +)         = (1 + 1) / (9 + 20)

4. Scoring the test set:
For the test sentence S = "predictable with no fun", after removing "with", the chosen class is computed as follows:

P(-) P(S | -) = 3/5 × (2 × 2 × 1) / 34³ = 6.1 × 10⁻⁵
P(+) P(S | +) = 2/5 × (1 × 1 × 2) / 29³ = 3.2 × 10⁻⁵

So the model predicts the class negative for the test sentence.
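The same arithmetic can be checked in a few lines of Python (a sketch that just reproduces the counts above; the variable names are mine):

    # Add-1 smoothed likelihoods from the training counts above:
    # the negative class has 14 tokens, the positive class has 9, and |V| = 20.
    p_neg = 3/5 * ((1+1)/(14+20)) * ((1+1)/(14+20)) * ((0+1)/(14+20))  # predictable, no, fun
    p_pos = 2/5 * ((0+1)/(9+20))  * ((0+1)/(9+20))  * ((1+1)/(9+20))
    print(f"P(-)P(S|-) = {p_neg:.2e}")   # ~6.1e-05
    print(f"P(+)P(S|+) = {p_pos:.2e}")   # ~3.2e-05
    print("prediction:", "-" if p_neg > p_pos else "+")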
Optimizing for sentiment analysis
For tasks like sentiment, word occurrence seems to
be more important than word frequency.
◦ The occurrence of the word fantastic tells us a lot
◦ The fact that it occurs 5 times may not tell us much more.
Binary multinomial naive Bayes, or binary NB
◦ Clip our word counts at 1
◦ Note: this is different from Bernoulli naive Bayes; see the textbook at the end of the chapter.
Binary Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary

• Remove duplicates in each doc:
◦ For each word type w in docj
    ◦ Retain only a single instance of w

• Calculate P(cj) terms
◦ For each cj in C do
    docsj ← all docs with class = cj
    P(cj) ← |docsj| / |total # documents|

• Calculate P(wk | cj) terms
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
    P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
Binary Multinomial Naive Bayes
on a test document d
First remove all duplicate words from d
Then compute NB using the same equation:

cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(wi | cj)
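A tiny sketch of the per-document deduplication step that binary NB adds (the helper name is mine, not from the slides):

    def binarize(doc_tokens):
        """Keep at most one instance of each word type within a document,
        preserving first-occurrence order."""
        seen, out = set(), []
        for w in doc_tokens:
            if w not in seen:
                seen.add(w)
                out.append(w)
        return out

    # Training and test documents are binarized per document;
    # counts across documents can still exceed 1 (e.g., "great" in the next example).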

Binary multinomial naive Bayes

Counts are shown without add-1 smoothing to make the differences clearer. Note that the resulting counts need not be 1: the word great has a count of 2 even for Binary NB, because it appears in multiple documents.

Four original documents:
-  it was pathetic the worst part was the boxing scenes
-  no plot twists or great scenes
+  and satire and great plot twists
+  great scenes great film

After per-document binarization:
-  it was pathetic the worst part boxing scenes
-  no plot twists or great scenes
+  and satire great plot twists
+  great scenes film

            NB Counts     Binary Counts
Word          +    -         +    -
and           2    0         1    0
boxing        0    1         0    1
film          1    0         1    0
great         3    1         2    1
it            0    1         0    1
no            0    1         0    1
or            0    1         0    1
part          0    1         0    1
pathetic      0    1         0    1
plot          1    1         1    1
satire        1    0         1    0
scenes        1    2         1    2
the           0    2         0    1
twists        1    1         1    1
was           0    2         0    1
worst         0    1         0    1

Counts can still be 2! Binarization is within-doc!
Text Classification and Naive Bayes
Sentiment and Binary Naive Bayes
Text Classification and Naive Bayes
More on Sentiment Classification
Sentiment Classification: Dealing with Negation
I really like this movie
I really don't like this movie

Negation changes the meaning of "like" to negative.


Negation can also change negative to positive-ish
◦ Don't dismiss this film
◦ Doesn't let us get bored
Sentiment Classification: Dealing with Negation
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In
Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using
Machine Learning Techniques. EMNLP-2002, 79—86.

Simple baseline method:
Add NOT_ to every word between negation and following punctuation:

didn’t like this movie , but I
didn’t NOT_like NOT_this NOT_movie but I
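A rough sketch of this baseline; the negation word list and scope rule here are simplified assumptions, not the exact Das & Chen or Pang et al. implementations:

    import re

    NEGATION = re.compile(r"^(?:not|no|never|n't|didn't|doesn't|don't|isn't|wasn't)$", re.IGNORECASE)
    PUNCT = re.compile(r"^[.,;:!?]$")

    def mark_negation(tokens):
        """Prefix NOT_ to every token between a negation word and the next punctuation."""
        out, in_scope = [], False
        for tok in tokens:
            if PUNCT.match(tok):
                in_scope = False
                out.append(tok)
            elif in_scope:
                out.append("NOT_" + tok)
            else:
                out.append(tok)
                if NEGATION.match(tok):
                    in_scope = True
        return out

    print(mark_negation("didn't like this movie , but I".split()))
    # ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']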


Sentiment Classification: Lexicons
Sometimes we don't have enough labeled training
data
In that case, we can make use of pre-built word lists
Called lexicons
There are various publicly available lexicons
MPQA Subjectivity Cues Lexicon
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in
Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.

Home page: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/


6885 words from 8221 lemmas, annotated for intensity (strong/weak)
◦ 2718 positive
◦ 4912 negative
+ : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great
− : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate

The General Inquirer
Philip J. Stone, Dexter C Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General
Inquirer: A Computer Approach to Content Analysis. MIT Press

◦ Home page: http://www.wjh.harvard.edu/~inquirer


◦ List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
◦ Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
Categories:
◦ Positiv (1915 words) and Negativ (2291 words)
◦ Strong vs Weak, Active vs Passive, Overstated versus Understated
◦ Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc
Free for Research Use
Using Lexicons in Sentiment Classification
Add a feature that gets a count whenever a word
from the lexicon occurs
◦ E.g., a feature called "this word occurs in the positive
lexicon" or "this word occurs in the negative lexicon"
Now all positive words (good, great, beautiful,
wonderful) or negative words count for that feature.
Using 1-2 features isn't as good as using all the words.
• But when training data is sparse or not representative of the
test set, dense lexicon features can help
Naive Bayes in Other tasks: Spam Filtering
SpamAssassin Features:
◦ Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
◦ From: starts with many numbers
◦ Subject is all capitals
◦ HTML has a low ratio of text to image area
◦ "One hundred percent guaranteed"
◦ Claims you can be removed from the list
Naive Bayes in Language ID
Determining what language a piece of text is written in.
Features based on character n-grams do very well
Important to train on lots of varieties of each language
(e.g., American English varieties like African-American English,
or English varieties around the world like Indian English)
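A sketch of how character n-gram features might be extracted for such a language ID classifier (illustrative only; the slides don't specify an implementation):

    from collections import Counter

    def char_ngrams(text, n_min=1, n_max=3):
        """Counter of character 1- to 3-grams to use as naive Bayes features."""
        text = " " + text.lower() + " "          # pad so word boundaries become features
        grams = Counter()
        for n in range(n_min, n_max + 1):
            for i in range(len(text) - n + 1):
                grams[text[i:i + n]] += 1
        return grams

    print(char_ngrams("This is English").most_common(5))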
Summary: Naive Bayes is Not So Naive
Very fast, low storage requirements
Works well with very small amounts of training data
Robust to irrelevant features
◦ Irrelevant features cancel each other without affecting results
Very good in domains with many equally important features
◦ Decision trees suffer from fragmentation in such cases, especially with little data
Optimal if the independence assumptions hold
◦ If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
A good dependable baseline for text classification
◦ But we will see other classifiers that give better accuracy
Slide from Chris Manning
Text Classification and Naive Bayes
More on Sentiment Classification
Text Classification and Naive Bayes
Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes

c=+

X1=I X2=love X3=this X4=fun X5=film

Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature
◦ URL, email address, dictionaries, network features
But if, as in the previous slides,
◦ we use only word features
◦ we use all of the words in the text (not a subset)
Then
◦ Naive Bayes has an important similarity to language modeling.
Sec.13.2.1

Each class = a unigram language model

Assigning each word: P(word | c)
Assigning each sentence: P(s | c) = Π P(word | c)

Class pos:
0.1    I
0.1    love
0.01   this
0.05   fun
0.1    film

I      love   this   fun    film
0.1    0.1    0.01   0.05   0.1

P(s | pos) = 0.0000005
Sec.13.2.1

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Model pos            Model neg
0.1    I             0.2     I
0.1    love          0.001   love
0.01   this          0.01    this
0.05   fun           0.005   fun
0.1    film          0.1     film

        I      love    this   fun     film
pos:    0.1    0.1     0.01   0.05    0.1
neg:    0.2    0.001   0.01   0.005   0.1

P(s | pos) > P(s | neg)
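The comparison is just a product of per-word class probabilities; a tiny sketch (the probability tables are copied from the slide above):

    pos = {"I": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1}
    neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

    def sentence_prob(tokens, model):
        """Unigram class language model: multiply P(word | class) over the sentence."""
        p = 1.0
        for w in tokens:
            p *= model[w]
        return p

    s = "I love this fun film".split()
    print(sentence_prob(s, pos), sentence_prob(s, neg))   # P(s|pos) > P(s|neg)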
Text Classification and Naive Bayes
Naïve Bayes: Relationship to Language Modeling
Text Classification and Naive Bayes
Precision, Recall, and F measure
Evaluation
Let's consider just binary text classification tasks
Imagine you're the CEO of Delicious Pie Company
You want to know what people are saying about
your pies
So you build a "Delicious Pie" tweet detector
◦ Positive class: tweets about Delicious Pie Co
◦ Negative class: all other tweets
The 2-by-2 confusion matrix

(Accuracy doesn't work well when the classes are unbalanced, as indeed they are with spam, which is a large majority of email, or with tweets, which are mainly not about pie.)

                              gold standard labels
                              gold positive        gold negative
system output    system
labels           positive     true positive        false positive       precision = tp / (tp + fp)
                 system
                 negative     false negative       true negative

recall = tp / (tp + fn)            accuracy = (tp + tn) / (tp + fp + tn + fn)

Figure 4.4: A confusion matrix for visualizing how well a binary classification system performs against gold standard labels.
Evaluation: Accuracy
Why don't we use accuracy as our metric?
Imagine we saw 1 million tweets
◦ 100 of them talked about Delicious Pie Co.
◦ 999,900 talked about something else
We could build a dumb classifier that just labels every
tweet "not about pie"
◦ It would get 99.99% accuracy!!! Wow!!!!
◦ But useless! Doesn't return the comments we are looking for!
◦ That's why we use precision and recall instead
Evaluation: Precision

% of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels)

Precision = true positives / (true positives + false positives)
Evaluation: Recall

% of items actually present in the input that were correctly identified by the system

Recall = true positives / (true positives + false negatives)
Why Precision and recall
Our dumb pie-classifier
◦ Just label nothing as "about pie"
Accuracy=99.99%
but
Recall = 0
◦ (it doesn't get any of the 100 Pie tweets)
Precision and recall, unlike accuracy, emphasize true
positives:
◦ finding the things that we are supposed to be looking for.
A combined measure: F

There are many ways to define a single metric that incorporates aspects of both precision and recall. The simplest of these combinations is the F measure (van Rijsbergen, 1975), a single number that combines P and R:

F_β = (β² + 1) P R / (β² P + R)

The β parameter differentially weights the importance of recall and precision, based perhaps on the needs of an application. Values of β > 1 favor recall, while values of β < 1 favor precision. When β = 1, precision and recall are equally balanced.

We almost always use balanced F1 (i.e., β = 1), the most frequently used metric, called F_{β=1} or just F1:

F1 = 2 P R / (P + R)
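These definitions translate directly into code; a small sketch that computes all three metrics from raw confusion counts (the function name and zero-division guards are mine):

    def precision_recall_f(tp, fp, fn, beta=1.0):
        """Precision, recall, and F_beta from confusion-matrix counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f_beta

    # The dumb "nothing is about pie" classifier: tp=0, fp=0, fn=100 -> recall 0, F1 0.
    print(precision_recall_f(tp=0, fp=0, fn=100))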
Development Test Sets ("Devsets") and Cross-validation

Training set Development Test Set Test Set

Train on training set, tune on devset, report on test set
◦ This avoids overfitting ('tuning to the test set')
◦ More conservative estimate of performance
◦ But paradox: want as much data as possible for training, and as much for dev; how to split?
Cross-validation: multiple splits

Pool results over splits, compute pooled dev performance

However, looking at the corpus to understand what's going on is important in designing NLP systems! What to do? For this reason, it is common to create a fixed training set and test set, then do 10-fold cross-validation inside the training set, but compute error rate the normal way in the test set, as shown below.

Training Iterations                            Testing
 1   Dev   Training
 2   Dev   Training
 3   Dev   Training
 4   Dev   Training
 5   Training   Dev   Training                 Test Set
 6   Training   Dev
 7   Training   Dev
 8   Training   Dev
 9   Training   Dev
10   Training   Dev
Text Classification and Naive Bayes
Precision, Recall, and F measure
Text Classification and Naive Bayes
Evaluation with more than two classes
Confusion Matrix for 3-class classification

(The naive Bayes algorithm is already a multi-class classification algorithm.)

                            gold labels
                     urgent    normal    spam
system    urgent        8        10        1        precision_u = 8 / (8+10+1)
output    normal        5        60       50        precision_n = 60 / (5+60+50)
          spam          3        30      200        precision_s = 200 / (3+30+200)

recall_u = 8 / (8+5+3)      recall_n = 60 / (10+60+30)      recall_s = 200 / (1+50+200)
How to combine P/R from 3 classes to get one metric

Macroaveraging:
◦ compute the performance for each class, and then
average over classes
Microaveraging:
◦ collect decisions for all classes into one confusion matrix
◦ compute precision and recall from that table.
Macroaveraging and Microaveraging

Class 1: Urgent                Class 2: Normal                Class 3: Spam                  Pooled
          true     true                  true     true                 true    true                  true   true
          urgent   not                   normal   not                  spam    not                   yes    no
system                         system                         system                        system
urgent      8       11         normal      60      55         spam      200     33          yes      268     99
system                         system                         system                        system
not         8      340         not         40     212         not        51     83          no        99    635

precision = 8/(8+11) = .42     precision = 60/(60+55) = .52   precision = 200/(200+33) = .86   microaverage precision = 268/(268+99) = .73

macroaverage precision = (.42 + .52 + .86) / 3 = .60

Figure 4.6: Separate confusion matrices for the 3 classes from the previous figure, showing the pooled confusion matrix and the microaveraged and macroaveraged precision.
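A sketch of the two averaging schemes over per-class confusion counts (the function name is mine; the counts in the usage line are the ones from the figure):

    def macro_micro_precision(per_class):
        """per_class: list of (tp, fp) pairs, one per class."""
        macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
        total_tp = sum(tp for tp, _ in per_class)
        total_fp = sum(fp for _, fp in per_class)
        micro = total_tp / (total_tp + total_fp)
        return macro, micro

    # urgent, normal, spam from Figure 4.6: (tp, fp) = (8, 11), (60, 55), (200, 33)
    print(macro_micro_precision([(8, 11), (60, 55), (200, 33)]))   # ~ (0.60, 0.73)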
Text Classification and Naive Bayes
Evaluation with more than two classes
Text Classification and Naive Bayes
Statistical Significance Testing
How do we know if one classifier is better than another?

Given:
◦ Classifiers A and B
◦ Metric M: M(A, x) is the performance of A on test set x
◦ δ(x): the performance difference between A and B on x:
    δ(x) = M(A, x) − M(B, x)
◦ We want to know if δ(x) > 0, meaning A is better than B
◦ δ(x) is called the effect size
◦ Suppose we look and see that δ(x) is positive. Are we done?
◦ No! This might be just an accident of this one test set, or circumstance of the experiment. Instead:
Statistical Hypothesis Testing

We need something more: we want to know if A's superiority would hold again if we checked another test set x′, or under some other circumstance.

In the paradigm of statistical hypothesis testing, we consider two hypotheses:
◦ Null hypothesis: A isn't better than B        H0: δ(x) ≤ 0
◦ A is better than B                            H1: δ(x) > 0

The null hypothesis H0 supposes that δ(x) is actually negative or zero, meaning that A is not better than B.
We want to confidently rule out the null hypothesis H0, and instead support H1, that A is better than B.
We do this by creating a random variable X ranging over all test sets.
And we ask: how likely is it, if the null hypothesis H0 was correct, that among these test sets we would encounter the value of δ(x) that we found?
• Formalized as the p-value: the probability, assuming H0 is true, of seeing the δ(x) that we saw or one even greater:

P(δ(X) ≥ δ(x) | H0 is true)
Statistical Hypothesis Testing

The p-value: the probability, assuming the null hypothesis H0 is true, of seeing the δ(x) that we saw or one even greater:

P(δ(X) ≥ δ(x) | H0 is true)

◦ In our example, this p-value is the probability that we would see δ(x), assuming H0 (= A is not better than B).
◦ If H0 is true but δ(x) is huge, that is surprising! Very low probability!
  (Say A has a very respectable F1 and B has a terrible F1 of only .2 on x: that would be extremely unlikely to occur if H0 were in fact true, so the p-value would be low. But if δ(x) is very small, it is less surprising even if H0 were true, so the p-value would be higher.)
◦ A very small p-value means that the difference we observed is very unlikely under the null hypothesis, and we can reject the null hypothesis.
◦ Very small: .05 or .01
◦ A result (e.g., "A is better than B") is statistically significant if the δ we saw has a probability that is below the threshold and we therefore reject this null hypothesis.
Statistical Hypothesis Testing
◦ How do we compute this probability?
◦ In NLP, we don't tend to use parametric tests (like t-tests)
◦ Instead, we use non-parametric tests based on sampling:
artificially creating many versions of the setup.
◦ For example, suppose we had created zillions of test sets x'.
◦ Now we measure the value of 𝛿(x') on each test set
◦ That gives us a distribution
◦ Now set a threshold (say .01).
◦ So if we see that in 99% of the test sets 𝛿(x) > 𝛿(x')
◦ We conclude that our original test set delta was a real delta and not an artifact.
Statistical Hypothesis Testing
Two common approaches:
◦ approximate randomization
◦ bootstrap test
Paired tests:
◦ Comparing two sets of observations in which each observation
in one set can be paired with an observation in another.
◦ For example, when looking at systems A and B on the same
test set, we can compare the performance of system A and B
on each same observation xi
Text Classification and Naive Bayes
Statistical Significance Testing
Text Classification and Naive Bayes
The Paired Bootstrap Test
Bootstrap test
Efron and Tibshirani, 1993

Can apply to any metric (accuracy, precision, recall, F1, etc.)
Bootstrap means to repeatedly draw large numbers of smaller samples with replacement (called bootstrap samples) from an original larger sample.
Bootstrap example

Consider a baby text classification example with a test set x of 10 documents, using accuracy as metric.
Suppose these are the results of systems A and B on x, with 4 possible outcomes per document (A & B both right, A & B both wrong, A right/B wrong, A wrong/B right):

[Table: one column per document, recording whether A and B were each correct on it]
       1  2  3  4  5  6  7  8  9  10      A%     B%     δ()
x      ...                                .70    .50    .20
Bootstrap example

Now we create many, say b = 10,000, virtual test sets x(i), each of size n = 10.
To make each x(i), we randomly select a cell from row x, with replacement, 10 times:

       1  2  3  4  5  6  7  8  9  10      A%     B%     δ()
x      ...                                .70    .50    .20
x(1)   ...                                .60    .60    .00
x(2)   ...                                .60    .70    -.10
...
x(b)

(Each cell records the correct or incorrect performance of classifiers A and B on that document. Of course real test sets don't have only 10 examples, and b needs to be large as well.)
Bootstrap example

Now we have a distribution! We can check how often A has an accidental advantage, to see if the original δ(x) we saw was very common.
(There are various ways to compute this advantage; here we follow the version laid out in Berg-Kirkpatrick et al. (2012).)
Now assuming H0, that A isn't better than B, we would normally expect δ(x′) = 0.
So we just count how many times the δ(x′) we found exceeds the expected 0 value by δ(x) or more:

p-value(x) = (1/b) Σ_{i=1..b} 1( δ(x(i)) − δ(x) ≥ 0 )

Bootstrap example

Alas, it's slightly more complicated.
Although it's generally true that the expected value of δ(X) over many test sets (again assuming A isn't better than B) is 0, this isn't true for the bootstrapped test sets we created.
We didn't draw these samples from a distribution with 0 mean; we created them from the original test set x, which happens to be biased (by .20) in favor of A.
So to measure how surprising our observed δ(x) is, we actually compute the p-value by counting how often δ(x′) exceeds the expected value of δ(x) by δ(x) or more:

p-value(x) = (1/b) Σ_{i=1..b} 1( δ(x(i)) − δ(x) ≥ δ(x) )
           = (1/b) Σ_{i=1..b} 1( δ(x(i)) ≥ 2δ(x) )
Bootstrap example

Suppose:
◦ We have 10,000 test sets x(i) and a threshold of .01
◦ And in only 47 of the test sets do we find that δ(x(i)) ≥ 2δ(x)
◦ The resulting p-value is .0047
◦ This is smaller than .01, indicating δ(x) is indeed sufficiently surprising
◦ And we reject the null hypothesis and conclude A is better than B.
Paired bootstrap example
After Berg-Kirkpatrick et al. (2012)

function BOOTSTRAP(test set x, num of samples b) returns p-value(x)
   Calculate δ(x)                    # how much better does algorithm A do than B on x
   s = 0
   for i = 1 to b do
      for j = 1 to n do              # Draw a bootstrap sample x(i) of size n
         Select a member of x at random and add it to x(i)
      Calculate δ(x(i))              # how much better does algorithm A do than B on x(i)
      s ← s + 1 if δ(x(i)) > 2δ(x)
   p-value(x) ≈ s / b                # on what % of the b samples did algorithm A beat expectations?
   return p-value(x)                 # if very few did, our observed δ is probably not accidental

Figure 4.9: A version of the paired bootstrap algorithm after Berg-Kirkpatrick et al. (2012).
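A direct Python rendering of this pseudocode (a sketch; it assumes per-example 0/1 correctness scores so that accuracy is the metric, and the function and parameter names are mine):

    import random

    def paired_bootstrap_pvalue(scores_a, scores_b, b=10000, seed=0):
        """scores_a[i], scores_b[i]: per-example scores (e.g., 1/0 correctness)
        of systems A and B on the same test item i.
        Returns the bootstrap p-value for the claim that A is better than B."""
        rng = random.Random(seed)
        n = len(scores_a)
        delta_x = (sum(scores_a) - sum(scores_b)) / n      # observed effect size on x
        s = 0
        for _ in range(b):
            idx = [rng.randrange(n) for _ in range(n)]     # bootstrap sample with replacement
            delta_i = sum(scores_a[j] - scores_b[j] for j in idx) / n
            if delta_i > 2 * delta_x:                      # beat expectations by delta(x) or more
                s += 1
        return s / b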
Text Classification and Naive Bayes
The Paired Bootstrap Test
Text Classification and Naive Bayes
Avoiding Harms in Classification
Harms in sentiment classifiers
Kiritchenko and Mohammad (2018) found that most
sentiment classifiers assign lower sentiment and
more negative emotion to sentences with African
American names in them.
This perpetuates negative stereotypes that
associate African Americans with negative emotions
Harms in toxicity classification
Toxicity detection is the task of detecting hate speech,
abuse, harassment, or other kinds of toxic language
But some toxicity classifiers incorrectly flag as being
toxic sentences that are non-toxic but simply mention
identities like blind people, women, or gay people.
This could lead to censorship of discussion about these
groups.
What causes these harms?
Can be caused by:
◦ Problems in the training data; machine learning systems
are known to amplify the biases in their training data.
◦ Problems in the human labels
◦ Problems in the resources used (like lexicons)
◦ Problems in model architecture (like what the model is trained to optimize)
Mitigation of these harms is an open research area
Meanwhile: model cards
Model Cards
(Mitchell et al., 2019)

For each algorithm you release, document:


◦ training algorithms and parameters
◦ training data sources, motivation, and preprocessing
◦ evaluation data sources, motivation, and preprocessing
◦ intended use and users
◦ model performance across different demographic or
other groups and environmental situations
Text Classification and Naive Bayes
Avoiding Harms in Classification
