
Natural Language Processing

CPSC 436N, Term 1
Lecture 5: Text Classification 1 - Traditional Approaches

Instructor: Vered Shwartz
https://www.cs.ubc.ca/~vshwartz
LM Evaluation: Goal
You may want to compare the performance (on the same test corpus) of:
• 2-grams vs. 3-grams
• two different smoothing techniques (given the same n-grams)
• count-based vs. neural language models
Model Evaluation: Key Ideas

A: Split the corpus into a training set and a test set.
B: Train the models (count-based and neural) on the training set, obtaining Q1 and Q2.
C: Apply the models to the test set and compare the results of Q1 and Q2.
Evaluation Metrics
Q1(s1 . . . sN) ? Q2(s1 . . . sN)
Probability of the sentences in the test set according to Q1 vs. according to Q2

Perplexity: PP(s1 . . . sN) = Q(s1 . . . sN)^(−1/N) = (1 / Π_{i=1..N} Q(si | si−1))^(1/N)   (bigram model)

Cross entropy: H(s1 . . . sN) = −(1/N) log Q(s1 . . . sN), so that PP(s1 . . . sN) = 2^H(P,Q)

Optional reading: J&M, chapter 3, section 3.8
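To make the relationship between the two metrics concrete, here is a minimal Python sketch; `bigram_prob(w, prev)` is a hypothetical callable returning a smoothed bigram estimate Q(w | prev), and a single `<s>` start symbol is assumed:

```python
import math

def bigram_perplexity(test_tokens, bigram_prob):
    """Perplexity of a bigram model Q over a tokenized test set."""
    log_prob = 0.0
    prev = "<s>"                                      # assumed sentence-start symbol
    for w in test_tokens:
        log_prob += math.log2(bigram_prob(w, prev))   # log2 Q(w | prev)
        prev = w
    cross_entropy = -log_prob / len(test_tokens)      # H = -(1/N) log2 Q(s1...sN)
    return 2 ** cross_entropy                         # PP = 2^H
```

Lower perplexity (equivalently, lower cross entropy) means the model assigns higher probability to the test set.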
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
Text Classification: Definition
(aka Text Categorization)

Input:
• A document d
• A fixed set of classes C = {c1, c2, …, cJ}, e.g. {Technology, Sports, Politics, …}

Output: a predicted class c ∈ C for document d, e.g. Technology
Topic Classification
E.g. what is the topic of this medical article?

d = MEDLINE article
C = MeSH Subject Category Hierarchy
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Spam Detection
d = email
C = {spam, not spam}
Authorship Attribution
E.g. who wrote which Federalist Papers?
• 1787-8: anonymous essays tried to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• The authorship of 12 of the letters was in dispute.
• 1964: “solved” by Mosteller and Wallace using Bayesian methods.

d = essay
C = {Jay, Madison, Hamilton}
Analyzing Author Attributes
E.g. is this person suffering from pre-symptomatic Alzheimer's?
(Otherwise it can only be detected with a costly MRI.)

d = the person’s description of a picture (the standard “Cookie Theft” test):
“there's a child climbing up getting cookies out of a cookie jar it's a boy and the girl is standing on the and his er stool is tipping over under him and his the girl is standing on the floor with her hand help up for a cookie you remember I lived in America so I use cookie”

C = {healthy, sick}
Language Identification

d = text
C = {Afrikaans, Albanian, Amharic, …, Hebrew, …, Yoruba, Zulu}
Sentiment Analysis
E.g. what sentiment does this review express towards the movie?

d = Review: “It could have been a great movie. It does have beautiful scenery, some of the best since Lord of the Rings. The acting is well done, and I really like the sone of the leader of the Samurai. He was a likeble chap, and I hated to see him die. But, other than that, this movie is nothing more that hidden rip-offs.”

C = {positive, negative} or C = {⭐, ⭐⭐, ⭐⭐⭐, ⭐⭐⭐⭐, ⭐⭐⭐⭐⭐}
Other Examples of Sentiment Analysis

• Restaurants: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
Classification Methods: Supervised Machine Learning
Given a fixed set of classes C = {c1, c2, …, cJ}, e.g. {pos, neg}

Training:
Input: a training set of m labeled documents (d1, c1), …, (dm, cm), e.g. (review1, neg), (review2, neg), (review3, pos), …
Output: a learned classifier (model) γ

Inference: γ(d) = ci
Input: a document d
Output: a class ci
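As an illustration of this training/inference interface, here is a minimal sketch using scikit-learn (an assumption; the deck itself is library-agnostic). The learner happens to be a bag-of-words Naïve Bayes model, which the next slides derive from scratch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set of (document, class) pairs
train_docs = ["just plain boring", "very powerful"]
train_labels = ["neg", "pos"]

gamma = make_pipeline(CountVectorizer(), MultinomialNB())  # the classifier γ
gamma.fit(train_docs, train_labels)                        # training

print(gamma.predict(["the most fun film of the summer"]))  # inference: γ(d) = ci
```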
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
Justification
• Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression: strong baseline; fundamental relation with MLPs; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
• Features for Sentiment Analysis
Naïve Bayes Intuition
We want to find the most likely class c given the document d:

C_MAP = argmax_{c∈C} P(c | d)        MAP is “maximum a posteriori” = most likely class

Simple ("naive") classification method based on Bayes rule:

      = argmax_{c∈C} P(d | c) P(c) / P(d)     (Bayes rule)
      = argmax_{c∈C} P(d | c) P(c)            (the denominator is constant across classes)
Naïve Bayes Classifier

C_MAP = argmax_{c∈C} P(d | c) P(c)                      ("likelihood" × "prior")
      = argmax_{c∈C} P(x1, x2, . . . , xn | c) P(c)     (x1, …, xn is the feature vector for d)

This requires O(|X|^n ⋅ |C|) parameters: a probability for each class c ∈ C and each configuration of the n features.
|X|^n can be very large, e.g. n = 20 features with 10 possible values each gives 10^20 configurations!
It can only be estimated given a very large number of training examples.
Solution: Naïve Bayes Independence Assumptions
Conditional independence: assume the feature probabilities P(xi | cj) are independent given the class c:

P(x1, x2, . . . , xn | c) = P(x1 | c) ⋅ P(x2 | c) ⋅ . . . ⋅ P(xn | c)

C_MAP = argmax_{c∈C} P(x1, x2, . . . , xn | c) P(c)     O(|X|^n ⋅ |C|) parameters
C_NB  = argmax_{c∈C} P(c) Π_{x∈X} P(x | c)              O(n ⋅ |X| ⋅ |C|) parameters
Naïve Bayes for Text Classification
Features: all the words in the document

C_NB = argmax_{c∈C} P(c) Π_{i=1,…,|d|} P(wi | c)

d = “Best movie of the summer”, C = {pos, neg}

score_pos = P(pos) ⋅ P(best | pos) ⋅ P(movie | pos) ⋅ P(of | pos) ⋅ P(the | pos) ⋅ P(summer | pos)
score_neg = P(neg) ⋅ P(best | neg) ⋅ P(movie | neg) ⋅ P(of | neg) ⋅ P(the | neg) ⋅ P(summer | neg)

Problem: multiplying probabilities may result in floating-point underflow!
E.g.: .0006 * .0007 * .0009 * .01 * .5 * .000008…
Solution: sum the log probabilities (log(ab) = log(a) + log(b)).
E.g.: log(.0006) + log(.0007) + log(.0009) + log(.01) + log(.5) + log(.000008)…
Naïve Bayes for Text Classification

Notes:
1) Taking the log doesn't change the ranking of classes: the class with the highest probability also has the highest log probability.
2) It's a linear model: just a max of a sum of weights, i.e. a linear combination of the inputs. So Naïve Bayes is a linear classifier.
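A minimal sketch of this log-space scoring, assuming hypothetical dictionaries `log_prior[c]` = log P(c) and `log_likelihood[c][w]` = log P(w | c) estimated as on the following slides:

```python
import math

def nb_classify(doc_tokens, log_prior, log_likelihood):
    """Return the class with the highest log P(c) + sum_i log P(w_i | c)."""
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for w in doc_tokens:
            if w in log_likelihood[c]:           # unknown words are skipped (see later slide)
                score += log_likelihood[c][w]    # sum of logs instead of a product
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```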
Training a Naïve Bayes Model (Sec. 13.3)

Maximum likelihood estimates: use the frequencies in the data.

P̂(cj) = N_cj / N_total = (number of docs of class j) / (total number of docs)

Cat | Documents
Train:
  −   just plain boring
  −   entirely predictable and lacks energy
  −   no surprises and very few laughs
  +   very powerful
  +   the most fun film of the summer
Test:
  ?   predictable with no fun

P(+) = ?
Training a Naïve Bayes Model (Sec. 13.3)

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
= the fraction of times word wi appears among all words in documents of class cj

Cat | Documents
Train:
  −   just plain boring
  −   entirely predictable and lacks energy
  −   no surprises and very few laughs
  +   very powerful
  +   the most fun film of the summer
Test:
  ?   predictable with no fun

P(the | +) = ?
A. 0
B. 1/5
C. 2/9
Stop Words

Some systems remove stop words: very frequent words like “the” and “a”.

Method 1: use a list of stop words.
Method 2: remove the top k most frequent words in the training set.

Note: removing stop words doesn't usually help, so in practice most NB implementations don't remove them.
Unknown Words
• Problem: unknown words may appear in the test set but not in the training set (and vocabulary).

• Solution: ignore them, i.e. remove them from the test document.

• Aren’t smoothing and removing unknown words redundant?
  • No! Smoothing makes sure that we have estimates for each w ∈ V given each class, even if w (e.g. “worst”) didn’t appear in class c (e.g. positive). Removing words from the test document is a way to handle out-of-vocabulary words.
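A tiny sketch of this step, with an illustrative vocabulary subset (not the full V of the running example):

```python
vocab = {"predictable", "no", "fun", "the", "just"}    # illustrative subset of V
test_tokens = "predictable with no fun".split()

known_tokens = [w for w in test_tokens if w in vocab]  # "with" is dropped
print(known_tokens)                                    # ['predictable', 'no', 'fun']
```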
Naïve Bayes Training (with Add-1 Smoothing)

Cat | Documents
Train:
  −   just plain boring
  −   entirely predictable and lacks energy
  −   no surprises and very few laughs
  +   very powerful
  +   the most fun film of the summer
Test:
  ?   predictable with no fun

1⃣ Extract the vocabulary: V = {just, plain, . . . , summer} ⟹ |V| = 20

2⃣ Estimate the class priors: P̂(cj) = N_cj / N_total
   P(−) = 3/5, P(+) = 2/5

3⃣ Compute the likelihoods: P̂(wi | cj) = (count(wi, cj) + 1) / ((Σ_{w∈V} count(w, cj)) + |V|)
   P(predictable | −) = (1+1)/(14+20) = 2/34     P(predictable | +) = (0+1)/(9+20) = 1/29
   P(no | −) = (1+1)/(14+20) = 2/34              P(no | +) = (0+1)/(9+20) = 1/29
   P(fun | −) = (0+1)/(14+20) = 1/34             P(fun | +) = (1+1)/(9+20) = 2/29
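A minimal sketch of these three steps on the toy corpus above, assuming whitespace tokenization (variable names are illustrative):

```python
from collections import Counter

train = [("-", "just plain boring"),
         ("-", "entirely predictable and lacks energy"),
         ("-", "no surprises and very few laughs"),
         ("+", "very powerful"),
         ("+", "the most fun film of the summer")]

# 1. Extract the vocabulary V
vocab = {w for _, doc in train for w in doc.split()}            # |V| = 20

# 2. Estimate class priors P(c) = N_c / N_total
n_docs = Counter(c for c, _ in train)
prior = {c: n_docs[c] / len(train) for c in n_docs}             # P(-) = 3/5, P(+) = 2/5

# 3. Estimate likelihoods with add-1 smoothing:
#    P(w | c) = (count(w, c) + 1) / (total words in class c + |V|)
word_counts = {c: Counter() for c in n_docs}
for c, doc in train:
    word_counts[c].update(doc.split())
likelihood = {c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
                  for w in vocab}
              for c in n_docs}

print(likelihood["-"]["predictable"])   # 2/34, as computed on the slide
```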
Naïve Bayes Inference

P(−) = 3/5, P(+) = 2/5
P(predictable | −) = 2/34     P(predictable | +) = 1/29
P(no | −) = 2/34              P(no | +) = 1/29
P(fun | −) = 1/34             P(fun | +) = 2/29

1⃣ Remove unknown words: “Predictable with no fun” → “predictable no fun” (“with” is out of vocabulary)

2⃣ Compute class posteriors (up to the constant denominator):
P(− | predictable with no fun) ∝ (3/5) ⋅ (2/34) ⋅ (2/34) ⋅ (1/34) = 6.1 × 10⁻⁵
P(+ | predictable with no fun) ∝ (2/5) ⋅ (1/29) ⋅ (1/29) ⋅ (2/29) = 3.2 × 10⁻⁵

3⃣ Predict the class with the higher score (here: −)
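Continuing the training sketch from the previous slide (reusing its `vocab`, `prior`, and `likelihood`), the inference step looks like this:

```python
test_doc = "predictable with no fun"

# 1. Drop out-of-vocabulary words ("with" is not in V)
tokens = [w for w in test_doc.split() if w in vocab]

# 2. Unnormalized class posteriors: P(c) * prod_i P(w_i | c)
scores = {c: prior[c] for c in prior}
for c in scores:
    for w in tokens:
        scores[c] *= likelihood[c][w]
# scores["-"] ≈ 6.1e-5, scores["+"] ≈ 3.2e-5

# 3. Predict the class with the higher score
print(max(scores, key=scores.get))      # "-"
```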
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of features: URLs, email addresses, dictionaries, …
• If the features = all of the words in the text,
• then Naïve Bayes is similar to language modeling:

C_NB = argmax_{cj∈C} P(cj) Π_{i=1,…,n} P(wi | cj)

The highlighted formula may be interpreted as:
A. The bigram model for class cj
B. The unigram model for class cj
C. Neither A nor B
Naïve Bayes and Language Modeling

C_NB = argmax_{cj∈C} P(cj) Π_{i=1,…,n} P(wi | cj)

1. Build an LM for each class.
2. During inference, predict the class that maximizes the product of the class prior P(cj) and the probability its LM assigns to the document, Π_{i=1,…,n} P(wi | cj).

Can this be extended to bigrams?
C_NB = argmax_{cj∈C} P(cj) Π_{i=1,…,n} P(wi | wi−1, cj)
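A sketch of this per-class LM view in log space; `bigram_prob(w, prev, c)` is a hypothetical callable returning a smoothed per-class bigram estimate P(w | wi−1, cj), so the same code covers both the unigram (ignore `prev`) and bigram variants:

```python
import math

def class_lm_score(tokens, c, prior, bigram_prob):
    """log P(c) + sum_i log P(w_i | w_{i-1}, c) for one candidate class c."""
    score = math.log(prior[c])
    prev = "<s>"                                   # assumed sentence-start symbol
    for w in tokens:
        score += math.log(bigram_prob(w, prev, c))
        prev = w
    return score

# Prediction: the class c maximizing class_lm_score(tokens, c, prior, bigram_prob)
```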
Summary: Naïve Bayes is Not So Naïve
• Very fast
• Low storage requirements
• Works well with small amounts of training data
• Interesting connection with LMs
• A good, dependable baseline for text classification
• But we’ll see other classifiers with better accuracy
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
• Naïve Bayes (generative): introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression (discriminative): strong baseline; fundamental relation with MLPs; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
Generative and Discriminative Classifiers

Example: classify an image into C = {dog, cat}
Generative Classifier

Cat model 😼 and Dog model 🐕

Training: learn what's in a cat image (e.g. whiskers, ears, eyes) and what's in a dog image (e.g. tail, ears, eyes).

Test: run both models on the image: how cat-y is this image? How dog-y is this image? Which model fits better?

C_NB = argmax_{c∈C} P(c) Π_{x∈X} P(x | c)
Discriminative Classifier
Just try to distinguish dogs from cats:
e.g. the dogs are wearing collars and the cats aren't.
Text Classification in Generative & Discriminative Models

Naïve Bayes (prior and likelihood):
ĉ = argmax_{c∈C} P(c) P(d | c)
  = argmax_{c∈C} P(c) Π_{x∈X} P(x | c)

Logistic Regression (posterior):
ĉ = argmax_{c∈C} P(c | d)
  = argmax_{c∈C} P(c | x1, . . . , xn)
Logistic Regression Inference
Weight vector: W = [w1, w2, …, wn]: how important is each feature for the positive class?

• xi = "review contains ‘awesome’": wi = +10
• xj = "review contains ‘abysmal’": wj = −10
• xk = "review contains ‘mediocre’": wk = −2
Logistic Regression Inference
Weight vector: W = [w1, w2, …, wn]: how important is each feature for the positive class?

1⃣ Remove unknown words: “Predictable with no fun”

2⃣ Represent the document as a feature vector x = [x1, x2, …, xn] (bag of words):

  just  plain  …  predictable  …  no  fun  …  summer
   0      0    0       1       0   1    1   0    0

3⃣ Compute class posteriors:
z = Σi wi ⋅ xi + b = W⋅x + b
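A minimal sketch of this scoring step, with an illustrative (not learned) weight vector over a toy vocabulary:

```python
import numpy as np

vocab = ["just", "plain", "boring", "predictable", "no", "fun", "summer"]   # toy subset
weights = np.array([0.0, 0.0, -1.5, -2.0, -1.0, 2.5, 0.5])   # illustrative weights W
bias = 0.1                                                    # illustrative bias b

doc = {"predictable", "no", "fun"}                            # after dropping unknown words
x = np.array([1.0 if w in doc else 0.0 for w in vocab])       # bag-of-words vector

z = weights @ x + bias                                        # z = W·x + b
print(z)
```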
Predicting ĉ

z = W⋅x + b

Can we treat z as a proxy for P(+ | x1, . . . , xn)?
A. Yes
B. No
Training a Logistic Regression Model

Supervised classification: for each example in the training set:
1. Represent it as a feature vector xi
2. Predict the positive class probability σ(W⋅xi + b)
3. Compute the loss: the distance between the gold class cj and the predicted class ĉj
4. Update the parameters W and b to minimize the loss using an optimization algorithm (not covered in 436N).
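A sketch of steps 2-3 for a single example, assuming NumPy and the standard cross-entropy loss (the slide only says the loss is a distance between gold and predicted class); the parameter update in step 4 is left to an optimizer and omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_and_loss(x, gold, weights, bias):
    """Step 2: p = sigma(W·x + b); step 3: cross-entropy loss against the gold label.

    `gold` is 1 for the positive class and 0 for the negative class.
    """
    p = sigmoid(weights @ x + bias)
    loss = -(gold * np.log(p) + (1 - gold) * np.log(1 - p))
    return p, loss
```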
Multiclass (Multinomial) Logistic Regression

softmax(zi) = exp(zi) / Σ_{j=1..K} exp(zj)
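A minimal, numerically stable sketch (subtracting the maximum before exponentiating is a standard trick not shown on the slide; it doesn't change the result):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))    # K = 3 class probabilities, summing to 1
```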
For Next Time

• Optional reading:
  • J&M: https://web.stanford.edu/~jurafsky/slp3/
  • MLP Classifier: Chapter 7, Section 7.4

• Quiz 2: due September 28 @ 11:59pm
