
Natural Language Processing

CPSC 436N, Term 1
Lecture 5: Text Classification 1 - Traditional Approaches

Instructor: Vered Shwartz
https://www.cs.ubc.ca/~vshwartz
LM Evaluation: Goal
You may want to compare the performance (on the same test corpus) of:
• 2-grams vs. 3-grams
• two different smoothing techniques (given the same n-grams)
• count-based vs. neural language models
Model Evaluation: Key Ideas

A: Split the corpus into a training set and a test set.
B: Train the models (count-based and neural) on the training set, obtaining Q1 and Q2.
C: Apply the models to the test set and compare the results of Q1 and Q2.
Evaluation Metrics
Q1(s1 . . . sN) ? Q2(s1 . . . sN)
Probability of the sentences in the test set according to Q1 vs. according to Q2

Perplexity: PP(s1 . . . sN) = Q(s1 . . . sN)^(−1/N) = (1 / Π_{i=1..N} Q(si | si−1))^(1/N)   (bigram model)

Cross entropy: H(s1 . . . sN) = −(1/N) log Q(s1 . . . sN), so that PP(s1 . . . sN) = 2^H(P,Q)

Optional reading: J&M, chapter 3, section 3.8
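To make the relationship between the two metrics concrete, here is a minimal Python sketch; `bigram_prob(w, prev)` is a hypothetical callable returning a smoothed bigram estimate Q(w | prev), and a single `<s>` start symbol is assumed:

```python
import math

def bigram_perplexity(test_tokens, bigram_prob):
    """Perplexity of a bigram model Q over a tokenized test set."""
    log_prob = 0.0
    prev = "<s>"                                      # assumed sentence-start symbol
    for w in test_tokens:
        log_prob += math.log2(bigram_prob(w, prev))   # log2 Q(w | prev)
        prev = w
    cross_entropy = -log_prob / len(test_tokens)      # H = -(1/N) log2 Q(s1...sN)
    return 2 ** cross_entropy                         # PP = 2^H
```

Lower perplexity (equivalently, lower cross entropy) means the model assigns higher probability to the test set.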
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
Text Classification: Definition
(aka Text Categorization)

Input:
• A document d
• A fixed set of classes C = {c1, c2, …, cJ}, e.g. {Technology, Sports, Politics, …}

Output: a predicted class c ∈ C for document d, e.g. Technology
Topic Classification
E.g. what is the topic of this medical article?

d = MEDLINE article
C = MeSH Subject Category Hierarchy
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Spam Detection
d = email
C = {spam, not spam}
Authorship Attribution
E.g. who wrote which Federalist Papers?
• 1787-8: anonymous essays tried to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• The authorship of 12 of the letters was in dispute.
• 1964: “solved” by Mosteller and Wallace using Bayesian methods.

d = essay
C = {Jay, Madison, Hamilton}
Analyzing Author Attributes
E.g. is this person suffering from pre-symptomatic Alzheimer's?
(Otherwise it can only be detected with a costly MRI.)

d = the person’s description of a picture (the standard “Cookie Theft” test):
“there's a child climbing up getting cookies out of a cookie jar it's a boy and the girl is standing on the and his er stool is tipping over under him and his the girl is standing on the floor with her hand help up for a cookie you remember I lived in America so I use cookie”

C = {healthy, sick}
Language Identification

d = text
C = {Afrikaans, Albanian, Amharic, …, Hebrew, …, Yoruba, Zulu}
Sentiment Analysis
E.g. what sentiment does this review express towards the movie?

d = Review: “It could have been a great movie. It does have beautiful scenery, some of the best since Lord of the Rings. The acting is well done, and I really like the sone of the leader of the Samurai. He was a likeble chap, and I hated to see him die. But, other than that, this movie is nothing more that hidden rip-offs.”

C = {positive, negative} or C = {⭐, ⭐⭐, ⭐⭐⭐, ⭐⭐⭐⭐, ⭐⭐⭐⭐⭐}
Other Examples of Sentiment Analysis

• Restaurants: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
Classification Methods: Supervised Machine Learning
Given a fixed set of classes C = {c1, c2, …, cJ}, e.g. {pos, neg}

Training:
Input: a training set of m labeled documents (d1, c1), …, (dm, cm), e.g. (review1, neg), (review2, neg), (review3, pos), …
Output: a learned classifier (model) γ

Inference: γ(d) = ci
Input: a document d
Output: a class ci
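As an illustration of this training/inference interface, here is a minimal sketch using scikit-learn (an assumption; the deck itself is library-agnostic). The learner happens to be a bag-of-words Naïve Bayes model, which the next slides derive from scratch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set of (document, class) pairs
train_docs = ["just plain boring", "very powerful"]
train_labels = ["neg", "pos"]

gamma = make_pipeline(CountVectorizer(), MultinomialNB())  # the classifier γ
gamma.fit(train_docs, train_labels)                        # training

print(gamma.predict(["the most fun film of the summer"]))  # inference: γ(d) = ci
```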
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
Justification
• Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression: strong baseline; fundamental relation with MLPs; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
• Features for Sentiment Analysis
Naïve Bayes Intuition
We want to find the most likely class c given the document d:

C_MAP = argmax_{c∈C} P(c | d)        MAP is “maximum a posteriori” = most likely class

Simple ("naive") classification method based on Bayes rule:

      = argmax_{c∈C} P(d | c) P(c) / P(d)     (Bayes rule)
      = argmax_{c∈C} P(d | c) P(c)            (the denominator is constant across classes)
Naïve Bayes Classifier

C_MAP = argmax_{c∈C} P(d | c) P(c)                      ("likelihood" × "prior")
      = argmax_{c∈C} P(x1, x2, . . . , xn | c) P(c)     (x1, …, xn is the feature vector for d)

This requires O(|X|^n ⋅ |C|) parameters: a probability for each class c ∈ C and each configuration of the n features.
|X|^n can be very large, e.g. n = 20 features with 10 possible values each gives 10^20 configurations!
It can only be estimated given a very large number of training examples.
Solution: Naïve Bayes Independence Assumptions
Conditional independence: assume the feature probabilities P(xi | cj) are independent given the class c:

P(x1, x2, . . . , xn | c) = P(x1 | c) ⋅ P(x2 | c) ⋅ . . . ⋅ P(xn | c)

C_MAP = argmax_{c∈C} P(x1, x2, . . . , xn | c) P(c)     O(|X|^n ⋅ |C|) parameters
C_NB  = argmax_{c∈C} P(c) Π_{x∈X} P(x | c)              O(n ⋅ |X| ⋅ |C|) parameters
Naïve Bayes for Text Classification
Features: all the words in the document

C_NB = argmax_{c∈C} P(c) Π_{i=1,…,|d|} P(wi | c)

d = “Best movie of the summer”, C = {pos, neg}

score_pos = P(pos) ⋅ P(best | pos) ⋅ P(movie | pos) ⋅ P(of | pos) ⋅ P(the | pos) ⋅ P(summer | pos)
score_neg = P(neg) ⋅ P(best | neg) ⋅ P(movie | neg) ⋅ P(of | neg) ⋅ P(the | neg) ⋅ P(summer | neg)

Problem: multiplying probabilities may result in floating-point underflow!
E.g.: .0006 * .0007 * .0009 * .01 * .5 * .000008…
Solution: sum the log probabilities (log(ab) = log(a) + log(b)).
E.g.: log(.0006) + log(.0007) + log(.0009) + log(.01) + log(.5) + log(.000008)…
Naïve Bayes for Text Classification

Notes:
1) Taking the log doesn't change the ranking of classes: the class with the highest probability also has the highest log probability.
2) It's a linear model: just a max of a sum of weights, i.e. a linear combination of the inputs. So Naïve Bayes is a linear classifier.
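A minimal sketch of this log-space scoring, assuming hypothetical dictionaries `log_prior[c]` = log P(c) and `log_likelihood[c][w]` = log P(w | c) estimated as on the following slides:

```python
import math

def nb_classify(doc_tokens, log_prior, log_likelihood):
    """Return the class with the highest log P(c) + sum_i log P(w_i | c)."""
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for w in doc_tokens:
            if w in log_likelihood[c]:           # unknown words are skipped (see later slide)
                score += log_likelihood[c][w]    # sum of logs instead of a product
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```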
Training a Naïve Bayes Model (Sec. 13.3)

Maximum likelihood estimates: use the frequencies in the data.

P̂(cj) = N_cj / N_total = (number of docs of class j) / (total number of docs)

Cat | Documents
Train:
  −   just plain boring
  −   entirely predictable and lacks energy
  −   no surprises and very few laughs
  +   very powerful
  +   the most fun film of the summer
Test:
  ?   predictable with no fun

P(+) = ?
Training a Naïve Bayes Model (Sec. 13.3)

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
= the fraction of times word wi appears among all words in documents of class cj

Cat | Documents
Train:
  −   just plain boring
  −   entirely predictable and lacks energy
  −   no surprises and very few laughs
  +   very powerful
  +   the most fun film of the summer
Test:
  ?   predictable with no fun

P(the | +) = ?
A. 0
B. 1/5
C. 2/9
Stop Words

Some systems remove stop words: very frequent words like “the” and “a”.

Method 1: use a list of stop words.
Method 2: remove the top k most frequent words in the training set.

Note: removing stop words doesn't usually help, so in practice most NB implementations don't remove them.
Unknown Words
• Problem: unknown words may appear in the test set but not in the training set (and vocabulary).

• Solution: ignore them, i.e. remove them from the test document.

• Aren’t smoothing and removing unknown words redundant?
  • No! Smoothing makes sure that we have estimates for each w ∈ V given each class, even if w (e.g. “worst”) didn’t appear in class c (e.g. positive). Removing words from the test document is a way to handle out-of-vocabulary words.
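A tiny sketch of this step, with an illustrative vocabulary subset (not the full V of the running example):

```python
vocab = {"predictable", "no", "fun", "the", "just"}    # illustrative subset of V
test_tokens = "predictable with no fun".split()

known_tokens = [w for w in test_tokens if w in vocab]  # "with" is dropped
print(known_tokens)                                    # ['predictable', 'no', 'fun']
```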
Naïve Bayes Training (with Add-1 Smoothing)

Cat | Documents
Train:
  −   just plain boring
  −   entirely predictable and lacks energy
  −   no surprises and very few laughs
  +   very powerful
  +   the most fun film of the summer
Test:
  ?   predictable with no fun

1⃣ Extract the vocabulary: V = {just, plain, . . . , summer} ⟹ |V| = 20

2⃣ Estimate the class priors: P̂(cj) = N_cj / N_total
   P(−) = 3/5, P(+) = 2/5

3⃣ Compute the likelihoods: P̂(wi | cj) = (count(wi, cj) + 1) / ((Σ_{w∈V} count(w, cj)) + |V|)
   P(predictable | −) = (1+1)/(14+20) = 2/34     P(predictable | +) = (0+1)/(9+20) = 1/29
   P(no | −) = (1+1)/(14+20) = 2/34              P(no | +) = (0+1)/(9+20) = 1/29
   P(fun | −) = (0+1)/(14+20) = 1/34             P(fun | +) = (1+1)/(9+20) = 2/29
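A minimal sketch of these three steps on the toy corpus above, assuming whitespace tokenization (variable names are illustrative):

```python
from collections import Counter

train = [("-", "just plain boring"),
         ("-", "entirely predictable and lacks energy"),
         ("-", "no surprises and very few laughs"),
         ("+", "very powerful"),
         ("+", "the most fun film of the summer")]

# 1. Extract the vocabulary V
vocab = {w for _, doc in train for w in doc.split()}            # |V| = 20

# 2. Estimate class priors P(c) = N_c / N_total
n_docs = Counter(c for c, _ in train)
prior = {c: n_docs[c] / len(train) for c in n_docs}             # P(-) = 3/5, P(+) = 2/5

# 3. Estimate likelihoods with add-1 smoothing:
#    P(w | c) = (count(w, c) + 1) / (total words in class c + |V|)
word_counts = {c: Counter() for c in n_docs}
for c, doc in train:
    word_counts[c].update(doc.split())
likelihood = {c: {w: (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
                  for w in vocab}
              for c in n_docs}

print(likelihood["-"]["predictable"])   # 2/34, as computed on the slide
```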
Naïve Bayes Inference

P(−) = 3/5, P(+) = 2/5
P(predictable | −) = 2/34     P(predictable | +) = 1/29
P(no | −) = 2/34              P(no | +) = 1/29
P(fun | −) = 1/34             P(fun | +) = 2/29

1⃣ Remove unknown words: “Predictable with no fun” → “predictable no fun” (“with” is out of vocabulary)

2⃣ Compute class posteriors (up to the constant denominator):
P(− | predictable with no fun) ∝ (3/5) ⋅ (2/34) ⋅ (2/34) ⋅ (1/34) = 6.1 × 10⁻⁵
P(+ | predictable with no fun) ∝ (2/5) ⋅ (1/29) ⋅ (1/29) ⋅ (2/29) = 3.2 × 10⁻⁵

3⃣ Predict the class with the higher score (here: −)
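Continuing the training sketch from the previous slide (reusing its `vocab`, `prior`, and `likelihood`), the inference step looks like this:

```python
test_doc = "predictable with no fun"

# 1. Drop out-of-vocabulary words ("with" is not in V)
tokens = [w for w in test_doc.split() if w in vocab]

# 2. Unnormalized class posteriors: P(c) * prod_i P(w_i | c)
scores = {c: prior[c] for c in prior}
for c in scores:
    for w in tokens:
        scores[c] *= likelihood[c][w]
# scores["-"] ≈ 6.1e-5, scores["+"] ≈ 3.2e-5

# 3. Predict the class with the higher score
print(max(scores, key=scores.get))      # "-"
```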
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of features: URLs, email addresses, dictionaries, …
• If the features = all of the words in the text,
• then Naïve Bayes is similar to language modeling:

C_NB = argmax_{cj∈C} P(cj) Π_{i=1,…,n} P(wi | cj)

The highlighted formula may be interpreted as:
A. The bigram model for class cj
B. The unigram model for class cj
C. Neither A nor B
Naïve Bayes and Language Modeling

C_NB = argmax_{cj∈C} P(cj) Π_{i=1,…,n} P(wi | cj)

1. Build an LM for each class.
2. During inference, predict the class that maximizes the product of the class prior P(cj) and the probability its LM assigns to the document, Π_{i=1,…,n} P(wi | cj).

Can this be extended to bigrams?
C_NB = argmax_{cj∈C} P(cj) Π_{i=1,…,n} P(wi | wi−1, cj)
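A sketch of this per-class LM view in log space; `bigram_prob(w, prev, c)` is a hypothetical callable returning a smoothed per-class bigram estimate P(w | wi−1, cj), so the same code covers both the unigram (ignore `prev`) and bigram variants:

```python
import math

def class_lm_score(tokens, c, prior, bigram_prob):
    """log P(c) + sum_i log P(w_i | w_{i-1}, c) for one candidate class c."""
    score = math.log(prior[c])
    prev = "<s>"                                   # assumed sentence-start symbol
    for w in tokens:
        score += math.log(bigram_prob(w, prev, c))
        prev = w
    return score

# Prediction: the class c maximizing class_lm_score(tokens, c, prior, bigram_prob)
```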
Summary: Naïve Bayes is Not So Naïve
• Very fast
• Low storage requirements
• Works well with small amounts of training data
• Interesting connection with LMs
• A good, dependable baseline for text classification
• But we’ll see other classifiers with better accuracy
Today
• Text Classification
• Traditional (non-neural) supervised approaches
  – Naïve Bayes
  – Relation to LM
  – Logistic Regression
• Naïve Bayes (generative): introduces terminology and common representations; interesting relation with n-gram LMs.
• Logistic Regression (discriminative): strong baseline; fundamental relation with MLPs; generalized to sequence models (CRFs).
• Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
• Convolutional Neural Networks: more popular in vision; some related techniques are critical in extending large neural LMs to long text; competing with transformers in multimodal (vision + language) models.
Generative and Discriminative Classifiers

Example: classify an image into C = {dog, cat}
Generative Classifier

Cat model 😼 and Dog model 🐕

Training: learn what's in a cat image (e.g. whiskers, ears, eyes) and what's in a dog image (e.g. tail, ears, eyes).

Test: run both models on the image: how cat-y is this image? How dog-y is this image? Which model fits better?

C_NB = argmax_{c∈C} P(c) Π_{x∈X} P(x | c)
Discriminative Classifier
Just try to distinguish dogs from cats:
e.g. the dogs are wearing collars and the cats aren't.
Text Classification in Generative & Discriminative Models

Naïve Bayes (prior and likelihood):
ĉ = argmax_{c∈C} P(c) P(d | c)
  = argmax_{c∈C} P(c) Π_{x∈X} P(x | c)

Logistic Regression (posterior):
ĉ = argmax_{c∈C} P(c | d)
  = argmax_{c∈C} P(c | x1, . . . , xn)
Logistic Regression Inference
Weight vector: W = [w1, w2, …, wn]: how important is each feature for the positive class?

• xi = "review contains ‘awesome’": wi = +10
• xj = "review contains ‘abysmal’": wj = −10
• xk = "review contains ‘mediocre’": wk = −2
Logistic Regression Inference
Weight vector: W = [w1, w2, …, wn]: how important is each feature for the positive class?

1⃣ Remove unknown words: “Predictable with no fun”

2⃣ Represent the document as a feature vector x = [x1, x2, …, xn] (bag of words):

  just  plain  …  predictable  …  no  fun  …  summer
   0      0    0       1       0   1    1   0    0

3⃣ Compute class posteriors:
z = Σi wi ⋅ xi + b = W⋅x + b
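A minimal sketch of this scoring step, with an illustrative (not learned) weight vector over a toy vocabulary:

```python
import numpy as np

vocab = ["just", "plain", "boring", "predictable", "no", "fun", "summer"]   # toy subset
weights = np.array([0.0, 0.0, -1.5, -2.0, -1.0, 2.5, 0.5])   # illustrative weights W
bias = 0.1                                                    # illustrative bias b

doc = {"predictable", "no", "fun"}                            # after dropping unknown words
x = np.array([1.0 if w in doc else 0.0 for w in vocab])       # bag-of-words vector

z = weights @ x + bias                                        # z = W·x + b
print(z)
```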
Predicting ĉ

z = W⋅x + b

Can we treat z as a proxy for P(+ | x1, . . . , xn)?
A. Yes
B. No
Training a Logistic Regression Model

Supervised classification: for each example in the training set:
1. Represent it as a feature vector xi
2. Predict the positive class probability σ(W⋅xi + b)
3. Compute the loss: the distance between the gold class cj and the predicted class ĉj
4. Update the parameters W and b to minimize the loss using an optimization algorithm (not covered in 436N).
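A sketch of steps 2-3 for a single example, assuming NumPy and the standard cross-entropy loss (the slide only says the loss is a distance between gold and predicted class); the parameter update in step 4 is left to an optimizer and omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_and_loss(x, gold, weights, bias):
    """Step 2: p = sigma(W·x + b); step 3: cross-entropy loss against the gold label.

    `gold` is 1 for the positive class and 0 for the negative class.
    """
    p = sigmoid(weights @ x + bias)
    loss = -(gold * np.log(p) + (1 - gold) * np.log(1 - p))
    return p, loss
```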
Multiclass (Multinomial) Logistic Regression

softmax(zi) = exp(zi) / Σ_{j=1..K} exp(zj)
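A minimal, numerically stable sketch (subtracting the maximum before exponentiating is a standard trick not shown on the slide; it doesn't change the result):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))    # K = 3 class probabilities, summing to 1
```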
For Next Time

• Optional reading:
  • J&M: https://web.stanford.edu/~jurafsky/slp3/
  • MLP Classifier: Chapter 7, Section 7.4

• Quiz 2: due September 28 @ 11:59pm
