CPSC 436N
Term 1
Lecture 5: Text Classification 1 - Traditional Approaches
1
LM Evaluation: Goal
You may want to compare the performance (on the same test corpus) of:
• 2-grams with 3-grams
• two different smoothing techniques (given the same n-grams)
• Count-based vs. neural language models
2
Model Evaluation: Key Ideas
[Figure: the corpus is split into a Training Set and a Test Set]
3
Evaluation Metrics
$Q_1(s_1 \ldots s_N) \;?\; Q_2(s_1 \ldots s_N)$ : the probability of the sentences in the test set according to Q1 vs. according to Q2

Perplexity: $PP(s_1 \ldots s_N) = Q(s_1 \ldots s_N)^{-\frac{1}{N}} = \sqrt[N]{\dfrac{1}{\prod_{i=1}^{N} Q(s_i \mid s_{i-1})}}$   (bigram model)

Cross entropy: $H(s_1 \ldots s_N) = -\frac{1}{N} \log Q(s_1 \ldots s_N)$, and $PP(s_1 \ldots s_N) = 2^{H(P,Q)}$

Optional reading: chapter 3, section 3.8 J&M
4
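A minimal sketch of how these two quantities relate in code (the toy bigram model and the base-2 logs are illustrative assumptions, not part of the lecture):

```python
import math

def cross_entropy_and_perplexity(test_words, bigram_prob):
    """H = -(1/N) log2 Q(s_1 ... s_N) and PP = 2^H = Q(s_1 ... s_N)^(-1/N),
    where bigram_prob(prev, cur) returns Q(s_i | s_{i-1})."""
    N = len(test_words)
    log_prob = 0.0
    prev = "<s>"
    for w in test_words:
        log_prob += math.log2(bigram_prob(prev, w))
        prev = w
    H = -log_prob / N
    return H, 2 ** H

# Toy bigram model that gives every next word probability 0.25 (assumption for illustration):
H, PP = cross_entropy_and_perplexity("the cat sat".split(), lambda prev, cur: 0.25)
print(H, PP)   # 2.0 bits per word, perplexity 4.0
```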
Today
• Text Classification
• Traditional (non-neural) supervised approaches
– Naïve Bayes
– Relation to LM
– Logistic Regression
5
Text Classification: Definition
(aka Text Categorization)
Input:
• A document d
• A fixed set of classes C = {c1, c2, …, cJ}, e.g. {Technology, Sports, Politics, …}
7
Topic Classification
E.g. what is the topic of this medical article?
8
Spam Detection
d = email C = {spam, not spam}
9
Authorship Attribution
E.g. who wrote which Federalist papers?
• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1964: "solved" by Mosteller and Wallace using Bayesian methods
d = essay
d = Text, C = {healthy, sick}
12
Sentiment Analysis
E.g. what sentiment does this review express towards the movie?
d = Review
"It could have been a great movie. It does have beautiful scenery, some of the best since Lord of the Rings. The acting is well done, and I really like the son of the leader of the Samurai. He was a likeable chap, and I hated to see him die. But, other than that, this movie is nothing more than hidden rip-offs."
C = {positive, negative}   or   C = {⭐, ⭐⭐, ⭐⭐⭐, ⭐⭐⭐⭐, ⭐⭐⭐⭐⭐}
13
Other Examples of Sentiment Analysis
14
Classification Methods: Supervised Machine Learning
Given a fixed set of classes C = {c1, c2, …, cJ}, e.g. {pos, neg}
Training:
Input: a training set of m labeled documents (d1, c1), …, (dm, cm), e.g. (review1, neg), (review2, neg), (review3, pos), …
Inference: γ(d) = ci
Input: a document d
Output: a class ci
15
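A tiny runnable sketch of this training/inference contract; the class name and the majority-class behaviour are hypothetical illustrations, not the course's classifier:

```python
from collections import Counter
from typing import List, Tuple

class MajorityClassBaseline:
    """Toy classifier: train() consumes labeled documents (d_1, c_1), ..., (d_m, c_m);
    predict() plays the role of gamma(d) and returns a class c_i."""
    def train(self, labeled_docs: List[Tuple[str, str]]) -> None:
        # Ignore the documents and remember the most frequent training label.
        self.majority = Counter(c for _, c in labeled_docs).most_common(1)[0][0]

    def predict(self, d: str) -> str:
        return self.majority   # gamma(d) = c_i

clf = MajorityClassBaseline()
clf.train([("review1", "neg"), ("review2", "neg"), ("review3", "pos")])
print(clf.predict("some new review"))   # -> "neg"
```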
Today
• Text Classification
• Traditional (non-neural) supervised approaches
– Naïve Bayes
– Relation to LM
– Logistic Regression
16
Justification
Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
Logistic Regression: strong baseline; fundamental relation with MLP; generalized to sequence models (CRFs).
Multi Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
17
Today
• Text Classification
• Traditional (non-neural) supervised approaches
– Naïve Bayes
– Relation to LM
– Logistic Regression
• Features for Sentiment Analysis
18
Naïve Bayes Intuition
We want to find the most likely class c given the document d:
$c_{MAP} = \arg\max_{c \in C} P(c \mid d)$   (MAP is "maximum a posteriori" = the most likely class)
19
Naïve Bayes Classifier
$c_{MAP} = \arg\max_{c \in C} P(d \mid c)\,P(c)$   ("Likelihood" × "Prior")
$\qquad = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$   (where x1, x2, …, xn is the feature vector for d)
20
Solution: Naïve Bayes Independence Assumptions
Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c:
$P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot \ldots \cdot P(x_n \mid c)$
21
Naïve Bayes for Text Classification
Features: all the words in the document
$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i=1,\ldots,|d|} P(w_i \mid c)$
Notes:
1) Taking the log doesn't change the ranking of classes: the class with the highest probability also has the highest log probability.
2) It's a linear model: just a max of a sum of weights, i.e. a linear combination of the inputs. So Naïve Bayes is a linear classifier.
23
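Spelling out notes 1) and 2): in log space the same decision becomes an argmax over a sum of per-word weights, which is exactly a linear score:

$c_{NB} = \arg\max_{c \in C} \Big[ \log P(c) + \sum_{i=1,\ldots,|d|} \log P(w_i \mid c) \Big]$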
Sec. 13.3
Estimate the class prior: $\hat{P}(c_j) = \dfrac{N_{c_j}}{N_{total}}$ = (number of docs of class j) / (total number of docs)

Cat  | Documents
Train  − | just plain boring
       − | entirely predictable and lacks energy
       − | no surprises and very few laughs
       + | very powerful
       + | the most fun film of the summer
Test   ? | predictable with no fun

P( + ) = ?
24
Sec. 13.3
P(the | + ) = ?

Cat  | Documents
Train  − | just plain boring
       − | entirely predictable and lacks energy
       − | no surprises and very few laughs
       + | very powerful
       + | the most fun film of the summer
Test   ? | predictable with no fun

A. 0
B. 1/5
C. 2/9
25
Stop Words
26
Unknown Words
• Problem: unknown words may appear in the test set but not in the
training set (and vocabulary).
27
Naïve Bayes Training (with Add-1 Smoothing)

Cat  | Documents
Train  − | just plain boring
       − | entirely predictable and lacks energy
       − | no surprises and very few laughs
       + | very powerful
       + | the most fun film of the summer
Test   ? | predictable with no fun

1⃣ Extract vocabulary: V = {just, plain, . . . , summer} ⟹ |V| = 20

2⃣ Estimate the class prior: $\hat{P}(c_j) = \dfrac{N_{c_j}}{N_{total}}$, so P( − ) = 3/5, P( + ) = 2/5

3⃣ Compute the likelihoods: $\hat{P}(w_i \mid c_j) = \dfrac{count(w_i, c_j) + 1}{\left(\sum_{w \in V} count(w, c_j)\right) + |V|}$

P(predictable | − ) = (1+1)/(14+20) = 2/34      P(predictable | + ) = (0+1)/(9+20) = 1/29
P(no | − ) = (1+1)/(14+20) = 2/34               P(no | + ) = (0+1)/(9+20) = 1/29
P(fun | − ) = (0+1)/(14+20) = 1/34              P(fun | + ) = (1+1)/(9+20) = 2/29
28
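A minimal sketch (plain Python; the variable names are my own) that reproduces steps 1⃣-3⃣ on this toy corpus and recovers |V| = 20, the priors 3/5 and 2/5, and the smoothed likelihoods above:

```python
from collections import Counter

train = [("-", "just plain boring"),
         ("-", "entirely predictable and lacks energy"),
         ("-", "no surprises and very few laughs"),
         ("+", "very powerful"),
         ("+", "the most fun film of the summer")]

# 1) Vocabulary over the training documents
vocab = {w for _, doc in train for w in doc.split()}          # |V| = 20

# 2) Class priors: P(c_j) = N_{c_j} / N_total
n_docs = Counter(c for c, _ in train)
prior = {c: n_docs[c] / len(train) for c in n_docs}           # {'-': 0.6, '+': 0.4}

# 3) Add-1 smoothed likelihoods: P(w|c) = (count(w,c) + 1) / (sum_w count(w,c) + |V|)
word_counts = {c: Counter() for c in n_docs}
for c, doc in train:
    word_counts[c].update(doc.split())

def likelihood(w, c):
    return (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))

print(len(vocab), prior)
print(likelihood("predictable", "-"), likelihood("predictable", "+"))   # 2/34, 1/29
```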
Naïve Bayes Inference
P( − ) = 3/5    P( + ) = 2/5
P(predictable | − ) = 2/34    P(predictable | + ) = 1/29
P(no | − ) = 2/34    P(no | + ) = 1/29
P(fun | − ) = 1/34    P(fun | + ) = 2/29
29
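Putting the pieces together for the test document "predictable with no fun" (assuming, as is standard, that the unknown word "with" is simply ignored because it is not in V):

$P(-)\,P(\text{predictable} \mid -)\,P(\text{no} \mid -)\,P(\text{fun} \mid -) = \frac{3}{5} \cdot \frac{2}{34} \cdot \frac{2}{34} \cdot \frac{1}{34} \approx 6.1 \times 10^{-5}$

$P(+)\,P(\text{predictable} \mid +)\,P(\text{no} \mid +)\,P(\text{fun} \mid +) = \frac{2}{5} \cdot \frac{1}{29} \cdot \frac{1}{29} \cdot \frac{2}{29} \approx 3.2 \times 10^{-5}$

so the model labels the review as negative.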
Today
• Text Classification
• Traditional (non-neural) supervised approaches
– Naïve Bayes
– Relation to LM
– Logistic Regression
30
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of features: URLs, email addresses, dictionaries, …
• If the features = all of the words in the text,
• then Naïve Bayes is similar to language modeling:
$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i=1,\ldots,n} P(w_i \mid c_j)$
The highlighted formula may be interpreted as:
A. The bigram model for class cj
B. The unigram model for class cj
C. Neither A nor B
31
Naïve Bayes and Language Modeling
32
Summary: Naïve Bayes is Not So Naïve
• Very fast
• Low storage requirements
• Works well with small amounts of training data
• Interesting connection with LMs
• A good dependable baseline for text classification
• But we'll see other classifiers with better accuracy
33
Today
• Text Classification
• Traditional (non-neural) supervised approaches
– Naïve Bayes
– Relation to LM
– Logistic Regression
34
Naïve Bayes (Generative): introduces terminology and common representations; interesting relation with n-gram LMs.
Logistic Regression (Discriminative): strong baseline; fundamental relation with MLP; generalized to sequence models (CRFs).
Multi Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
35
Generative and Discriminative Classifiers
36
Generative Classifier
😼 Cat Model    🐕 Dog Model
37
Discriminative Classifier
Just try to distinguish dogs from cats
38
Text Classification in Generative & Discriminative Models
Generative: $\hat{c} = \arg\max_{c \in C} P(c)\,P(d \mid c) = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$
Discriminative: $\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(c \mid x_1, \ldots, x_n)$
39
Logistic Regression Inference
Weight vector: W = [w1, w2, …, wn]: how important is each feature for the positive class?
Input (bag-of-words): x = [0 0 0 1 0 1 1 0 0]
z = w · x + b
42
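A minimal sketch of this inference step, assuming the standard sigmoid σ(z) = 1/(1 + e^(−z)) turns the score z into P(positive | d); the weights and bias below are made-up illustrative values:

```python
import math

x = [0, 0, 0, 1, 0, 1, 1, 0, 0]                       # bag-of-words features from the slide
w = [0.0, 0.2, -0.1, 1.5, 0.0, -2.0, 0.8, 0.0, -0.3]  # illustrative weights (assumption)
b = 0.1                                               # illustrative bias (assumption)

z = sum(wi * xi for wi, xi in zip(w, x)) + b          # z = w · x + b
p_pos = 1.0 / (1.0 + math.exp(-z))                    # sigma(z) = P(positive | d)
print(z, p_pos, "positive" if p_pos > 0.5 else "negative")   # 0.4, ~0.60, positive
```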
Sec.13.3
43
Multiclass (Multinomial) Logistic Regression
$\mathrm{softmax}(z_i) = \dfrac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}$
44
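A small, numerically stable sketch of this softmax (subtracting the max before exponentiating is an implementation detail I am adding; the example scores are made up):

```python
import math

def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow
    and does not change the result."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ≈ [0.659, 0.242, 0.099], a distribution over K classes
```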
For Next Time
• Optional Reading:
• J&M: https://web.stanford.edu/~jurafsky/slp3/
• MLP Classifier - Chapter 7, section 7.4
45