
Text Classification

SNLP 2016

CSE, IIT Kharagpur

November 7th, 2016



Example: Positive or negative movie review?



Example: Male or Female Author?



Example: What is the subject of this article?



Text Classification

Assigning subject categories, topics, or genres


Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
...





Text classification: problem definition

Input
A document d
A fixed set of classes C = \{c_1, c_2, \ldots, c_n\}

Output
A predicted class c ∈ C





Classification Methods: Hand-coded rules

Rules based on combinations of words or other features

Spam
black-list-address OR (“dollars” AND “have been selected”)

Pros and Cons

Accuracy can be high if the rules are carefully refined by an expert, but
building and maintaining these rules is expensive.

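The rule above translates directly into code. A minimal sketch, where `blacklist` is a hypothetical stand-in for a real list of known spam senders:

```python
# Hand-coded spam rule from the slide:
# black-list-address OR ("dollars" AND "have been selected").
# `blacklist` is a placeholder for a real list of known spam senders.
blacklist = {"winner@lottery.example", "promo@deals.example"}

def is_spam(sender: str, body: str) -> bool:
    text = body.lower()
    return sender in blacklist or (
        "dollars" in text and "have been selected" in text)

print(is_spam("promo@deals.example", "Hello"))                          # True
print(is_spam("a@b.example", "You have been selected to win dollars"))  # True
print(is_spam("a@b.example", "Meeting at noon"))                        # False
```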


Classification Methods: Supervised Machine Learning

Naïve Bayes
Logistic regression
Support-vector machines
...



Naïve Bayes Intuition

A simple classification method based on Bayes’ rule.

Relies on a very simple representation of the document: a bag of words.



Bag of words for document classification

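The original slide's figure is not recoverable. As a sketch of the idea: a bag of words keeps only word counts and discards order entirely (the example document is invented):

```python
from collections import Counter

# Bag of words: the document reduces to word counts; word order is discarded.
doc = "great movie great acting but a predictable plot"
bag = Counter(doc.split())
print(bag)  # Counter({'great': 2, 'movie': 1, 'acting': 1, ...})
```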




Bayes’ rule for documents and classes

For a document d and a class c:

P(c|d) = \frac{P(d|c)\,P(c)}{P(d)}

Naïve Bayes Classifier

c_{MAP} = \arg\max_{c \in C} P(c|d)
        = \arg\max_{c \in C} P(d|c)\,P(c)
        = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n | c)\,P(c)

The denominator P(d) is dropped because it is identical for every class, and
the document d is represented by a set of features x_1, x_2, \ldots, x_n.




Naïve Bayes classification assumptions

P(x_1, x_2, \ldots, x_n | c) cannot be estimated directly (there are too many
possible feature combinations), so two simplifying assumptions are made.

Bag of words assumption
Assume that the position of a word in the document doesn't matter.

Conditional independence
Assume the feature probabilities P(x_i | c_j) are independent given the class c_j:

P(x_1, x_2, \ldots, x_n | c) = P(x_1|c) \cdot P(x_2|c) \cdots P(x_n|c)

c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x|c)

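On the slide the decision rule is a product of probabilities; in practice it is computed as a sum of log probabilities, because multiplying many small numbers underflows. A minimal sketch, assuming `log_prior[c]` and `log_likelihood[c][w]` have already been estimated (estimation is the topic of the next slides):

```python
import math

def classify(words, classes, log_prior, log_likelihood):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ].

    Summing logs replaces the product on the slide to avoid underflow.
    A word never seen with a class gets probability 0 (log = -inf) here;
    add-1 smoothing, covered later, removes these zeros.
    """
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c] + sum(
            log_likelihood[c].get(w, -math.inf) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```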



Learning the model parameters

Maximum Likelihood Estimate

\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}

\hat{P}(w_i | c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}

Problem with MLE

Suppose the training data contains no document in the class ‘positive’ with
the word “fantastic”. Then

\hat{P}(\mathrm{fantastic} \mid \mathrm{positive}) = 0,

and that single zero wipes out all other evidence in the product:

c_{NB} = \arg\max_{c} \hat{P}(c) \prod_{x_i \in X} \hat{P}(x_i | c)



Laplace (add-1) smoothing

\hat{P}(w_i | c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} (\mathrm{count}(w, c) + 1)}
                 = \frac{\mathrm{count}(w_i, c) + 1}{\left(\sum_{w \in V} \mathrm{count}(w, c)\right) + |V|}

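Putting the MLE prior and the add-1 smoothed likelihood together, a minimal training sketch (function and variable names are my own):

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """Train Naïve Bayes from docs, a list of (word_list, class) pairs.

    Returns log-space versions of the MLE prior P̂(c) and the add-1
    smoothed likelihood P̂(w|c) from the slides.
    """
    class_docs = Counter(c for _, c in docs)      # doccount(C = c_j)
    word_counts = defaultdict(Counter)            # count(w, c_j)
    for words, c in docs:
        word_counts[c].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values())      # sum_w count(w, c)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) /
                                         (total + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab
```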


A worked example

[The worked example was presented as a sequence of figures, which are not
recoverable here.]
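As a stand-in for the lost figures, here is a tiny invented two-class corpus run end to end with the `train_nb` and `classify` sketches above (whether it matches the original slides is unknown):

```python
# Five invented training documents in two classes.
train = [
    ("fun couple love love".split(), "comedy"),
    ("fast furious shoot".split(), "action"),
    ("couple fly fast fun fun".split(), "comedy"),
    ("furious shoot shoot fun".split(), "action"),
    ("fly fast shoot love".split(), "action"),
]
log_prior, log_likelihood, vocab = train_nb(train)

test = "fast couple shoot fly".split()
print(classify(test, log_prior.keys(), log_prior, log_likelihood))
# -> 'action' under these counts: the smoothed likelihoods of "fast",
#    "shoot", and "fly" favor the action class despite "couple".
```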



Naïve Bayes and Language Modeling

In general, NB classifier can use any feature


URL, email addresses, dictionaries, network features

But if we use only word features, and all the words in the text, Naïve Bayes
has an important similarity to language modeling: each class can be thought
of as a separate unigram language model.



Naïve Bayes as Language Modeling

Which class assigns a higher probability to the sentence?

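The slide's example sentence and probability tables are not recoverable. A sketch of the idea, scoring one sentence under two class-conditional unigram models whose per-word probabilities are hypothetical:

```python
import math

# Each class is a separate unigram language model; P(sentence|c) is the
# product of the per-word probabilities. The numbers are illustrative only.
models = {
    "positive": {"i": 0.1, "love": 0.1,   "this": 0.05, "fun": 0.01,  "film": 0.1},
    "negative": {"i": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}
sentence = "i love this fun film".split()
for c, model in models.items():
    log_p = sum(math.log(model[w]) for w in sentence)
    print(c, log_p)   # the class assigning the higher log-probability wins
```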



Naïve Bayes: More than Two Classes

Multi-value (any-of) classification
A document can belong to 0, 1, or more than one class.

Handling multi-value classification

For each class c ∈ C, build a classifier γ_c to distinguish c from all other
classes c′ ∈ C.
Given a test document d, evaluate it for membership in each class using each γ_c.
d belongs to every class for which γ_c returns true.




Naïve Bayes: More than Two Classes

One-of or multinomial classification

Classes are mutually exclusive: each document belongs to exactly one class.

Binary classifiers may also be used:

For each class c ∈ C, build a classifier γ_c to distinguish c from all other
classes c′ ∈ C.
Given a test document d, evaluate it for membership in each class using each γ_c.
Assign d to the single class whose classifier gives the maximum score.

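Both schemes use the same per-class binary classifiers and differ only in the final decision. A sketch, where `scores` stands in for hypothetical classifier outputs γ_c(d) and the 0.5 threshold is an illustrative stand-in for "γ_c returns true":

```python
# One binary classifier gamma_c per class, applied to the same document d.
# `scores` is a placeholder for {c: gamma_c(d)}; values are illustrative.
scores = {"politics": 0.8, "sports": 0.3, "nature": 0.6}

# Any-of (multi-label): keep every class whose classifier fires.
multi_label = [c for c, s in scores.items() if s >= 0.5]

# One-of (multinomial): keep only the class with the maximum score.
one_of = max(scores, key=scores.get)

print(multi_label)  # ['politics', 'nature']
print(one_of)       # 'politics'
```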


Evaluation: Constructing a Confusion Matrix

For each pair of classes ⟨c_1, c_2⟩, how many documents from c_1 were
incorrectly assigned to c_2 (for c_2 ≠ c_1)?

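A sketch of building the matrix from (actual, predicted) pairs; the pairs are placeholder data:

```python
from collections import Counter

# confusion[(actual, predicted)] = number of documents of class `actual`
# that the classifier assigned to class `predicted`.
pairs = [("politics", "politics"), ("politics", "sports"),
         ("sports", "sports"), ("nature", "politics")]
confusion = Counter(pairs)

print(confusion[("politics", "sports")])  # 1 politics doc mislabeled sports
```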



Per class evaluation measures

Recall
Fraction of docs in class i classified correctly:
\frac{c_{ii}}{\sum_{j} c_{ij}}

Precision
Fraction of docs assigned class i that are actually about class i:
\frac{c_{ii}}{\sum_{j} c_{ji}}

Accuracy
Fraction of docs classified correctly:
\frac{\sum_{i} c_{ii}}{N}

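With the matrix stored as c[i][j] (docs of true class i predicted as j), the three measures read off directly. A minimal sketch, assuming every class appears and is predicted at least once (otherwise the denominators would need guarding):

```python
def per_class_metrics(c, classes):
    """c[i][j]: number of docs with true class i predicted as class j."""
    n = sum(c[i][j] for i in classes for j in classes)
    accuracy = sum(c[i][i] for i in classes) / n
    recall = {i: c[i][i] / sum(c[i][j] for j in classes) for i in classes}
    precision = {i: c[i][i] / sum(c[j][i] for j in classes) for i in classes}
    return precision, recall, accuracy
```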



Micro- vs. Macro-Average

If we have more than one class, how do we combine multiple performance
measures into one quantity?

Macro-averaging
Compute performance for each class, then average.

Micro-averaging
Pool the decisions for all classes into a single contingency table, then evaluate.




Micro- vs. Macro-Average

Macro-averaged precision: (0.5 + 0.9)/2 = 0.7

Micro-averaged precision: 100/120 ≈ 0.83

The micro-averaged score is dominated by performance on the common classes.

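The per-class contingency tables behind these numbers were figures that are lost, but the quoted values pin them down: precision 0.5 forces tp = fp for class 1, precision 0.9 forces tp = 9·fp for class 2, and the pooled 100/120 then gives counts of 10/10 and 90/10. A sketch of both averages from those reconstructed counts:

```python
# Counts reconstructed from the slide's numbers (see derivation above).
counts = {"class1": {"tp": 10, "fp": 10},   # precision 10/20 = 0.5
          "class2": {"tp": 90, "fp": 10}}   # precision 90/100 = 0.9

macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()) / len(counts)
micro = (sum(c["tp"] for c in counts.values()) /
         sum(c["tp"] + c["fp"] for c in counts.values()))

print(macro)  # 0.7
print(micro)  # 0.8333... -- pulled toward the large class2
```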


Sample Problem

In a text-classification task, assume that you want to categorize various
documents into 3 classes: ‘politics’, ‘sports’, and ‘nature’. The table below
shows 16 such documents, their true labels, and the label assigned by your
classifier.

Doc.  Actual    Predicted      Doc.  Actual    Predicted
1     politics  sports         2     politics  politics
3     sports    sports         4     nature    politics
5     politics  politics       6     nature    nature
7     sports    sports         8     sports    nature
9     politics  politics       10    politics  politics
11    politics  nature         12    politics  sports
13    sports    politics       14    nature    politics
15    nature    nature         16    politics  nature

Construct the confusion matrix and find the macro-averaged and micro-averaged
precision values. Which category has the highest recall?

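A sketch to check your hand computation: it builds the confusion matrix from the table and computes both precision averages and the per-class recalls.

```python
from collections import Counter

# (actual, predicted) pairs for documents 1-16, copied from the table.
pairs = [("politics", "sports"),   ("politics", "politics"),
         ("sports",   "sports"),   ("nature",   "politics"),
         ("politics", "politics"), ("nature",   "nature"),
         ("sports",   "sports"),   ("sports",   "nature"),
         ("politics", "politics"), ("politics", "politics"),
         ("politics", "nature"),   ("politics", "sports"),
         ("sports",   "politics"), ("nature",   "politics"),
         ("nature",   "nature"),   ("politics", "nature")]

confusion = Counter(pairs)
classes = ["politics", "sports", "nature"]

precision = {c: confusion[(c, c)] / sum(confusion[(a, c)] for a in classes)
             for c in classes}
recall = {c: confusion[(c, c)] / sum(confusion[(c, p)] for p in classes)
          for c in classes}

macro_p = sum(precision.values()) / len(classes)
micro_p = sum(confusion[(c, c)] for c in classes) / len(pairs)
print(precision, recall, macro_p, micro_p)
```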
