
ON THE APPLICATION OF

LINGUISTIC QUANTIFIERS
FOR TEXT CATEGORIZATION

Slawomir Zadrozny and Janusz Kacprzyk

Systems Research Institute


Polish Academy of Sciences

Warsaw School of Information Technology

Warsaw, POLAND

• Natural interpretation of various concepts of
information retrieval in terms of fuzzy logic
• Multilabel text categorization solved via a
similarity-based classifier
• Linguistic quantifier as a flexible aggregation
operator for similarity assessment and thresholding
strategy
BASICS OF TEXT CATEGORIZATION (1)

Notation

D = {di}i=1,N - a set of text documents

T = {tj}j=1,M - a set of index terms


F: D × T → [0, 1] - assignment of weights to terms
appearing in a document

C = {ci}i=1,S - a set of categories


Ξ: D × C → {0, 1} - assignment of categories to
documents
D1 ⊂ D - a set of training text documents
BASICS OF TEXT CATEGORIZATION (2)

Document representation

“a bag of words” (index terms)


vector space model:
di → [wi1,...,wiM],   wij = F(di,tj),   di ∈ [0,1]^M

tf × idf is a popular version of function F:

$$F(d_i,t_j) = \frac{f_{ij}\,\log(N/n_j)}{\max_{j}\,\bigl(f_{ij}\,\log(N/n_j)\bigr)}$$

fij - frequency of a term tj in a document di,
N - number of all documents in the set D
nj - number of documents (in D) containing term tj

first factor - frequency of term tj in di
(tf, term frequency)
second factor - inverse frequency (in the collection D)
of documents in which term tj appears at least once
(idf, inverse document frequency)
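
A minimal sketch of the normalized tf × idf weighting for a single document, assuming the denominator is the largest raw weight among the terms of that document. The function name, argument names and sample counts are illustrative assumptions, not taken from the slides.

```python
import math

def tf_idf_weights(term_counts, num_docs, doc_freq):
    """Normalized tf x idf weights for one document.

    term_counts: {term: f_ij}, raw frequency of each term in the document
    num_docs:    N, number of documents in the collection D
    doc_freq:    {term: n_j}, number of documents containing the term
    """
    raw = {t: f * math.log(num_docs / doc_freq[t]) for t, f in term_counts.items()}
    top = max(raw.values(), default=0.0)
    if top <= 0.0:
        # Every term occurs in all documents (idf = 0): nothing to normalize.
        return {t: 0.0 for t in raw}
    # Divide by the largest raw weight so that all weights fall in [0, 1].
    return {t: v / top for t, v in raw.items()}

# Hypothetical counts for a collection of 100 documents.
weights = tf_idf_weights(
    {"fuzzy": 3, "logic": 1, "the": 5},
    num_docs=100,
    doc_freq={"fuzzy": 10, "logic": 20, "the": 100},
)
```

A term occurring in every document ("the" in this example) receives weight 0 regardless of its frequency, which is the intended effect of the idf factor.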
BASICS OF TEXT CATEGORIZATION (3)

Classification problem

Ξ: D1 × C → {0, 1} ⇒ Ξ: D × C → {0, 1}

Two phases:

• learning of classification rules (explicit or implicit;
building a classifier) from examples of documents
with known class assignment (supervised learning),

• classification of documents unseen earlier using
derived rules.

Special cases

1. C = {c1, c2} (single class)


2. ∀d ∃ci Ξ(d, ci) = 1 ∧ ∀cj≠ci Ξ(d, cj) = 0 (single label)

Here: multiclass multilabel problem considered


ROCCHIO’S ALGORITHM

Learning phase

Computation of a centroid vector pi for each category ci


of documents:
P = {pi}i=1,...,S pi = [ui1,…,uiM]

uij = 1/K ∑l=1,…,K wlj

K is the number of documents of category ci


M is the number of terms used to index the documents

Classification phase

Selection of a category whose centroid is most similar


to the document under consideration

Classically, similarity assessment via cosine measure:

$$\mathrm{sim}(d,p) = \frac{\sum_{i=1}^{M} w_i u_i}{\sqrt{\sum_{i=1}^{M} w_i^2}\,\sqrt{\sum_{i=1}^{M} u_i^2}}$$

where d = [w1,…,wM] and p = [u1,...,uM].
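
The two phases can be sketched as follows: centroids as component-wise means of the training vectors, then selection of the category with the highest cosine similarity. The training vectors and category names below are made up for illustration; this is a sketch of the classical scheme, not the authors' code.

```python
import math

def centroid(doc_vectors):
    """Component-wise mean of the K document vectors of one category."""
    k = len(doc_vectors)
    return [sum(column) / k for column in zip(*doc_vectors)]

def cosine(d, p):
    """Cosine similarity between document vector d and centroid p."""
    dot = sum(w * u for w, u in zip(d, p))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(u * u for u in p))
    return dot / norm if norm > 0 else 0.0

def classify(d, centroids):
    """Return the category whose centroid is most similar to d."""
    return max(centroids, key=lambda c: cosine(d, centroids[c]))

# Hypothetical training data: M = 3 index terms, two categories.
training = {
    "grain": [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0]],
    "trade": [[0.0, 1.0, 0.4], [0.2, 0.6, 1.0]],
}
centroids = {c: centroid(vectors) for c, vectors in training.items()}
label = classify([0.9, 0.1, 0.0], centroids)
```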


PROPOSED ALGORITHM (1)

Centroids

$$u_{ij} = \frac{f_{ij}\,\log(S/n_j)}{\max_{j}\,\bigl(f_{ij}\,\log(S/n_j)\bigr)}$$

fij - frequency of term tj in all documents of category ci
nj - number of categories whose documents contain term tj
(“category frequency”).

By analogy to tf×idf it may be called a tf×icf
representation, where icf stands for an inverted
category frequency.

A document to be classified d is represented, as


previously, by a vector:

d = [w1,...,wM]
$$w_j = \frac{f_j\,\log(S/n_j)}{\max_{j}\,\bigl(f_j\,\log(S/n_j)\bigr)}$$
fj - frequency of term tj in the document d
nj - category frequency of this term
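
A minimal sketch of the tf × icf representation, assuming normalization by the largest raw weight within each vector and assuming every term of the document to be classified occurs somewhere in the training data. Function names, category names and frequencies are hypothetical.

```python
import math

def category_frequency(category_term_freqs):
    """n_j: number of categories whose documents contain term t_j."""
    n = {}
    for freqs in category_term_freqs.values():
        for term, f in freqs.items():
            if f > 0:
                n[term] = n.get(term, 0) + 1
    return n

def tf_icf(freqs, cat_freq, num_categories):
    """Normalized tf x icf weights for a centroid or a document.

    freqs:          {term: frequency} (per category or per document)
    cat_freq:       {term: n_j}, category frequency of each term
    num_categories: S, number of categories
    """
    raw = {t: f * math.log(num_categories / cat_freq[t]) for t, f in freqs.items()}
    top = max(raw.values(), default=0.0)
    return {t: (v / top if top > 0 else 0.0) for t, v in raw.items()}

# Hypothetical term frequencies summed over the training documents of each category.
category_term_freqs = {
    "grain": {"wheat": 12, "price": 4},
    "trade": {"tariff": 7, "price": 6},
    "money": {"rate": 9, "price": 3},
}
S = len(category_term_freqs)
n = category_frequency(category_term_freqs)
centroids = {c: tf_icf(f, n, S) for c, f in category_term_freqs.items()}
doc = tf_icf({"wheat": 2, "price": 1}, n, S)
```

A term present in every category ("price" here) gets weight 0 in all centroids and documents, so only category-specific terms contribute to matching.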
PROPOSED ALGORITHM (2)

As in the original algorithm, the classification decision is
based on the similarity of a document and a centroid

Intuitive interpretation of this similarity:

“terms representing both a document and a centroid


have comparable weights, relatively or absolutely”

The centroids are aggregates of many documents, thus
one should not expect a match between a centroid and
a document along all dimensions of their
representation. It is more reasonable to require that

“along most of the dimensions there is a match”

This may be formalized as follows using the calculus of


linguistically quantified propositions:

“A document matches a category if most of the


important terms of the document are present in the
centroid of the category”
CALCULUS OF LINGUISTICALLY
QUANTIFIED PROPOSITIONS

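Most (Q) elements of a set X possess property S:

$$\underset{x \in X}{Q}\; S(x)$$

Most (Q) elements of a set X possessing property F
possess also property S:

$$\underset{x \in X}{Q}\,(F(x), S(x)) \quad\text{or}\quad \underset{x \in X}{QF}\; S(x)$$

$$X = \{x_1, \ldots, x_N\}$$
$$\mu_S : X \to [0,1], \quad \mu_F : X \to [0,1], \quad \mu_Q : [0,1] \to [0,1]$$

$$\mathrm{truth}\Bigl(\underset{x \in X}{Q}\, S(x)\Bigr) = \mu_Q\Bigl(\frac{1}{N}\sum_{i=1}^{N}\mu_S(x_i)\Bigr)$$

$$\mathrm{truth}\Bigl(\underset{x \in X}{Q}\,(F(x), S(x))\Bigr) = \mu_Q\left(\frac{\sum_{i=1}^{N}\bigl(\mu_F(x_i)\wedge\mu_S(x_i)\bigr)}{\sum_{i=1}^{N}\mu_F(x_i)}\right)$$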
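
Both truth degrees can be sketched in a few lines, taking the minimum as the conjunction ∧. The slides do not specify µQ; the piecewise-linear membership function for "most" used here, with breakpoints 0.3 and 0.8, is an assumption chosen only for illustration.

```python
def mu_most(x, lo=0.3, hi=0.8):
    """Assumed membership function of 'most': 0 up to lo, 1 from hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def truth_q(mu_s, mu_q=mu_most):
    """Truth of 'Q elements of X possess property S'."""
    return mu_q(sum(mu_s) / len(mu_s))

def truth_q_f(mu_f, mu_s, mu_q=mu_most):
    """Truth of 'Q elements of X possessing F possess also S' (min as conjunction)."""
    denom = sum(mu_f)
    if denom == 0:
        return 0.0  # assumption: no element possesses F, so the proposition gets truth 0
    return mu_q(sum(min(f, s) for f, s in zip(mu_f, mu_s)) / denom)

# Three of four elements fully possess S: the proportion 0.75 is 'mostly' satisfied.
t1 = truth_q([1.0, 1.0, 1.0, 0.0])
# Only elements with nonzero membership in F influence the second form.
t2 = truth_q_f([1.0, 0.5, 0.0], [1.0, 0.0, 1.0])
```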
PROPOSED ALGORITHM (3)

“A document matches a category if most of the


important terms of the document are present in the
centroid of the category”

Q B’s are G’s

X, the universe considered, is a set T of all index terms


B is a fuzzy set of terms important for the document d,
i.e., µB(tj) = wj
G is a fuzzy set of terms present in centroid pi of
category ci, i.e., µG(tj) = uij.

Then,

similarity(d,p) = truth(Q B’s are G’s)

We also tested a modified version where only terms


weighted in the document higher than a certain
threshold or only, say, 10 top ranked terms, are
considered.
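
A sketch of this matching degree, including the two modified variants (weight threshold, top-ranked terms). The membership function of "most" and the parameter names min_weight and top_k are assumptions made for the example; the slides do not fix them.

```python
def mu_most(x, lo=0.3, hi=0.8):
    """Assumed membership function of 'most' (piecewise linear)."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def similarity(doc, centroid, mu_q=mu_most, min_weight=0.0, top_k=None):
    """truth(Q B's are G's) for weight lists doc and centroid over the same index terms.

    min_weight: keep only terms weighted in the document above this value
    top_k:      if given, keep only the top_k highest-weighted document terms
    """
    idx = [j for j, w in enumerate(doc) if w > min_weight]
    if top_k is not None:
        idx = sorted(idx, key=lambda j: doc[j], reverse=True)[:top_k]
    denom = sum(doc[j] for j in idx)
    if denom == 0:
        return 0.0
    # mu_B(t_j) = w_j, mu_G(t_j) = u_ij, conjunction modeled by the minimum.
    num = sum(min(doc[j], centroid[j]) for j in idx)
    return mu_q(num / denom)

doc = [1.0, 0.5, 0.1, 0.0]
good = similarity(doc, [0.9, 0.6, 0.0, 0.7])      # important terms well covered
partial = similarity(doc, [0.5, 0.3, 0.0, 0.0])   # about half of the weight covered
top1 = similarity(doc, [0.5, 0.3, 0.0, 0.0], top_k=1)
```

Unlike the cosine measure, this degree is not symmetric: terms that are heavy in the centroid but absent from the document do not lower the score.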
THRESHOLDING STRATEGY (1)

Many classifiers, including the one considered here,
produce, for a document to be classified and each
category, a matching degree expressing the extent to
which the document possibly belongs to the given category.

For the single-label problem, the category with the highest
matching degree is chosen.

In the multilabel case a subset of categories has to be
chosen somehow - a thresholding strategy is required.

Classically:

• rank-based thresholding (RCut):
chooses the r top-ranked categories for each document
• proportion-based assignment (PCut):
for “batch categorization” (a set/batch of documents has
to be classified at once); assigns to each category a
number of documents that preserves the proportion of
category cardinalities observed in the training set
• score-based local optimization (SCut):
assigns a document to a category only if the matching
score of this category and document is higher than a
threshold.
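
The per-document strategies RCut and SCut can be sketched as below; PCut is left out because it needs the whole batch of documents at once. The scores and per-category thresholds are invented for the example; in practice SCut thresholds would be tuned on held-out data.

```python
def rcut(scores, r):
    """RCut: assign the r categories with the highest matching scores."""
    return sorted(scores, key=scores.get, reverse=True)[:r]

def scut(scores, thresholds):
    """SCut: assign every category whose score exceeds its own threshold."""
    return [c for c, s in scores.items() if s > thresholds[c]]

# Hypothetical matching degrees of one document against four categories.
scores = {"grain": 0.9, "wheat": 0.7, "trade": 0.4, "money": 0.1}
by_rank = rcut(scores, 2)
by_score = scut(scores, {"grain": 0.5, "wheat": 0.8, "trade": 0.3, "money": 0.5})
```

RCut always returns exactly r categories, whereas SCut may return none or all of them, depending on the thresholds.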
THRESHOLDING STRATEGY (2)

Our proposed thresholding strategy (RCut type) may be
intuitively expressed as follows:

"Select such a threshold r that most of the important
categories had a number of sibling categories similar to
r in the training data set"

For each r ∈ [1,R] compute the truth degree of the
clause quoted above (R is a parameter),
using Zadeh’s calculus of linguistically quantified
propositions:

Q B’s are G’s

X, the universe considered, is a subset of C consisting of the 10
categories with the highest matching scores,
B is a fuzzy set of important categories for a given
document d, i.e., µB(ci) = sim(d,ci),
G is a fuzzy set of categories that, on average, had in
the training set a number of sibling categories similar
to r. This similarity is modeled by a similarity relation,
which is another parameter of the method. By a
sibling category of a category ci we mean a category
that is assigned to the same document as the category
ci.
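
One possible reading of this strategy is sketched below. Several ingredients are assumptions, since the slides leave them open: the membership function of "most", the triangular similarity relation close_to with its width, and the rule of picking the r with the highest truth degree (smallest r on ties). All names and data are hypothetical.

```python
def mu_most(x, lo=0.3, hi=0.8):
    """Assumed membership function of 'most' (piecewise linear)."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def close_to(a, r, width=2.0):
    """Assumed similarity relation: triangular closeness of a to r."""
    return max(0.0, 1.0 - abs(a - r) / width)

def avg_siblings(training_labels):
    """Average number of sibling categories of each category in the training set.

    training_labels: list of sets, the categories assigned to each training document
    """
    total, count = {}, {}
    for labels in training_labels:
        for c in labels:
            total[c] = total.get(c, 0) + len(labels) - 1
            count[c] = count.get(c, 0) + 1
    return {c: total[c] / count[c] for c in total}

def select_r(scores, siblings, max_r, top=10):
    """Choose r in 1..max_r with the highest truth of 'Q B's are G's'."""
    cats = sorted(scores, key=scores.get, reverse=True)[:top]
    denom = sum(scores[c] for c in cats)

    def truth(r):
        if denom == 0:
            return 0.0
        num = sum(min(scores[c], close_to(siblings.get(c, 0.0), r)) for c in cats)
        return mu_most(num / denom)

    return max(range(1, max_r + 1), key=truth)

training_labels = [{"grain", "wheat"}, {"grain", "wheat", "corn"},
                   {"money"}, {"trade"}, {"grain", "wheat"}]
siblings = avg_siblings(training_labels)
scores = {"grain": 0.9, "wheat": 0.8, "corn": 0.3, "money": 0.1, "trade": 0.05}
r = select_r(scores, siblings, max_r=3)
```

The selected r would then be used as in RCut: the r categories with the highest matching scores are assigned to the document.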
COMPUTATIONAL EXPERIMENT

Text corpus used: Reuters-21578
7728 training documents, 3005 test documents and 114 categories
Stop-word removal and stemming applied
Term-space dimensionality reduced using
document and category frequency: 5565 index terms.
Matching        micro-averaging                macro-averaging           11-point
scheme      precision  recall   F1        precision  recall   F1        average
                                                                        precision
Method1     0.3914     0.8215   0.5302    0.4038     0.5322   0.4592    0.8311
Method2*    0.4226     0.6765   0.5203    0.3416     0.6174   0.4398    0.7673
Cosine      0.2226     0.6462   0.3311    0.1235     0.4943   0.1976    0.6511

* only terms weighted above 0.2 are considered in matching degree computation

Table 1: Comparison of matching schemes for T.II. thresholding
strategy

Thresh.         micro-averaging                macro-averaging           11-point
strategy    precision  recall   F1        precision  recall   F1        average
                                                                        precision
T.I.        0.5531     0.7765   0.6460    0.4891     0.4785   0.4837    0.8311
T.II.       0.3914     0.8215   0.5302    0.4038     0.5322   0.4592    0.8311
T.III.*     0.4642     0.7478   0.5728    0.4776     0.4309   0.4530    0.8311

Table 2: Comparison of different thresholding strategies for the
Method1 matching scheme