
# ON THE APPLICATION OF LINGUISTIC QUANTIFIERS FOR TEXT CATEGORIZATION

Systems Research Institute

Polish Academy of Sciences

Warsaw, POLAND

- Natural interpretation of various concepts of information retrieval in terms of fuzzy logic
- Multilabel text categorization solved via a similarity-based classifier
- Linguistic quantifier as a flexible aggregation operator for similarity assessment and thresholding strategy
## BASICS OF TEXT CATEGORIZATION (1)

Notation

T = {tj}, j = 1,…,M - a set of index terms

F: D × T → [0, 1] - assignment of weights to terms appearing in a document

C = {ci}, i = 1,…,S - a set of categories

Ξ: D × C → {0, 1} - assignment of categories to documents

D1 ⊂ D - a set of training text documents
## BASICS OF TEXT CATEGORIZATION (2)

Document representation

"a bag of words" (index terms)

vector space model:

di → [wi1,…,wiM],  wij = F(di,tj),  di ∈ [0,1]^M

F(di,tj) = (fij ∗ log(N/nj)) / max_j (fij ∗ log(N/nj))

fij - frequency of term tj in document di
N - number of all documents in the set D
nj - number of documents (in D) containing term tj

first factor - frequency of term tj in di (tf, term frequency)
second factor - inverted frequency (in the collection D) of documents in which term tj appears at least once (idf, inverted document frequency)
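The max-normalized tf×idf weighting above can be sketched as follows; the function name and dictionary-based interface are illustrative, not from the slides.

```python
import math

def tfidf_weights(doc_term_freqs, doc_freqs, n_docs):
    """Max-normalized tf-idf weights w_ij for one document.

    doc_term_freqs: {term: f_ij, raw frequency of the term in the document}
    doc_freqs:      {term: n_j, number of documents in D containing the term}
    n_docs:         N, total number of documents in the collection D
    """
    raw = {t: f * math.log(n_docs / doc_freqs[t])
           for t, f in doc_term_freqs.items()}
    top = max(raw.values())
    # Dividing by the largest tf-idf value keeps every weight in [0, 1]
    return {t: (v / top if top > 0 else 0.0) for t, v in raw.items()}
```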
## BASICS OF TEXT CATEGORIZATION (3)

Classification problem

Ξ: D1 × C → {0, 1} ⇒ Ξ: D × C → {0, 1}

Two phases:

- learning of classification rules (explicit or implicit; building a classifier) from examples of documents with known class assignment (supervised learning),
- classification of new documents using the derived rules.

Special cases

1. C = {c1, c2} (single class)
2. ∀d ∃ci Ξ(d, ci) = 1 ∧ ∀cj≠ci Ξ(d, cj) = 0 (single label)

Here: the multiclass, multilabel problem is considered

## ROCCHIO’S ALGORITHM

Learning phase

Computation of a centroid vector pi for each category ci as the component-wise average of the vectors of its documents:

P = {pi}, i = 1,…,S    pi = [ui1,…,uiM]

K is the number of documents of category ci
M is the number of terms used to index the documents

Classification phase

Selection of the category whose centroid is most similar to the document under consideration

Classically, similarity assessment via the cosine measure:

sim(d,p) = ∑_{i=1}^{M} wi·ui / ( √(∑_{i=1}^{M} wi²) · √(∑_{i=1}^{M} ui²) )

where d = [w1,…,wM] and p = [u1,…,uM].
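The classification phase can be sketched as below; `cosine` and `classify` are illustrative names, and centroids are assumed to be given as plain vectors.

```python
import math

def cosine(d, p):
    """Classical cosine similarity between document vector d and centroid p."""
    dot = sum(w * u for w, u in zip(d, p))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(u * u for u in p))
    return dot / norm if norm > 0 else 0.0

def classify(doc, centroids):
    """Return the category whose centroid is most similar to doc.

    centroids: {category: centroid vector}
    """
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```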

## PROPOSED ALGORITHM (1)

Centroids

uij = (fij ∗ log(S/nj)) / max_j (fij ∗ log(S/nj))

fij - frequency of term tj in all documents of category ci
nj - number of categories in whose documents term tj appears ("category frequency")

By analogy to tf×idf it may be called a tf×icf representation, where icf stands for inverted category frequency.

A document d to be classified is represented, as previously, by a vector:

d = [w1,…,wM]

wj = (fj ∗ log(S/nj)) / max_j (fj ∗ log(S/nj))

fj - frequency of term tj in the document d
nj - category frequency of this term
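A tf×icf centroid can be built by pooling the term frequencies of a category's training documents and applying the icf weighting; the function name and dict-based representation are assumptions made for illustration.

```python
import math
from collections import Counter

def category_centroid(docs_of_category, cat_freqs, n_categories):
    """Build the max-normalized tf-icf centroid for one category.

    docs_of_category: list of documents, each a {term: frequency} dict
    cat_freqs:        {term: n_j, number of categories in whose documents
                       the term appears}
    n_categories:     S, total number of categories
    """
    # f_ij: frequency of term t_j summed over all documents of the category
    pooled = Counter()
    for doc in docs_of_category:
        pooled.update(doc)
    raw = {t: f * math.log(n_categories / cat_freqs[t])
           for t, f in pooled.items()}
    top = max(raw.values())
    # Normalize by the largest value so that u_ij ∈ [0, 1]
    return {t: (v / top if top > 0 else 0.0) for t, v in raw.items()}
```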
## PROPOSED ALGORITHM (2)

As in the original algorithm, the classification decision is based on the similarity of a document and a centroid

"terms representing both a document and a centroid have comparable weights, relatively or absolutely"

The centroids are aggregates of many documents, thus one should not expect a match between a centroid and a document along all dimensions of their representation. It is more reasonable to require only a partial match, on the most important terms.

This may be formalized as follows using the calculus of linguistically quantified propositions:

"A document matches a category if most of the important terms of the document are present in the centroid of the category"
## CALCULUS OF LINGUISTICALLY QUANTIFIED PROPOSITIONS

Q_{x∈X} S(x)

"Most (Q) elements of a set X possess property S"

Q_{x∈X} (F(x), S(x))  or  Q_{x∈X} F(x) S(x)

"Most (Q) elements of a set X possessing property F possess also property S"

X = {x1, …, xN}

µS: X → [0,1],  µF: X → [0,1],  µQ: [0,1] → [0,1]

truth( Q_{x∈X} S(x) ) = µQ( (1/N) ∑_{i=1}^{N} µS(xi) )

truth( Q_{x∈X} (F(x), S(x)) ) = µQ( ∑_{i=1}^{N} (µF(xi) ∧ µS(xi)) / ∑_{i=1}^{N} µF(xi) )
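The second truth formula can be computed as below, with ∧ taken as min. The piecewise-linear membership function for "most" is a common textbook choice, not the exact µQ used in the slides, and the function names are illustrative.

```python
def most(y):
    """An assumed piecewise-linear µ_Q for the quantifier "most"."""
    if y <= 0.3:
        return 0.0
    if y >= 0.8:
        return 1.0
    return (y - 0.3) / 0.5

def truth_qfs(mu_f, mu_s, mu_q=most):
    """truth( Q x∈X (F(x), S(x)) ) = µ_Q( Σ min(µF, µS) / Σ µF ).

    mu_f, mu_s: lists of membership degrees over the same universe X
    """
    num = sum(min(f, s) for f, s in zip(mu_f, mu_s))
    den = sum(mu_f)
    return mu_q(num / den) if den > 0 else 0.0
```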
## PROPOSED ALGORITHM (3)

"A document matches a category if most of the important terms of the document are present in the centroid of the category"

X, the universe considered, is the set T of all index terms
B is a fuzzy set of terms important for the document d, i.e., µB(tj) = wj
G is a fuzzy set of terms present in centroid pi of category ci, i.e., µG(tj) = uij.

Then,

sim(d, ci) = truth( Q_{t∈T} (B(t), G(t)) ) = µQ( ∑_{j=1}^{M} min(wj, uij) / ∑_{j=1}^{M} wj )

We also tested a modified version where only terms weighted in the document higher than a certain threshold, or only, say, the 10 top-ranked terms, are considered.
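The matching degree, including the modified top-k restriction, can be sketched as below; `match_degree` and its signature are assumptions for illustration, with µQ passed in as a parameter.

```python
def match_degree(doc_weights, centroid_weights, mu_q, top_k=None):
    """Truth of "most important terms of the document are present in the
    centroid", per Zadeh's calculus.

    doc_weights:      {term: w_j}  (µ_B, importance for the document)
    centroid_weights: {term: u_ij} (µ_G, presence in the centroid)
    top_k:            if given, restrict to the top_k highest-weighted
                      document terms (the modified version)
    """
    terms = sorted(doc_weights, key=doc_weights.get, reverse=True)
    if top_k is not None:
        terms = terms[:top_k]
    num = sum(min(doc_weights[t], centroid_weights.get(t, 0.0)) for t in terms)
    den = sum(doc_weights[t] for t in terms)
    return mu_q(num / den) if den > 0 else 0.0
```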
## THRESHOLDING STRATEGY (1)

Many classifiers, including the one considered here, produce, for a document to be classified and each category, a matching degree expressing the extent to which the document possibly belongs to the given category.

For the single-label problem, the category with the highest matching degree is chosen

In the multilabel case a subset of categories has to be chosen somehow - a thresholding strategy is required

Classically:

- rank-based thresholding (RCut): choose the r top categories for each document
- proportion-based assignment (PCut): for "batch categorization" (a set/batch of documents has to be classified at once) - assigns to each category such a number of documents as to preserve the proportion of the cardinalities of categories in the training set
- score-based local optimization (SCut): assigns a document to a category only if the matching score of this category and document is higher than a threshold.
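The RCut and SCut strategies above are simple to sketch; function and parameter names are illustrative.

```python
def rcut(scores, r):
    """RCut: assign the r top-scoring categories to the document.

    scores: {category: matching degree}
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:r])

def scut(scores, thresholds):
    """SCut: assign every category whose score exceeds its own threshold.

    thresholds: {category: per-category threshold tuned on training data}
    """
    return {c for c, s in scores.items() if s > thresholds[c]}
```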
## THRESHOLDING STRATEGY (2)

Our proposed thresholding strategy (of the RCut type) may be intuitively expressed as follows:

"Select such a threshold r that most of the important categories had a number of sibling categories similar to r in the training data set"

For each r ∈ {1,…,R} compute the truth degree of the clause quoted above (R is a parameter), using Zadeh's calculus of linguistically quantified propositions:

X, the universe considered, is a subset of C: the 10 categories with the highest matching scores
B is a fuzzy set of categories important for a given document d, i.e., µB(ci) = sim(d, ci)
G is a fuzzy set of categories that, on average, had in the training set a number of sibling categories similar to r. This similarity is modeled by a similarity relation, which is another parameter of the method. By a sibling category of a category ci we mean a category assigned to the same document as ci.
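A minimal sketch of selecting r this way is given below. The `sibling_truth` callback stands in for the similarity relation applied to each category's average sibling count in the training set; its signature, like the other names here, is an assumption made for illustration.

```python
def choose_r(scores, sibling_truth, mu_q, r_max, top_n=10):
    """Pick r ∈ {1..r_max} maximizing the truth of "most of the important
    categories had a number of sibling categories similar to r".

    scores:        {category: sim(d, c)}  (µ_B, importance for the document)
    sibling_truth: function (category, r) -> [0, 1], degree to which the
                   category's average sibling count in the training set
                   is similar to r (the similarity relation is a parameter
                   of the method)
    mu_q:          membership function of the quantifier "most"
    """
    # Restrict the universe to the top_n highest-scoring categories
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    best_r, best_truth = 1, -1.0
    for r in range(1, r_max + 1):
        num = sum(min(scores[c], sibling_truth(c, r)) for c in top)
        den = sum(scores[c] for c in top)
        truth = mu_q(num / den) if den > 0 else 0.0
        if truth > best_truth:
            best_r, best_truth = r, truth
    return best_r
```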
## COMPUTATIONAL EXPERIMENT

Text corpus used: Reuters-21578
7728 training documents, 3005 test documents, 114 categories
Stop-word removal and stemming done
Term-space dimensionality reduction using document and category frequency: 5565 index terms

| Matching scheme | Precision (micro) | Recall (micro) | F1 (micro) | Precision (macro) | Recall (macro) | F1 (macro) | 11-point avg. precision |
|---|---|---|---|---|---|---|---|
| Method1 | 0.3914 | 0.8215 | 0.5302 | 0.4038 | 0.5322 | 0.4592 | 0.8311 |
| Method2* | 0.4226 | 0.6765 | 0.5203 | 0.3416 | 0.6174 | 0.4398 | 0.7673 |
| Cosine | 0.2226 | 0.6462 | 0.3311 | 0.1235 | 0.4943 | 0.1976 | 0.6511 |

\* only terms weighted above 0.2 are considered in the matching degree computation

Table 1: Comparison of matching schemes for the T.II thresholding strategy

| Thresholding strategy | Precision (micro) | Recall (micro) | F1 (micro) | Precision (macro) | Recall (macro) | F1 (macro) | 11-point avg. precision |
|---|---|---|---|---|---|---|---|
| T.I. | 0.5531 | 0.7765 | 0.6460 | 0.4891 | 0.4785 | 0.4837 | 0.8311 |
| T.II. | 0.3914 | 0.8215 | 0.5302 | 0.4038 | 0.5322 | 0.4592 | 0.8311 |
| T.III.* | 0.4642 | 0.7478 | 0.5728 | 0.4776 | 0.4309 | 0.4530 | 0.8311 |

Table 2: Comparison of different thresholding strategies for the Method1 matching scheme