
Feature Selection with Conditional Mutual Information Maximin in Text Categorization

Department of Computer Science,
Hong Kong University of Science and Technology
Outline
• Introduction
• Information Theory Review
• Conditional Mutual Information Maximin (CMIM)
• Experimental Results
• Conclusion
Introduction
• Text Categorization pipeline
  – Preprocessing
  – Feature selection (important)
    • filter method
    • wrapper method
    • embedded method
  – Training the classifier
  – Testing
Classic Feature Selection Methods
• Ranking criteria
  – Information gain
  – Mutual information
  – χ² test
• Drawback
  – Ignores the relationships among features (each feature is scored independently)

Example document-term counts (w1 and w2 carry identical information, yet an independent ranking criterion would select both):

  Documents   w1   w2   w3   class
  d1           2    2    0   c1
  d2           1    1    0   c1
  d3           0    0    0   c2
  d4           0    0    2   c2
Information Theory Review
• Entropy: H(X)
  H(X) = - ∑ p(x) log p(x) = - E[ log p(X) ]
• Mutual Information: I(X;Y)
  I(X;Y) = E[ log ( p(x,y) / ( p(x) p(y) ) ) ]
• Conditional MI: I(X;Y|Z)
  I(X;Y|Z) = E[ log ( p(x,y|z) / ( p(x|z) p(y|z) ) ) ]
(Venn diagram relating H(X), H(Y), H(X|Y), I(X;Y), and I(X;Y|Z))
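As a concrete illustration of these definitions, here is a minimal Python sketch that estimates them from discrete samples using plug-in (frequency) counts; the function names and the estimation approach are ours, not part of the original slides.

    # Minimal plug-in estimators for the quantities above, assuming discrete
    # variables given as equal-length sequences of observed values.
    from collections import Counter
    from math import log2

    def entropy(xs):
        # H(X) = -sum_x p(x) log2 p(x), with p estimated from sample frequencies
        n = len(xs)
        return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

    def mutual_information(xs, ys):
        # Uses the identity I(X;Y) = H(X) + H(Y) - H(X,Y)
        return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

    def conditional_mutual_information(xs, ys, zs):
        # Uses I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
        return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
                - entropy(list(zip(xs, ys, zs))) - entropy(zs))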
Conditional Mutual Information
• Mutual information as a ranking criterion:
  – I(Fi; C): how much feature Fi alone tells us about category C
  – I(F1,…,Fk; C): intractable to estimate when k is large
• Chain rule: I(F1,…,Fk; C) = I(F1,…,Fk-1; C) + I(Fk; C | F1,…,Fk-1)
  Since I(Fk; C | F1,…,Fk-1) >= 0,
  it follows that I(F1,…,Fk; C) >= I(F1,…,Fk-1; C)
• Another route, derived from the joint mutual information (JMI):
  find the best Fk by maximizing I(Fk; C | F1,…,Fk-1)
  – But this conditional MI is just as intractable as the JMI itself
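The chain rule above can be checked numerically with the estimators sketched earlier; the tiny six-document sample below is invented purely for illustration.

    # Toy numerical check of the chain rule, reusing the estimators above
    f1 = [1, 1, 0, 0, 1, 0]               # binary occurrence of word 1
    f2 = [1, 0, 0, 1, 1, 0]               # binary occurrence of word 2
    c  = ['a', 'a', 'b', 'b', 'a', 'b']   # category labels

    joint = mutual_information(list(zip(f1, f2)), c)     # I(F1,F2; C)
    chain = mutual_information(f1, c) + conditional_mutual_information(f2, c, f1)
    assert abs(joint - chain) < 1e-12     # I(F1,F2;C) = I(F1;C) + I(F2;C|F1)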
Intuition of CMIM Algorithm
• Approximate the full CMI with minima over smaller and smaller conditioning sets:
  I(F*; C | F1,…,Fk) ≈ min over size-(k-1) subsets of I(F*; C | Fi,…,Fj)
                     ≈ min over size-(k-2) subsets of I(F*; C | Fi,…,Fj)
                     ≈ ……
                     ≈ min over pairs of I(F*; C | Fi, Fj)
                     ≈ min over single features of I(F*; C | Fi)
CMIM Algorithm
Input:  n – the number of features to select
        v – the total number of features
Output: F – the set of selected features

1. Set F to be empty
2. m = 1
3. Add Fi to F, where Fi = argmax_{i=1..v} I(Fi; C)
4. Repeat
5.   m = m + 1
6.   Add Fi to F, where Fi = argmax_{i=1..v, Fi ∉ F} { min_{Fj ∈ F} I(Fi; C | Fj) }
7. Until m = n
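For concreteness, here is our reading of this pseudocode as runnable Python, reusing the estimators from the earlier sketch; it is a sketch, not the authors' implementation.

    # `features` is a list of v discrete feature columns (e.g. binarized word
    # occurrences); `labels` is the category column; `n` is how many to select.
    def cmim_select(features, labels, n):
        v = len(features)
        selected = []                      # the set F of chosen feature indices
        remaining = set(range(v))
        # Step 3: seed F with the feature maximizing I(Fi; C)
        first = max(remaining, key=lambda i: mutual_information(features[i], labels))
        selected.append(first)
        remaining.discard(first)
        # Steps 4-7: add the candidate whose worst-case conditional MI,
        # min over Fj in F of I(Fi; C | Fj), is largest
        while len(selected) < n and remaining:
            best = max(remaining,
                       key=lambda i: min(conditional_mutual_information(
                           features[i], labels, features[j]) for j in selected))
            selected.append(best)
            remaining.discard(best)
        return selected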
Experiment Setup
• Dataset:
– WebKB: 4199 pages, 4 categories
– NewsGroups: 20000 pages, 10 categories
• Feature selection criterion:
– CMIM
– Information gain (IG)
• Classifier:
– Naïve Bayes
– Support vector machine
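The slides do not say which implementations were used; the hypothetical sketch below shows one way the selected features could be fed to the two classifiers named above, assuming scikit-learn's MultinomialNB and LinearSVC and a precomputed list of selected feature indices (all names here are ours).

    # Hypothetical evaluation sketch (not the authors' code).
    # X_train / X_test: document-term count matrices; y_train / y_test: labels;
    # `selected`: feature indices from a CMIM-style selector.
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    def evaluate(X_train, y_train, X_test, y_test, selected):
        Xtr, Xte = X_train[:, selected], X_test[:, selected]
        scores = {}
        for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC())]:
            clf.fit(Xtr, y_train)
            scores[name] = accuracy_score(y_test, clf.predict(Xte))
        return scores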
Results for WebKB
(Plots: micro-averaged and macro-averaged accuracy of IG vs. CMIM for the SVM and NB classifiers)
Results for WebKB
• Confusion matrices based on IG and CMIM, using the SVM classifier.
Results for Newsgroups
(Plots: micro-averaged and macro-averaged accuracy of IG vs. CMIM for the SVM and NB classifiers)
Discussion
• Feature size
  Acc_CMIM >> Acc_IG when the number of selected features is small
• Category number
  Acc_CMIM >> Acc_IG when the number of categories is small
• Category deviation
  MicroAcc_CMIM >> MicroAcc_IG when category sizes are highly skewed
Conclusion
• Approximates the joint conditional mutual information with simple triplets I(Fi; C | Fj)
• The CMIM algorithm reduces the correlation (redundancy) among the selected features
• Complexity is O(NV³)
