Text Classification Paper 2
on Imbalanced Data
ABSTRACT

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC) and odds ratio (OR) are considered most effective. CC and OR are one-sided metrics while IG and CHI are two-sided. Feature selection using one-sided metrics selects the features most indicative of membership only, while feature selection using two-sided metrics implicitly combines the features most indicative of membership (e.g. positive features) and non-membership (e.g. negative features) by ignoring the signs of features. The former never considers the negative features, which are quite valuable, while the latter cannot ensure the optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicit control of that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both great potential and actual merits of explicitly combining positive and negative features in a nearly optimal fashion according to the imbalanced data.

1. INTRODUCTION

Feature selection has been applied to text categorization in order to improve its scalability, efficiency and accuracy. Since each document in the collection can belong to multiple categories, the classification problem is usually split into multiple binary classification problems with respect to each category. Accordingly, features are selected locally per category, i.e. local feature selection.

A number of feature selection metrics have been explored, notable among which are Information Gain (IG), Chi-square (CHI), Correlation Coefficient (CC), and Odds Ratio (OR) [8; 9; 10; 12; 15]. CC and OR are one-sided metrics which select the features most indicative of membership for a category only, while IG and CHI are two-sided metrics, which consider the features most indicative of either membership (e.g. positive features) or non-membership (e.g. negative features). A feature selection metric is considered one-sided if its positive and negative values correspond to positive and negative features respectively. On the other hand, a two-sided metric is non-negative, with the signs of features ignored.

One choice in the feature selection policy is whether to rule out all negative features. Some argue that classifiers built from positive features only may be more transferable to new situations where the background class varies. Others believe that negative features are numerous, given the imbalanced data set, and quite valuable in practical experience. Their experiments show that when deprived of negative features, the performance of all feature selection metrics degrades, which indicates negative features are essential to high quality classification [3]. We think that negative features are useful because their presence in a document highly indicates its non-relevance. Therefore, they help to confidently reject non-relevant documents.

The focus in this paper is to answer the following three questions with empirical evidence:

• How sub-optimal are two-sided metrics?

• To what extent can the performance be improved by a better combination of positive and negative features?

• How can the optimal combination be learned in practice?

The first two questions are concerned with the potential of optimal combination of positive and negative features, and the last with a practical solution.

Imbalanced datasets are commonly encountered in text categorization problems, especially in the binary setting. A two-sided metric implicitly combines the positive and negative features by simply ignoring the signs of features and comparing their values. However, the values of positive features are not comparable with those of negative features due to the imbalanced data. Furthermore, different performance measures attach different weights to positive and negative features: the optimal size ratio between the two kinds of features should also depend on the performance measure. Two-sided metrics cannot ensure the optimal combination of the positive and negative features. However, neither one-sided nor two-sided metrics themselves allow control of the combination.
2. (t, c̄i): presence of t and non-membership in ci.

3. (t̄, ci): absence of t and membership in ci.

4. (t̄, c̄i): absence of t and non-membership in ci.

where: t and ci represent a term and a category respectively. The frequencies of the four tuples in the collection are denoted by A, B, C and D respectively. The first and last tuples represent the positive dependency between t and ci, while the other two represent the negative dependency.

Information gain (IG) Information gain [12; 15] measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. The information gain of term t and category ci is defined to be:

    IG(t, c_i) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c) \cdot \log \frac{P(t', c)}{P(t') \cdot P(c)}

Information gain is also known as Expected Mutual Information. The Expected Likelihood Estimation (ELE) smoothing technique was used in this paper to handle singularities when estimating those probabilities.

Chi-square (CHI) Chi-square measures the lack of independence between a term t and a category ci and can be compared to the chi-square distribution with one degree of freedom to judge extremeness [12; 15]. It is defined as:

    \chi^2(t, c_i) = \frac{N \, [P(t, c_i) P(\bar{t}, \bar{c}_i) - P(t, \bar{c}_i) P(\bar{t}, c_i)]^2}{P(t) P(\bar{t}) P(c_i) P(\bar{c}_i)}

where: N is the total number of documents.

Correlation coefficient (CC) Correlation coefficient of a word t with a category ci was defined by Ng et al. as [12; 10]

According to the definitions, OR considers the first two dependency tuples, and IG, CHI, and CC consider all four tuples. CC and OR are one-sided metrics, whose positive and negative values correspond to the positive and negative features respectively. On the other hand, IG and CHI are two-sided, whose values are non-negative. We can easily obtain that the sign for a one-sided metric, e.g. CC or OR, is sign(AD − BC).

A one-sided metric can be converted to its two-sided counterpart by ignoring the sign, while a two-sided metric can be converted to its one-sided counterpart by recovering the sign, e.g. CHI vs. CC.

We propose the two-sided counterpart of OR, namely OR-square, and the one-sided counterpart of IG, namely signed IG, as follows.

OR-square (ORS) and Signed IG (SIG)

    ORS(t, c_i) = OR^2(t, c_i)
    SIG(t, c_i) = \mathrm{sign}(AD - BC) \cdot IG(t, c_i)

The overall feature selection procedure is to score each potential feature according to a particular feature selection metric, and then take the best features. Feature selection using one-sided metrics like SIG, CC, and OR picks out the terms most indicative of membership only. The basic idea behind this is that the features coming from non-relevant documents are useless. These metrics never consider negative features unless all the positive features have already been selected. Feature selection using two-sided metrics like IG, CHI, and ORS, however, does not differentiate between the positive and negative features; the two kinds are implicitly combined.
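To make the definitions above concrete, here is a small sketch (our own illustration, not the authors' code) that computes the metrics for a single term-category pair from the tuple counts A, B, C and D. It assumes ELE smoothing is add-0.5 on the counts, takes OR as the log odds ratio log(AD/BC), and recovers CC from CHI by restoring the sign, as discussed above.

    import math

    def feature_scores(A, B, C, D, smooth=0.5):
        """Feature selection metrics for one (term t, category c_i) pair.

        A, B, C, D are the document counts of the tuples
        (t, c_i), (t, not c_i), (not t, c_i), (not t, not c_i).
        ELE smoothing (here: add 0.5) avoids zero probabilities.
        """
        a, b, c, d = (x + smooth for x in (A, B, C, D))
        n = a + b + c + d
        # Joint and marginal probability estimates.
        p_tc, p_tnc, p_ntc, p_ntnc = a / n, b / n, c / n, d / n
        p_t, p_nt = (a + b) / n, (c + d) / n
        p_c, p_nc = (a + c) / n, (b + d) / n

        # IG: expected mutual information over the four tuples.
        ig = sum(pxy * math.log(pxy / (px * py))
                 for pxy, px, py in [(p_tc, p_t, p_c), (p_tnc, p_t, p_nc),
                                     (p_ntc, p_nt, p_c), (p_ntnc, p_nt, p_nc)])
        # CHI: chi-square statistic with one degree of freedom.
        chi = n * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2 / (p_t * p_nt * p_c * p_nc)
        sign = 1.0 if a * d - b * c >= 0 else -1.0
        odds = math.log((a * d) / (b * c))   # one-sided OR taken as the log odds ratio
        return {"IG": ig, "CHI": chi, "CC": sign * math.sqrt(chi),
                "OR": odds, "ORS": odds ** 2, "SIG": sign * ig}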
3. THE IMBALANCED DATA PROBLEM

The imbalanced data problem occurs when the training examples are unevenly distributed among different classes. In case of binary classification, the number of examples in one
1. select the positive features only using one-sided metrics, e.g. SIG, CC, and OR. For convenience, we will use CC as the representative of this group.

2. implicitly combine the positive and negative features using two-sided metrics, e.g. IG, CHI, and ORS. CHI will be chosen to represent this group.

The two groups are two special cases of our feature selection framework. The standard feature selection using CC corresponds to the case where ℑ = CC, l1/l = 1. The standard method using CHI corresponds to the case where ℑ = CC, and l1/l is implicitly set as follows. Considering CHI, we have Fi = Max[χ²(·, ci), l]. The positive subset of Fi is Fi' = {t ∈ Fi | CC(t, ci) > 0}. The feature set Fi can be equivalently obtained as a combination of the terms most indicative of membership and non-membership:

    F_i^+ = Max[CC(·, c_i), |F_i'|];
    F_i^- = Min[CC(·, c_i), l − |F_i'|];
    F_i = F_i^+ ∪ F_i^-

where Max[ℑ(·, ci), l] and Min[ℑ(·, ci), l] represent the l features with highest and smallest ℑ values respectively.

Note that Fi' ≡ Fi+. This illustrates that the standard feature selection using CHI is a special case of the framework, where l1 = |Fi'| is internally decided by the size of feature set l given the data set.

Accordingly, a set of new methods corresponding to ℑ = SIG, CC and OR is proposed for each of the two scenarios, where l1/l ∈ (0, 1] is empirically chosen per category according to the scenario. The first set of methods is referred to as improved SIG, CC and OR in the ideal scenario, while the other set is improved SIG, CC and OR in the practical scenario.

The efficient implementation of optimization within the framework is as follows (a code sketch follows the list):

• select the l positive features with greatest ℑ(t, ci), in decreasing order;

• select the l negative features with smallest ℑ(t, ci), in increasing order;

• empirically choose the size ratio l1/l such that the feature set constructed by combining the first l1, 0 < l1 < l, positive features and the first l − l1 negative features has the optimal performance.

Therefore, during the optimization of the size ratio l1/l for each category, we did not conduct feature selection for each possible l1/l, but once only.
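A minimal sketch of this one-pass optimization, under our own naming (rank_features, combine, best_size_ratio, and an evaluate callback standing in for training and scoring a classifier); it illustrates the procedure described above and is not the authors' implementation:

    def rank_features(scores):
        """scores: dict term -> one-sided metric value (e.g. CC).

        Returns (positives, negatives): terms sorted by decreasing score
        and by increasing score respectively, computed once per category.
        """
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked, ranked[::-1]

    def combine(positives, negatives, l, l1):
        """Explicitly combine the first l1 positive and l - l1 negative features."""
        return positives[:l1] + negatives[:l - l1]

    def best_size_ratio(scores, l, evaluate):
        """Sweep l1/l and keep the combination with the best performance.

        evaluate(features) -> a performance value (e.g. F1 of a classifier
        trained with this feature set); supplied by the caller.
        """
        positives, negatives = rank_features(scores)   # ranked once only
        best = max(range(1, l + 1),
                   key=lambda l1: evaluate(combine(positives, negatives, l, l1)))
        return best / l, combine(positives, negatives, l, best)

In this sketch, the difference between the ideal and practical scenarios would show up only in what the evaluate callback measures.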
5. EXPERIMENTAL SETUP

In order to determine the usefulness of the new feature selection methods, we conduct experiments on standard text categorization data using two popular classifiers: naïve Bayes and logistic regression. For naïve Bayes, a document d is scored for category ci as:

    Score(d, c_i) = \frac{\log P(c_i) + \sum_{f_j} \log P(f_j \mid c_i)}{\log P(\bar{c}_i) + \sum_{f_j} \log P(f_j \mid \bar{c}_i)}
Table 2: As table 1, but for macro-averaged F1(BEP). The macro-averaged F1(BEP) for NB without feature selection is .483
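The tables and figures report F1 at the precision-recall break-even point (BEP), aggregated in two ways. The sketch below is a generic illustration of the two averaging schemes, not the paper's code; it shows how micro- and macro-averaged F1 differ when computed from per-category contingency counts.

    def f1(tp, fp, fn):
        """F1 from true positives, false positives and false negatives."""
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    def micro_macro_f1(per_category):
        """per_category: list of (tp, fp, fn) tuples, one per category.

        Micro-averaging pools the counts of all categories (so it is
        dominated by the most common categories); macro-averaging gives
        every category the same weight regardless of its size.
        """
        micro = f1(sum(tp for tp, _, _ in per_category),
                   sum(fp for _, fp, _ in per_category),
                   sum(fn for _, _, fn in per_category))
        macro = sum(f1(*counts) for counts in per_category) / len(per_category)
        return micro, macro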
The methods are compared to each other within the same group at different sizes of features. Typical size of a local feature set is between 10 and 50 [10]. In this paper, the performance is reported over a much wider range: 10 ∼ 3000.

Tables 1 and 2 list the micro and macro averaged BEP F1 values for naïve Bayes classifiers with the nine different feature selection methods (as listed in the first row) at different sizes of feature set ranging from 10 to 3000 (as listed in the first column); Tables 3 and 4 list the micro and macro averaged BEP F1 values for regularized logistic regression classifiers. The best micro averaged F1 across different sizes of feature set is highlighted for each method. We can see the improved methods in the ideal scenario significantly outperform the corresponding one-sided and two-sided methods, which indicates the great potential of optimal combination of positive and negative features.

From tables 1 and 2, we can see it is very useful to conduct feature selection for NB (the micro-averaged F1 for NB without feature selection is .641). On the other side, tables 3 and 4 show that standard feature selection, e.g. IG, SIG, CHI, CC, ORS and OR, will not improve LR's performance (the micro-averaged F1 for LR without feature selection is .766), which confirms the general consensus that standard feature selection will not help the regularized linear methods, e.g. SVMs, LR and ridge regression. However, the best performance of the improved methods in the ideal scenario (iSIG: .81, iCC: .812, and iOR: .816) shows that the improved feature selection methods can be very helpful to LR.

Table 4: As table 3, but for macro-averaged F1(BEP). The macro-averaged F1(BEP) for LR without feature selection is .676

Figure 2 shows the size ratios implicitly decided by the two-sided metrics IG, CHI and ORS (feature size = 50), which confirms that feature selection using a two-sided metric is similar to its one-sided counterpart (size ratio = 1) when the feature size is small. When using CHI, only positive features are considered for 56 categories (equivalent to CC) and only one negative feature is included for each of the remaining two categories (3rd and 7th).

Figure 3 visualizes the optimization for the first two (money-fx and grain) and last two categories (dmk and lumber) using NB, where ℑ = CC, feature size = 50. Figures 4 and 5 show the optimal size ratios for the 58 categories using NB and LR respectively. Both are quite different from figure 2, which confirms that implicit combination using two-sided metrics is not optimal. The 58 categories are ordered by the numbers of their training documents. Intuitively, given the fixed size of the feature set, 50 in this case, the optimal size ratio should decrease as category id increases. We can see in figures 4 and 5 that the optimal size ratios fluctuate between 0 and 1 from category to category irregularly, though the general trend confirms the intuition. Therefore, besides the
number of positive examples, category (domain) characteristics also have effects on optimal feature selection. Our results also show the optimal size ratios learned by NB with

Figure 3: BEP F1 for test over the first two and last two categories (out of 58 categories) at different l1/l values (NB, ℑ = CC, feature size = 50). The optimal size ratios for money-fx, grain, dmk and lumber are 0.95, 0.95, 0.1 and 0.2 respectively.

      20    .73    .673   .734   .672
      30    .719   .657   .741   .679
      40    .737   .676   .754   .691
      50    .74    .68    .758   .696
     100    .738   .673   .763   .694
     200    .717   .664   .769   .712
     500    .714   .633   .77    .702
    1000    .693   .593   .772   .704
    2000    .687   .591   .769   .696
    3000    .686   .584   .771   .693

Table 6: BEP F1 values of NB and LR for the two most common categories: 1st and 2nd. iCC and iCC' represent ideal and practical scenarios respectively, feature size = 50
            NB                           LR
            CHI    CC     iCC    iCC'    CHI    CC     iCC    iCC'
    earn    .957   .914   .958   .956    .971   .959   .974   .97
    acq     .871   .819   .917   .907    .897   .858   .926   .926

Figure 4: Optimal size ratios of iSIG, iCC and iOR (NB, 58 categories: 3rd-60th, feature size = 50)

Figure 5: As figure 4, but for LR

tion (.766). This indicates that with the improved feature selection methods such as iCC, NB is competitive with state-of-the-art classification methods such as LR.

Obviously, the efficiency of the improved methods mainly depends on the classification method used to learn the optimal size ratios. For fast algorithms such as NB, linear regression, etc., the optimization is reasonably efficient. Since the optimization of the size ratio for one category is independent of other categories, parallel computing can be performed for time-consuming classifiers, e.g. kNN, neural networks, SVMs, LR, etc., with many features (feature size > 1000). Possible further work includes an analytic solution to finding optimal size ratios based on the category characteristics.
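Since the per-category searches are independent, the optimization can be parallelized directly; a sketch using Python's standard multiprocessing, reusing the hypothetical best_size_ratio helper from the earlier sketch:

    from multiprocessing import Pool

    def optimize_category(task):
        """Optimize the size ratio l1/l for one category (see best_size_ratio above)."""
        scores, l, evaluate = task
        return best_size_ratio(scores, l, evaluate)

    def optimize_all_categories(tasks, workers=8):
        """tasks: one (scores, l, evaluate) tuple per category.

        The per-category searches are independent, so they can run in
        parallel worker processes.  With multiprocessing the evaluate
        callables must be picklable, i.e. module-level functions.
        """
        with Pool(workers) as pool:
            return pool.map(optimize_category, tasks)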
6.3 Additional results

In order to investigate the effect of our new methods on balanced data, we report in table 6 the performance values for the two most common categories, earn and acq, separately. For earn, no performance gain is observed by applying our new methods. This is due to the well-balanced nature of this category: around 2/5 of its training examples are positive. However, the performance of acq was significantly improved. This can be partially explained by the fact that this category is more imbalanced than earn: only around 1/5 of its training examples are positive.
Figure 6: Micro-averaged BEP F1 values for CC, CHI, iCC in practical scenario, and iCC in ideal scenario (NB, all 90 categories)

Figure 7: As figure 6, but for LR
In order to compare this work with others, we also show in figures 6 and 7 the micro-averaged BEP F1 values for the second group of feature selection methods on all 90 categories. We can see in figure 6 that the micro-averaged F1 over all 90 categories for NB using CHI (with 2000 features) is 80.4%. The micro averaged F1(BEP) of LR without feature selection is .854, as shown in figure 7. Both are consistent with previously published results [14; 16; 17]. The micro averaged F1(BEP) score of LR is slightly lower than [17] due to a difference in the treatment of document titles: [17; 16] treated words in document titles as different words from those in document bodies, while we consider them equivalently.

As we expect, the improved methods consistently outperform standard CHI and CC. However, since the micro-averaged F1 over all 90 categories is dominated by the most common category, which happens to be well-balanced, the improvement is not as impressive. We can also see in figures 6 and 7 that CHI is always better than CC even when the feature size is small. For the two most common categories, CHI and CC will choose features quite differently, since the CHI values of positive and negative features are comparable on balanced data. In that case, CHI will select useful negative features as well, even when the feature size is small. Another interesting point in figure 6 is that as the number of features increases, the performance of CC first increases, then decreases, and finally increases again. The first increase is due to the inclusion of more useful positive features. The afterward decrease is due to the inclusion of noisy or non-indicative (1) positive and (2) negative features. The final increase is because of the inclusion of useful negative features. In contrast with figure 6, figure 7 does not show a noticeable performance drop for CHI and CC, because, due to the regularization factor of LR, the inclusion of noisy features has much less impact on LR than on NB.

7. CONCLUSION

In order to investigate the usefulness of experimenting with the combination of positive and negative features, a novel feature selection framework was presented, in which the positive and negative features are separately selected and explicitly combined. We explored three special cases of the framework:

1. consider the positive features only by using one-sided metrics;

2. implicitly combine the positive and negative features by using two-sided metrics;

3. combine the two kinds of features explicitly and choose the size ratio empirically such that optimal performance is obtained.

The first two cases are known and standard, and the last one is new. The main conclusions are:

• Implicitly combining positive and negative features using two-sided metrics is not necessarily optimal, especially on imbalanced data.

• A judicious combination shows great potential and practical merits.

• A good feature selection method should take into consideration the data set, performance measure, and classification methods.

• Feature selection can significantly improve the performance of both naïve Bayes and regularized logistic regression on imbalanced data.

Furthermore, we observe that multinomial naïve Bayes with our proposed feature selection is competitive with state-of-the-art algorithms such as regularized logistic regression.

8. ACKNOWLEDGEMENTS

We would like to thank Dr. David Pierce, Dr. Miguel Ruiz, Mr. Wei Dai, and the anonymous reviewers for valuable comments and suggestions. Thanks to Mr. Jian Zhang and Dr. Tong Zhang for discussions about the implementation of logistic regression and evaluation issues.

9. REFERENCES

[1] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management, pages 148–155, 1998.

[2] T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316, 1997.