Feature Selection and Feature Extraction for Text Categorization
David D. Lewis
Center for Information and Language Studies
University of Chicago
Chicago, IL 60637
The indexing language used to represent texts influences how easily and effectively a text categorization system can be built, whether the system is built by human engineering, statistical training, or a combination of the two. The simplest indexing languages are formed by treating each word as a feature. However, words have properties, such as synonymy and polysemy, that make them a less than ideal indexing language. These have motivated attempts to use more complex feature extraction methods in text retrieval and text categorization tasks.

If a syntactic parse of text is available, then features can be defined by the presence of two or more words in particular syntactic relationships. We call such a feature a syntactic indexing phrase. Another strategy is to use cluster analysis or other statistical methods to detect closely related features. Groups of such features can then, for instance, be replaced by a single feature corresponding to their logical or numeric sum. This strategy is referred to as term clustering.

Syntactic phrase indexing and term clustering have opposite effects on an indexing language: phrases are less ambiguous than words but have lower frequency and a higher degree of synonymy, while term clustering counteracts synonymy by grouping related terms.

The second data set consisted of 1,500 documents from the U.S. Foreign Broadcast Information Service (FBIS) that had previously been used in the MUC-3 evaluation of natural language processing systems [2]. The documents are mostly translations from Spanish to English, and include newspaper stories, transcripts of broadcasts, communiques, and other material.

The MUC-3 task required extracting simulated database records ("templates") describing terrorist incidents from these texts. Eight of the template slots had a limited number of possible fillers, so a simplification of the MUC-3 task is to view filling these slots as text categorization. There were 88 combinations of these 8 slots and legal fillers for the slots, and each was treated as a binary category. Other text categorization tasks can be defined for the MUC-3 data (see Riloff and Lehnert in this volume).

We used for our test set the 200 official MUC-3 test documents, plus the first 100 training documents (DEV-MUC3-0001 through DEV-MUC3-0100). Templates for these 300 documents were encoded by the MUC-3 organizers. We used the other 1,200 MUC-3 training documents (encoded by 16 different MUC-3 sites) as our categorization training documents. Category assignments should be quite consistent on our test set, but less so on our training set.

3. Categorization Method

The statistical model used in our experiments was proposed by Fuhr [5] for probabilistic text retrieval, but the adaptation to text categorization is straightforward. Figure 1 shows the formula used. The model allows the possibility that the values of the binary features for a document are not known with certainty, though that aspect of the model was not used in our experiments.

3.1. Binary Categorization

In order to compare text categorization output with an existing manual categorization we must replace probability estimates with explicit binary category assignments. Previous work on statistical text categorization has often ignored this step, or has not dealt with the case where documents can have zero, one, or multiple correct categories.

Given accurate estimates of P(Cj = 1 | Dm), decision theory tells us that the optimal strategy, assuming all errors have equal cost, is to set a single threshold p and assign Cj to a document exactly when P(Cj = 1 | Dm) >= p [6]. However, as is common in probabilistic models for text classification tasks, the formula in Figure 1 makes assumptions about the independence of probabilities which do not hold for textual data. The result is that the estimates of P(Cj = 1 | Dm) can be quite inaccurate, as well as inconsistent across categories and documents.

We investigated several strategies for dealing with this problem and settled on proportional assignment [4]. Each category is assigned to its top scoring documents on the test set, in a designated multiple of the percentage of documents it was assigned to on the training corpus. Proportional assignment is not very satisfactory from a theoretical standpoint, since the probabilistic model is supposed to already take into account the prior probability of a category. In tests the method was found to perform as well as a standard decision tree induction method, however, so it is at least a plausible strategy. We are continuing to investigate other approaches.

3.2. Feature Selection

A primary concern of ours was to examine the effect of feature set size on text categorization effectiveness. All potential features were ranked for each category by expected mutual information [7] between assignment of that feature and assignment of that category. The top k features for each category were chosen as its feature set, and different values of k were investigated.

4. Indexing Languages

We investigated phrasal and term clustering methods only on the Reuters collection, since the smaller amount of text made the MUC-3 corpus less appropriate for clustering experiments. For the MUC-3 data set a single indexing language consisting of 8,876 binary features was tested, corresponding to all words occurring in 2 or more training documents. The original MUC-3 text was all capitalized. Stop words were not removed.

For the Reuters data we adopted a conservative approach to syntactic phrase indexing. The phrasal indexing language consisted only of simple noun phrases, i.e. head nouns and their immediate premodifiers. Phrases were formed using parts, a stochastic syntactic class tagger and simple noun phrase bracketing program [8].

- WORDS-DF2: Starts with all words tokenized by parts. Capitalization and syntactic class ignored. Stopwords discarded based on syntactic tags. Tokens consisting solely of digits and punctuation removed. Words occurring in fewer than 2 training documents removed. Total terms: 22,791.

- WC-MUTINFO-135: Starts with WORDS-DF2, and discards words occurring in fewer than 5 or more than 1029 (7%) training documents. RNN clustering used 135 metafeatures with value equal to mutual information between presence of the word and presence of a manual indexing category. Result is 1,442 clusters and 8,506 singlets, for a total of 9,948 terms.

- PHRASE-DF2: Starts with all simple noun phrases bracketed by parts. Stopwords removed from phrases based on tags. Single word phrases discarded. Numbers replaced with the token NUMBER. Phrases occurring in fewer than 2 training documents removed. Total terms: 32,521.

- PC-W-GIVEN-C-44: Starts with PHRASE-DF2. Phrases occurring in fewer than 5 training documents removed. RNN clustering uses 44 metafeatures with value equal to our estimate of P(W = 1 | C = 1) for phrase W and category C. Result is 1,883 clusters and 1,852 singlets, for a total of 3,735 terms.

Figure 2: Summary of indexing languages used with the Reuters data set.
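The per-category ranking of features by expected mutual information can be sketched as follows. This is our own minimal implementation: the contingency-table formulation, natural-log base, zero-cell handling, and all function names are assumptions, not details taken from the paper.

```python
import math

def expected_mutual_info(n11, n10, n01, n00):
    """Expected mutual information between a binary feature W and a
    binary category C, from a 2x2 contingency table of document counts:
    n11 = docs with W and C, n10 = W without C, n01 = C without W,
    n00 = neither."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    for nwc, nw, nc in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nwc > 0:  # empty cells contribute nothing in the limit
            total += (nwc / n) * math.log((nwc * n) / (nw * nc))
    return total

def top_k_features(tables, k):
    """Rank candidate features for one category by expected mutual
    information; tables maps feature name -> (n11, n10, n01, n00)."""
    ranked = sorted(tables, key=lambda w: expected_mutual_info(*tables[w]),
                    reverse=True)
    return ranked[:k]

# A feature perfectly correlated with the category outranks a feature
# that appears in every document regardless of category.
tables = {"acquisition": (40, 0, 0, 60), "the": (40, 60, 0, 0)}
print(top_k_features(tables, 1))  # → ['acquisition']
```

A feature present in every document has zero expected mutual information with any category, which is why uninformative high-frequency words fall to the bottom of such a ranking.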
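The proportional assignment rule described above can be sketched as follows, for a single category over one test set. The function name, the rounding rule, and tie-breaking by sort order are our own choices, not specified in the paper.

```python
def proportional_assign(scores, train_proportion, multiple=1.0):
    """Assign a category to its top-scoring test documents.  The number
    assigned is a designated multiple of the proportion of training
    documents that carried the category.  `scores` maps document id ->
    estimated P(C=1|D) for this category."""
    k = round(multiple * train_proportion * len(scores))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# The category covered 20% of the training corpus, so with multiple=1.0
# it is assigned to the top 20% of these 10 test documents.
scores = {f"doc{i}": i / 10 for i in range(10)}
print(proportional_assign(scores, train_proportion=0.2))
# the two highest-scoring documents, doc9 and doc8
```

Raising `multiple` above 1.0 trades precision for recall, which is how a recall-precision curve can be traced for a category whose probability estimates are only reliable as a ranking.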
We estimate P(Cj = 1 | Dm) by:

    P(Cj = 1 | Dm) ∝ P(Cj = 1) × Π over i of [ P(Wi = 1 | Cj = 1) P(Wi = 1 | Dm) + P(Wi = 0 | Cj = 1) P(Wi = 0 | Dm) ]

P(Wi = 1 | Cj = 1) is the probability that feature Wi is assigned to a document given that we know category Cj is assigned to that document. P(Wi = 0 | Cj = 1) is 1 - P(Wi = 1 | Cj = 1).

P(Wi = 1 | Dm) is the probability that feature Wi is assigned to document Dm, and P(Wi = 0 | Dm) is 1 - P(Wi = 1 | Dm).

All probabilities were estimated from the training corpus using the "add one" adjustment (the Jeffreys prior).

Figure 1: Estimation of the probability of category membership.
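As a concrete sketch, the estimation and scoring just described might be implemented as follows, assuming binary feature values known with certainty (the case actually used in the experiments). The unnormalized product form, the application of add-one smoothing to the conditional probabilities only, and all names are our simplifications, not the exact Figure 1 model.

```python
def add_one(count, total):
    """'Add one' adjustment for a binary event: (count+1)/(total+2)."""
    return (count + 1) / (total + 2)

def category_score(doc_features, category_docs, all_docs, vocabulary):
    """Score proportional to P(Cj=1|Dm) under the independence
    assumption, for a document whose binary feature values are known
    with certainty.  category_docs: feature sets of training documents
    assigned category Cj; all_docs: feature sets of all training docs."""
    n_c = len(category_docs)
    score = n_c / len(all_docs)  # prior P(Cj = 1)
    for w in vocabulary:
        n_wc = sum(1 for d in category_docs if w in d)
        p = add_one(n_wc, n_c)   # estimate of P(Wi = 1 | Cj = 1)
        score *= p if w in doc_features else (1 - p)
    return score

train_pos = [{"oil", "barrel"}, {"oil", "opec"}]
train_all = train_pos + [{"wheat", "export"}, {"wheat", "tonnes"}]
vocab = {"oil", "barrel", "opec", "wheat", "export", "tonnes"}
s_in = category_score({"oil", "opec"}, train_pos, train_all, vocab)
s_out = category_score({"wheat", "export"}, train_pos, train_all, vocab)
print(s_in > s_out)  # → True
```

Because the product is unnormalized and the independence assumption is violated in real text, such scores are useful mainly for ranking documents within a category, which is consistent with the thresholding strategies discussed in Section 3.1.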
Figure 4: Microaveraged breakeven points for feature sets of words on Reuters and MUC-3 test sets, and on Reuters training set.

It may be that the inevitable errors in estimating probabilities from a sample are more harmful when a feature is less strongly associated with a category.

6.2. Feature Extraction

The best results we obtained for each of the four basic representations on the Reuters test set are shown in Figure 5. Individual terms in a phrasal representation have, on the average, a lower frequency of appearance than terms in a word-based representation. So, not surprisingly, effectiveness of a phrasal representation peaks at a much higher feature set size (around 180 features) than that of a word-based representation (see Figure 6). More phrases are needed simply to make any distinctions among documents. Maximum effectiveness of the phrasal representation is also substantially lower than that of the word-based representation. Low frequency and high degree of synonymy outweigh the advantages phrases have in lower ambiguity.

Figure 5: Microaveraged recall and precision on Reuters test set for WORDS-DF2 words (10 features), WC-MUTINFO-135 word clusters (10 features), PHRASE-DF2 phrases (180 features), and PC-W-GIVEN-C-44 phrase clusters (90 features).

Figure 6: Microaveraged breakeven point for various sized feature sets of words and phrases on Reuters test set.

Disappointingly, as shown in Figure 5, term clustering did not significantly improve the quality of either a word-based or phrasal representation. Figure 7 shows some representative PC-W-GIVEN-C-44 phrase clusters. (The various abbreviations and other oddities in the phrases were present in the original text.) Many of the relationships captured in the clusters appear to be accidental rather than the systematic semantic relationships hoped for.

    '8 investors service inc, < amo >
    NUMBER accounts, slate regulators
    NUMBER elections, NUMBER engines
    federal reserve chairman paul volcker, private consumption
    additional NUMBER dlrs, america >
    canadian bonds, cme board
    denmark NUMBER, equivalent price
    fund government-approved equity investments, fuji bank led
    its share price, new venture
    new policy, representative offices
    same-store sales, santa rosa

Figure 7: Some representative PC-W-GIVEN-C-44 clusters.

Why did phrase clustering fail? In earlier work on the CACM collection [3], we identified lack of training data as a primary impediment to high quality cluster formation. The Reuters corpus provided approximately 1.5 million phrase occurrences, a factor of 25 more than CACM. Still, it remains the case that the amount of data was insufficient to measure the distributional properties of many phrases encountered.
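The microaveraged measures reported in Figures 4-6 pool binary decisions over all category/document pairs before computing recall and precision. A minimal sketch of these standard definitions (our own implementation; the function name is invented):

```python
def microaverage(decisions):
    """Microaveraged recall and precision.  `decisions` is a list of
    (assigned, correct) booleans, one per category/document pair,
    pooled across all categories."""
    tp = sum(1 for a, c in decisions if a and c)
    fp = sum(1 for a, c in decisions if a and not c)
    fn = sum(1 for a, c in decisions if not a and c)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# 3 true positives, 1 false positive, 1 false negative pooled over all
# pairs gives recall 3/4 and precision 3/4.
pairs = [(True, True)] * 3 + [(True, False), (False, True), (False, False)]
print(microaverage(pairs))  # → (0.75, 0.75)
```

The breakeven point plotted in Figures 4 and 6 is then the value at which microaveraged recall and precision are equal, found by varying how many assignments each category makes.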
The definition of metafeatures is a key issue to reconsider. Our original reasoning was that, since phrases have low frequency, we should use metafeatures corresponding to bodies of text large enough that we could expect cooccurrences of phrases within them. The poor quality of the clusters formed suggests that this approach is not effective. The use of such coarse-grained metafeatures simply gives many opportunities for accidental cooccurrences to arise, without providing a sufficient constraint on the relationship between phrases (or words). The fact that clusters captured few high quality semantic relationships, even when an extremely conservative clustering method was used, suggests that using other clustering methods with the same metafeature definitions is not likely to be effective.

Finally, while phrases are less ambiguous than words, they are not all good content indicators. Even restricting phrase formation to simple noun phrases we see a substantial number of poor content indicators, and the impact of these is compounded when they are clustered with better content indicators.

7. Future Work

A great deal of research remains in developing text categorization methods. New approaches to setting appropriate category thresholds, estimating probabilities, and selecting features need to be investigated. For practical systems, combinations of knowledge-based and statistical approaches are likely to be the best strategy.

On the text representation side, we continue to believe that forming groups of syntactic indexing phrases is an effective route to better indexing languages. We believe the key will be supplementing statistical evidence of phrase similarity with evidence from thesauri and other knowledge sources, along with using metafeatures which provide tighter constraints on meaning. Clustering of words and phrases based on syntactic context is a promising approach (see Strzalkowski in this volume). Pruning out of low quality phrases is also likely to be important.

8. Summary

We have shown a statistical classifier trained on manually categorized documents to achieve quite effective performance in assigning multiple, overlapping categories to documents. We have also shown, via studying text categorization effectiveness, a variety of properties of indexing languages that are difficult or impossible to measure directly in text retrieval experiments, such as effects of feature set size and performance of phrasal representations in isolation from word-based representations.

Like text categorization, text retrieval is a text classification task. The results shown here for text categorization, in particular the ineffectiveness of term clustering with coarse-grained metafeatures, are likely to hold for text retrieval as well, though further experimentation is necessary.

9. Acknowledgments

Thanks to Bruce Croft, Paul Utgoff, Abe Bookstein and Marc Ringuette for helpful discussions. This research was supported at U Mass Amherst by grant AFOSR-90-0110, and at the U Chicago by Ameritech. Many thanks to Phil Hayes, Carnegie Group, and Reuters for making available the Reuters data, and to Ken Church and AT&T for making available parts.

References

1. Hayes, P. and Weinstein, S. CONSTRUE/TIS: a system for content-based indexing of a database of news stories. In IAAI-90, 1990.
2. Sundheim, B., ed. Proceedings of the Third Message Understanding Evaluation and Conference, Morgan Kaufmann, Los Altos, CA, May 1991.
3. Lewis, D. and Croft, W. Term clustering of syntactic phrases. In ACM SIGIR-90, pp. 385-404, 1990.
4. Lewis, D. Representation and Learning in Information Retrieval. PhD thesis, Computer Science Dept., Univ. of Mass., Amherst, MA, 1992. Technical Report 91-93.
5. Fuhr, N. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55-72, 1989.
6. Duda, R. and Hart, P. Pattern Classification and Scene Analysis. Wiley-Interscience, New York, 1973.
7. Hamming, R. Coding and Information Theory. Prentice-Hall, Englewood Cliffs, NJ, 1980.
8. Church, K. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied NLP, pp. 136-143, 1988.
9. Lewis, D. Evaluating text categorization. In Speech and Natural Language Workshop, pp. 312-318, Feb. 1991.
10. Fuhr, N., et al. AIR/X: a rule-based multistage indexing system for large subject fields. In RIAO 91, pp. 606-623, 1991.
11. Lewis, D. Data extraction as text categorization: An experiment with the MUC-3 corpus. In Proceedings MUC-3, May 1991.