Abstract: In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained from both the principal components and class profile-based features (CPBF). A fixed number of regular words from each class is used as feature vectors together with the reduced features from the PCA. These feature vectors are then used as the input to the neural networks for classification. The experimental evaluation demonstrates that the WPCM provides acceptable classification accuracy with the sports news datasets.
SICE02-0163
we have used the PCA algorithm to reduce the original data vectors to a small number of relevant features. Then we combine these features with the CPBF before inputting them to the neural networks for classification, as shown in Fig. 1.

$TF_{jk}$ is the number of times the distinct word $w_k$ occurs in document $Doc_j$, where $k = 1, \ldots, m$. $Doc_j$ refers to each web page document that exists in the news database, where $j = 1, \ldots, n$. The calculation of the term weight $x_{jk}$ of each word $w_k$ is done by using the method of Salton [10], which is given by $x_{jk} = TF_{jk} \times idf_k$. The document frequency $df_k$ is the total number of documents in the database that contain the word $w_k$, and the inverse document frequency is $idf_k = \log(n / df_k)$, where $n$ is the total number of documents in the database.
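The term weighting just described can be sketched in a few lines of Python. This is an illustrative toy only: the documents and words below are hypothetical, not taken from our dataset.

```python
import math

def tfidf_weights(docs):
    """Compute x_jk = TF_jk * idf_k with idf_k = log(n / df_k),
    following the Salton weighting described above."""
    n = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # document frequency df_k: number of documents that contain word w_k
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    weights = []
    for doc in docs:
        row = {}
        for w in vocab:
            tf = doc.count(w)                   # TF_jk
            row[w] = tf * math.log(n / df[w])   # x_jk = TF_jk * idf_k
        weights.append(row)
    return weights

# hypothetical toy documents (lists of tokenized words)
docs = [["lewis", "tyson", "boxing"], ["tyson", "knockout"], ["cycling", "race"]]
x = tfidf_weights(docs)
```

Note that a word occurring in every document gets weight zero, since $\log(n/n) = 0$, which matches the intent of the inverse document frequency.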
matrix is represented as $e = [u_1, u_2, \ldots, u_m]$. A diagonal nonzero eigenvalue matrix is represented as $\Lambda$. In order to get the principal components of matrix $S$, we will perform the eigenvalue decomposition given by $S = e \Lambda e^T$. Then we select the first $d \le m$ eigenvectors, where $d$ is the desired value, e.g., 100, 200, 400, 600, etc. The set of principal components is represented as $Y_1 = e_1^T x,\; Y_2 = e_2^T x,\; \ldots,\; Y_d = e_d^T x$.

The global entropy weight of each term is given by

$$G_k = 1 + \frac{1}{\log n} \sum_{j=1}^{n} \frac{TF_{jk}}{F_k} \log \frac{TF_{jk}}{F_k} \qquad (4)$$

where $n$ is the number of documents in the collection, $TF_{jk}$ is the term frequency of each word in $Doc_j$, and $F_k$ is the frequency of term $k$ in the entire document collection. We have selected $R = 50$ words that have the highest entropy value to be added to the first $d$ components from the PCA as the input to the neural networks for classification.
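The two feature extractors above can be sketched numerically as follows. This is a minimal sketch under stated assumptions: the data is synthetic, and $S$ is taken to be the covariance matrix of the term axes of a small document-term matrix `X` (the paper does not fix these details here).

```python
import numpy as np

def pca_components(X, d):
    """Eigenvalue decomposition S = e Lambda e^T of the covariance
    matrix S, keeping the first d eigenvectors (largest eigenvalues).
    Returns the projections Y_i = e_i^T x for every document row x."""
    S = np.cov(X, rowvar=False)        # m x m covariance of the term axes
    lam, e = np.linalg.eigh(S)         # eigh: S is symmetric
    order = np.argsort(lam)[::-1]      # sort eigenvalues in descending order
    e_d = e[:, order[:d]]              # first d eigenvectors
    return X @ e_d                     # n x d principal components

def entropy_weights(TF):
    """Global entropy weight of Eq. (4):
    G_k = 1 + (1/log n) * sum_j p_jk log p_jk, with p_jk = TF_jk / F_k.
    TF is an n x m term-frequency matrix; assumes n > 1 documents."""
    n, m = TF.shape
    F = TF.sum(axis=0)                              # F_k over the collection
    p = TF / np.where(F > 0, F, 1.0)                # p_jk, guarding F_k = 0
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    return 1.0 + plogp.sum(axis=0) / np.log(n)
```

A term concentrated in a single document gets $G_k = 1$ (the maximum), while a term spread uniformly over all documents gets $G_k = 0$, which is why the $R = 50$ highest-entropy words are the most class-specific ones.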
Fig. 2 Accumulated proportion of principal components generated by the PCA.

Fig. 3 The classification accuracy with a combination of the PCA and CPBF feature vectors.
4.1. Feature Selection Using the CPBF

For the feature selection using the class profile-based approach, we have manually identified the most regular words that exist in each category and weighted them using the entropy weighting scheme before adding them to the feature vectors that have been selected from the PCA. For example, the words that exist regularly in a boxing class are 'Lewis', 'Tyson', 'heavyweight', 'fighter', 'knockout', etc. Then a fixed number of regular words from each class is used as feature vectors together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The entropy weighting scheme on each term is calculated as $L_{jk} \times G_k$, where $L_{jk}$ is the local weighting of term $k$ and $G_k$ is the global weighting of term $k$, given by the entropy formula in Eq. (4).

4.2. Input Data to the Neural Networks

After the preprocessing of the news web pages, a vocabulary that contains all the unique words in the news database has been created. We have limited the number of unique words in the vocabulary to 1,800, as the number of distinct words is large. Each word in the vocabulary represents one feature vector. Each feature vector contains the document-term weights. The high dimensionality of the feature vectors as an input to the neural networks is not practical due to poor scalability and performance. Therefore, the PCA has been used to reduce the original feature vectors ($m = 1800$) into a small number of principal components. In our case, we have selected a few values of $d$ (e.g., 100, 200, 300, 400, 500, and 600) together with $R = 50$ features selected from the CPBF approach, because this combination performs better for web news classification compared to other parameters as the input to the neural networks. The loading factor graph for the accumulated proportion of eigenvalues is shown in Fig. 2. The value of $d = 50$ contributes 81.6% of the proportion of the original feature vectors. The neural network parameters for the NN-PCA and the WPCM methods are shown in Table 1. We have selected 600 features from the PCA and 50 features from the CPBF method. We have found that this combination gives the best classification accuracy, as shown in Fig. 3.

Table 1 The error back-propagation neural network parameters that have been used in the experiments.

The classification score of a class is given by

$$score(cs, Doc_j) = P(cs) \prod_{k} P(w_k \mid cs) \qquad (5)$$

where $P(cs)$ is the prior probability of class $cs$, and $P(w_k \mid cs)$ is the likelihood of a word $w_k$ in a class $cs$ that is estimated on the labeled training documents. A given web page document is then classified into the class that maximizes the classification score. If the scores for all the classes in the sports category are less than a given threshold, then the document is considered unclassified. The Rainbow toolkit has been used for the classification of the training and test news web pages using the Bayesian and TF-IDF methods [13].
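The scoring and rejection-threshold step can be sketched as follows. The priors, likelihoods, threshold, and the smoothing floor for unseen words are all hypothetical illustrations, not values or APIs from the Rainbow toolkit; scores are computed in log space for numerical stability.

```python
import math

def classify(doc_words, priors, likelihoods, threshold):
    """Score each class by log P(cs) + sum_k log P(w_k | cs)
    (the log form of Eq. (5)) and return the argmax class, or None
    ("unclassified") when every score falls below the threshold."""
    scores = {}
    for cs, prior in priors.items():
        s = math.log(prior)
        for w in doc_words:
            # unseen words get a tiny floor probability (a smoothing assumption)
            s += math.log(likelihoods[cs].get(w, 1e-6))
        scores[cs] = s
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# hypothetical estimates from labeled training documents
priors = {"boxing": 0.5, "cycling": 0.5}
likelihoods = {
    "boxing":  {"tyson": 0.4, "knockout": 0.3, "race": 0.01},
    "cycling": {"tyson": 0.01, "knockout": 0.01, "race": 0.5},
}
label = classify(["tyson", "knockout"], priors, likelihoods, threshold=-20.0)
```

Raising the threshold makes the classifier more conservative: the same document is rejected as unclassified instead of being forced into its best-scoring class.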
The F1 measure combines precision and recall as $F_1 = \dfrac{2 \times precision \times recall}{precision + recall}$.

Table: per-class precision (%), recall (%), and F1 (%). Class 1: 97.14, 68.00, 80.00; class 2: 86.96, 100.00, 93.02 (the remaining rows and the expert-system contingency table are not recoverable).
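As a quick consistency check, the F1 values in the table rows above follow directly from their precision and recall columns:

```python
def f1(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    return 2 * precision * recall / (precision + recall)

# class 1 from the table: precision 97.14%, recall 68.00%
print(round(f1(97.14, 68.00), 2))   # -> 80.0
# class 2 gives ~93.03, i.e. the table's 93.02 up to rounding
print(round(f1(86.96, 100.00), 2))
```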
increase the classification accuracy. The rugby and cycling classes (class numbers 8 and 3) contain a small number of documents in the dataset. The F1 measures using the PCA approach for the rugby and cycling classes are 100% and 73% respectively, as shown in Table 5. But with the WPCM approach the classification accuracies for both classes are 98.80% and 96.15% respectively. These results indicate that the WPCM approach is able to classify the documents correctly although the number of documents representing the class is small. Also, by using the WPCM approach we have overcome the limitation of the PCA on supervised data, where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components.

6. Conclusions

We have presented a new approach to web news classification using the WPCM. The feature selection for the WPCM is based on the combination of the PCA and CPBF. The comparison of classification accuracy using the PCA, WPCM, TF-IDF, and Bayesian approaches has been presented in this paper. The WPCM approach has been applied to classify the Yahoo sports news web pages. The experimental evaluation with different classification algorithms demonstrates that this method provides acceptable classification accuracy with the sports news datasets. Also, by using the WPCM approach we have overcome the limitation of the PCA on supervised data, where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components. Although the classification accuracy using the WPCM approach is high in comparison with the other approaches, the time taken for training is relatively long compared with the other methods.

7. References

[1] S. Wermter, Neural network agents for learning semantic text classification, Information Retrieval, Vol. 3, No. 2, pp. 87-103, 2000.
[2] S. L. Y. Lam and D. L. Lee, Feature reduction for neural network based text categorization, Proceedings of the 6th International Conference on Database Systems for Advanced Applications, 19-22 April, Hsinchu, Taiwan, 1999.
[3] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 34(1), 2002.
[4] G. Karypis and E.-H. Han, Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval & categorization, CIKM 2000, 2000.
[5] S. T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments and Computers, 23(2), 229-236, 1991.
[6] Yahoo Web Pages, (http://www.yahoo.com), 2001.
[7] H. Mase and H. Tsuji, Experiments on automatic web page categorization for information retrieval system, Journal of Information Processing, IPSJ Journal, pp. 334-347, Feb. 2001.
[8] R. Calvo, M. Partridge, and M. Jabri, A comparative study of principal components analysis techniques, In Proc. Ninth Australian Conf. on Neural Networks, Brisbane, QLD, pp. 276-281, 1998.
[9] A. Selamat, S. Omatu, and H. Yanagimoto, Information retrieval from the internet using mobile agent search system, International Conference of Information Technology and Multimedia 2001 (ICIMU), Universiti Tenaga Nasional (UNITEN), Malaysia, August 13-15, 2001.
[10] G. Salton and M. J. McGill, Introduction to modern information retrieval, McGraw-Hill, New York, USA, 1983.
[11] K. Sparck Jones and P. Willett, Readings in information retrieval, Morgan Kaufmann, San Francisco, USA, 1997.
[12] T. Joachims, Probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Proceedings of the International Conference on Machine Learning (ICML), 1997.
[13] A. K. McCallum, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, (http://www.cs.cmu.edu/~mccallum/bow), 1996.