
TM19-1 SICE02-0163

Web News Classification Using Neural Networks Based on PCA

Ali Selamat, Hidekazu Yanagimoto and Sigeru Omatu
Division of Computer and Systems Sciences,
Engineering Department, Osaka Prefecture University,
Sakai, Osaka 599-8531, Japan.
aselamat@sig.cs.osakafu-u.ac.jp, {hidekazu, omatu}@cs.osakafu-u.ac.jp

Abstract: In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained from both the principal components and class profile-based features (CPBF). A fixed number of regular words from each class is used as feature vectors together with the reduced features from the PCA. These feature vectors are then used as the input to the neural networks for classification. The experimental evaluation demonstrates that the WPCM provides acceptable classification accuracy with the sports news datasets.

Keywords: text categorization, WWW, profile-based neural networks, principal component analysis.

1. Introduction

Neural networks have been widely applied by many researchers to classify text documents with different types of feature vectors. Wermter has used the document title as the feature vector for document categorization [1]. Lam et al. have used the principal component analysis (PCA) method as a feature reduction technique on the input data to the neural networks [2]. However, if some original terms are particularly good at discriminating a class category, the discrimination power may be lost in the new vector space after using Latent Semantic Indexing (LSI), as described by Sebastiani [3]. Retrieval techniques based on dimensionality reduction, such as LSI, have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. A limitation of PCA or LSI on supervised data is that the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction. In this paper, the PCA, TF-IDF, Bayesian, and WPCM methods have been used as a benchmark test for the classification accuracy. The experimental evaluation demonstrates that the proposed method provides acceptable classification accuracy with the sports news datasets. The organization of this paper is as follows: The news classification using the WPCM is described in Chapter 2. The preprocessing of web pages is explained in Chapter 3. The comparisons of web page classification using the PCA, TF-IDF, Bayesian, and WPCM methods are discussed in Chapter 4. In Chapter 5, the discussions of the classification results using different methods for the sports news web pages are stated. In Chapter 6, we conclude on the classification accuracy obtained by using the WPCM compared with other methods.

2. Web Page Classification Method



In the WPCM method, we have used the PCA algorithm to reduce the original data vectors to a small number of relevant features [8]. Then we combine these features with the CPBF before inputting them to the neural networks for classification, as shown in Fig. 1.

Fig. 1 The classification process of a news web page using the WPCM method.

3. Preprocessing of Web Pages

The classification process of a news web page using the WPCM method is shown in Fig. 1. It consists of a web news retrieval process, stemming and stopping processes, a feature reduction process using our proposed method, and a web classification process using error back-propagation neural networks. The retrieval of sports news web pages has been done by our software agent during night-time [9]. Only the latest sports news web pages category will be retrieved from the Yahoo web server on the WWW. Then these web pages are stored in the local news database. After that, the stemming and stopping processes are applied to the terms that exist in each document. Stopping is a process of removing the most frequent words that exist in a web page document, such as 'to', 'and', 'it', etc. Removing these words saves space for storing the document contents and reduces the time taken during the search process. Stemming is a process of extracting each word from a web page document by reducing it to a possible root word. For example, the words 'compares', 'compared', and 'comparing' have a similar meaning to the word 'compare'. We have used the Porter stemming algorithm to select only 'compare' to be used as the root word in a web page document [11]. Each term in the document is then represented by its term frequency. The term frequency $TF_{jk}$ is the number of times the distinct word $w_k$ occurs in document $Doc_j$, where $k = 1, \ldots, m$. $Doc_j$ refers to each web page document that exists in the news database, where $j = 1, \ldots, n$. The weight $x_{jk}$ of each word $w_k$ is calculated using a method that has been used by Salton [10], which is given by $x_{jk} = TF_{jk} \times idf_k$. The document frequency $df_k$ is the total number of documents in the database that contain the word $w_k$, and the inverse document frequency is $idf_k = \log(n / df_k)$, where $n$ is the total number of documents in the database.
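To make the weighting concrete, the following minimal Python sketch (an illustration, not the authors' code; the whitespace tokenizer and the tiny stop-word list are simplifying assumptions, and Porter stemming is omitted) computes the weights $x_{jk} = TF_{jk} \times \log(n/df_k)$ for a toy collection:

```python
import math
from collections import Counter

STOP_WORDS = {"to", "and", "it", "the", "a", "of", "in"}  # tiny illustrative stop list

def tokenize(text):
    # Lowercase, split on whitespace, and drop stop words.
    # (The paper also applies Porter stemming at this stage; omitted here.)
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_weights(documents):
    """Return one {word: x_jk} dict per document Doc_j."""
    n = len(documents)
    tf = [Counter(tokenize(doc)) for doc in documents]
    # df_k: number of documents that contain word w_k
    df = Counter()
    for counts in tf:
        df.update(counts.keys())
    return [{w: counts[w] * math.log(n / df[w]) for w in counts}
            for counts in tf]

docs = ["Lewis wins the heavyweight fight",
        "Tyson loses the heavyweight title",
        "skiing season opens in Sapporo"]
for j, weights in enumerate(tfidf_weights(docs)):
    print(j, weights)
```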

4. Feature Reduction Using the PCA

Suppose that we have $A$, which is the matrix of document-term weights, as below:

$$A = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$

where $x_{jk}$ is the term weight that exists in the collection of documents. The definitions of $j, k, l, m$ and $n$ have been described in the previous paragraph. There are a few steps to be followed in order to calculate the principal components of the data matrix $A$. The mean of the $m$ variables in the data matrix $A$ is calculated as $\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}$. After that, the covariance matrix $S = (s_{kl})$ is calculated, where the covariance is given by $s_{kl} = \frac{1}{n}\sum_{j=1}^{n}(x_{jk} - \bar{x}_k)(x_{jl} - \bar{x}_l)$ with $j = 1, \ldots, n$. Then we determine the eigenvalues and eigenvectors of the covariance matrix $S$, which is a real symmetric positive matrix. An eigenvalue $\lambda$ and a nonzero vector $e$ can be found such that $Se = \lambda e$, where $e$ is an eigenvector of $S$. In order to find a nonzero vector $e$, the characteristic equation $|S - \lambda I| = 0$ must be solved. If $S$ is an $m \times m$ matrix of full rank, $m$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ can be found. By using $(S - \lambda I)e = 0$, all the corresponding eigenvectors can be found. The eigenvalues and corresponding eigenvectors are sorted so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. The eigenvector matrix is represented as $e = [e_1, e_2, \ldots, e_m]$, and the diagonal matrix of the nonzero eigenvalues is represented as $\Lambda = diag(\lambda_1, \lambda_2, \ldots, \lambda_m)$. In order to get the principal components of the matrix $S$, we perform the eigenvalue decomposition given by $S = e\Lambda e^T$. Then we select the first $d \leq m$ eigenvectors, where $d$ is the desired value, e.g., 100, 200, 400, 600, etc. The set of principal components is represented as $Y_1 = e_1^T x, Y_2 = e_2^T x, \ldots, Y_d = e_d^T x$.
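The steps above can be summarized in a short NumPy sketch (an illustration with toy dimensions, not the authors' implementation):

```python
import numpy as np

def pca_reduce(A, d):
    """Project the n x m document-term matrix A onto its first d principal components."""
    # Mean of each of the m term variables.
    x_bar = A.mean(axis=0)
    centered = A - x_bar
    # Covariance matrix S (m x m): real, symmetric, positive semi-definite.
    S = centered.T @ centered / A.shape[0]
    # Eigen-decomposition S = e Lambda e^T; eigh returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]   # sort so lambda_1 >= ... >= lambda_m
    e_d = eigvecs[:, order[:d]]         # keep the first d eigenvectors
    return centered @ e_d               # principal components Y_1..Y_d per document

A = np.random.rand(20, 8)   # toy 20-document, 8-term weight matrix
Y = pca_reduce(A, d=3)
print(Y.shape)              # (20, 3)
```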

Fig. 2 Accumulated proportion of principal components generated by the PCA.

Fig. 3 The classification accuracy with a combination of the PCA and CPBF feature vectors.
4.1. Feature Selection Using the CPBF

For the feature selection using the class profile-based approach, we have manually identified the most regular words that exist in each category and weighted them using an entropy weighting scheme before adding them to the feature vectors that have been selected from the PCA. For example, the words that exist regularly in a boxing class are 'Lewis', 'Tyson', 'heavyweight', 'fighter', 'knockout', etc. Then a fixed number of regular words from each class is used as feature vectors together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The entropy weighting scheme on each term is calculated as $L_{jk} \times G_k$, where $L_{jk}$ is the local weighting of term $k$ and $G_k$ is the global weighting of term $k$. The $L_{jk}$ and $G_k$ are given by $L_{jk} = \log(TF_{jk} + 1)$ and

$$G_k = 1 + \frac{\sum_{j=1}^{n} \frac{TF_{jk}}{F_k} \log\left(\frac{TF_{jk}}{F_k}\right)}{\log n} \qquad (4)$$

where $n$ is the number of documents in the collection and $TF_{jk}$ is the term frequency of each word in $Doc_j$. $F_k$ is the frequency of term $k$ in the entire document collection. We have selected $R = 50$ words that have the highest entropy value to be added to the first $d$ components from the PCA as an input to the neural networks for classification.
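A compact sketch of this weighting follows (illustrative only; the local weight $L_{jk} = \log(TF_{jk} + 1)$ is the standard log-entropy choice assumed here):

```python
import numpy as np

def entropy_weights(TF):
    """Compute L_jk * G_k for an n x m term-frequency matrix TF.

    G_k follows Eq. (4): G_k = 1 + sum_j (p_jk * log p_jk) / log n,
    with p_jk = TF_jk / F_k; the local weight log(TF_jk + 1) is assumed.
    """
    n, m = TF.shape
    F = TF.sum(axis=0)                        # F_k: collection frequency of term k
    p = np.where(TF > 0, TF / np.maximum(F, 1), 0.0)
    # Treat 0 * log(0) as 0 for terms absent from a document.
    plogp = np.where(p > 0, p * np.log(p), 0.0)
    G = 1.0 + plogp.sum(axis=0) / np.log(n)   # global entropy weight per term
    L = np.log(TF + 1.0)                      # local log weight per document-term
    return L * G                              # broadcast G_k across documents

TF = np.array([[3, 0, 1], [0, 2, 1], [1, 1, 1]], dtype=float)
print(entropy_weights(TF))
```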
4.2. Input Data to the Neural Networks

After the preprocessing of news web pages, a vocabulary that contains all the unique words in the news database has been created. We have limited the number of unique words in the vocabulary to 1,800 as the number of distinct words is big. Each of the words in the vocabulary represents one feature vector. Each feature vector contains the document-term weights. The high dimensionality of the feature vectors as an input to the neural networks is not practical due to poor scalability and performance. Therefore, the PCA has been used to reduce the original feature vectors, $m = 1800$, into a small number of principal components. In our case, we have selected a few values of $d$ (e.g., 100, 200, 300, 400, 500, and 600) together with $R = 50$ features selected from the CPBF approach, because this combination performs better for web news classification compared to other parameter settings as input to the neural networks.

The loading factor graph for the accumulated proportion of eigenvalues is shown in Fig. 2. The value of $d = 50$ contributes 81.6% of the proportion from the original feature vectors. The neural network parameters for the NN-PCA and the WPCM methods are shown in Table 1. We have selected 600 features from the PCA and 50 features from the CPBF methods. We have found that this combination is the best in getting the accuracy of the classification, as shown in Fig. 3.

Table 1 The error back-propagation neural network parameters that have been used in the experiments.

NN Parameters
…

The parameters of the error back-propagation neural networks are shown in Table 1. The number of inputs to the neural networks is 650, with 50 hidden units and 12 output units. The trial-and-error approach has been used to select the appropriate number of hidden units, which gives the highest document classification accuracy. The number of neural network outputs is based on the number of classes that exist in the sports news database, as shown in Table 2.

Table 2 The number of documents that have been used for classification.

Class No. | Class        | Number of documents
1         | baseball     | …
2         | boxing       | 210
3         | cycling      | …
4         | football     | …
5         | golf         | …
6         | hockey       | 856
7         | motor-sports | …
8         | rugby        | …
9         | skiing       | 233
10        | soccer       | …
11        | swimming     | …
12        | tennis       | …
          | Total        | 4096
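A minimal sketch of such a network follows, using scikit-learn's MLPClassifier as a stand-in for the authors' error back-propagation implementation (the random 650-dimensional inputs stand in for the concatenated 600 PCA components and 50 CPBF features):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-ins: Y is the 600 PCA components, Z the 50 CPBF entropy features.
rng = np.random.default_rng(0)
n_docs = 200
Y = rng.random((n_docs, 600))          # reduced principal components
Z = rng.random((n_docs, 50))           # class profile-based features
X = np.hstack([Y, Z])                  # 650 inputs per document
labels = rng.integers(0, 12, n_docs)   # 12 sports news classes

# One hidden layer of 50 units, trained by back-propagation (SGD).
net = MLPClassifier(hidden_layer_sizes=(50,), activation='logistic',
                    solver='sgd', learning_rate_init=0.1, max_iter=500)
net.fit(X, labels)
print(net.predict(X[:5]))
```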
4.3. The TF-IDF Method

The TF-IDF classifier is based on the relevance feedback algorithm using the vector space model [11]. The algorithm represents documents as vectors so that documents with similar contents have similar vectors. Each component of a vector corresponds to a term in the document, typically a word. The weight of each component is calculated using the term frequency inverse document frequency weighting scheme (TF-IDF), which tries to reward words that occur many times but in few documents. To classify a new document $Doc'$, the cosines of the prototype vectors with the corresponding document vector are calculated for each class. $Doc'$ is assigned to the class whose prototype vector has the highest cosine with its document vector. A further description of the TF-IDF measure is given by Joachims [12].
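As an illustration, here is a plain NumPy sketch of this prototype-and-cosine scheme (a Rocchio-style formulation assumed from the description, not the authors' code):

```python
import numpy as np

def train_prototypes(X, y, n_classes):
    """Prototype vector per class: mean of that class's TF-IDF document vectors."""
    return np.vstack([X[y == c].mean(axis=0) for c in range(n_classes)])

def classify(doc_vec, prototypes):
    """Assign the document to the class whose prototype has the highest cosine."""
    norms = np.linalg.norm(prototypes, axis=1) * np.linalg.norm(doc_vec)
    cosines = prototypes @ doc_vec / np.maximum(norms, 1e-12)
    return int(np.argmax(cosines))

# Toy data: 6 documents over 4 terms, 2 classes.
X = np.array([[2., 0., 1., 0.], [3., 1., 0., 0.], [1., 0., 2., 0.],
              [0., 2., 0., 3.], [0., 1., 1., 2.], [0., 3., 0., 1.]])
y = np.array([0, 0, 0, 1, 1, 1])
prototypes = train_prototypes(X, y, n_classes=2)
print(classify(np.array([0., 1., 0., 2.]), prototypes))  # -> 1
```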
4.4. The Bayesian Classifier

For statistical classification, we have used a standard Bayesian classifier. When using the Bayesian classifier, we have assumed that each term's occurrence is independent of the other terms. We want to find the class $cs$ that gives the highest conditional probability given a document $Doc'$. Here $w = w_1, w_2, \ldots, w_m$ is the set of words representing the textual content of the document $Doc'$ and $k$ is the term number, where $k = 1, 2, \ldots, m$. The classification score is measured by

$$P(cs \mid Doc') \propto P(cs) \prod_{k=1}^{m} P(w_k \mid cs) \qquad (5)$$

where $P(cs)$ is the prior probability of class $cs$, and $P(w_k \mid cs)$ is the likelihood of a word $w_k$ in a class $cs$, which is estimated on labeled training documents. A given web page document is then classified into the class that maximizes the classification score. If the scores for all the classes in the sports category are less than a given threshold, then the document is considered unclassified. The Rainbow toolkit has been used for the classification of the training and test news web pages using the Bayesian and TF-IDF methods [13].
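A minimal sketch of this scoring rule in log space follows (Laplace smoothing is an assumption for the illustration, not taken from the paper):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log P(cs) and log P(w_k | cs) with Laplace smoothing."""
    classes = set(labels)
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for words, c in zip(docs, labels):
        word_counts[c].update(words)
    vocab = {w for words in docs for w in words}
    likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values())
        likelihood[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                         for w in vocab}
    return prior, likelihood

def score(words, prior, likelihood):
    """Classification score log P(cs) + sum_k log P(w_k | cs) for each class."""
    return {c: prior[c] + sum(likelihood[c].get(w, 0.0) for w in words)
            for c in prior}

docs = [["lewis", "heavyweight", "fight"], ["tyson", "knockout"],
        ["skiing", "slalom"], ["skiing", "snow"]]
labels = ["boxing", "boxing", "skiing", "skiing"]
prior, likelihood = train_nb(docs, labels)
print(max(score(["heavyweight", "knockout"], prior, likelihood).items(),
          key=lambda kv: kv[1]))
```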
5. Experiments

We have used a web page dataset from the Yahoo sports news, as shown in Table 2. The types of news in the database are baseball, boxing, cycling, football, golf, hockey, motor-sports, rugby, skiing, soccer, swimming and tennis. The total number of documents is 4096. For the training, we have randomly selected 1000 documents from the different classes. The rest of the documents are used as the test sets. For the PCA approach, we have selected 600 principal components and input them directly to the neural networks for classification. The PCA, TF-IDF, Bayesian, and WPCM classification methods are evaluated using the standard information retrieval measures, that is, precision, recall, and F1. They are defined as follows:


$$precision = \frac{a}{a + b}, \qquad recall = \frac{a}{a + c}, \qquad F1 = \frac{2 \times precision \times recall}{precision + recall}$$

where the values of $a$, $b$, and $c$ are defined in Table 3. The relationship between the system classification and the expert judgment is expressed using four values, as shown in Table 4.

Table 3 The definitions of the parameters a, b, and c which are used in Table 4.

a | The system and the expert agree with the assigned category
b | The system assigned the document to the category but the expert disagrees with the assigned category
c | The system did not assign the document to the category but the expert did
d | The system and the expert disagree with the assigned category

Table 4 The relationship between the system classification and the expert judgment.

            | Expert: Yes | Expert: No
System: Yes | a           | b
System: No  | c           | d
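These definitions translate directly into code; the small sketch below computes the three measures from the Table 4 counts (the example numbers echo class 1 of Table 6):

```python
def evaluate(a, b, c):
    """Precision, recall, and F1 from the Table 4 contingency counts."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 68 correct, 2 wrongly assigned, 32 missed by the system.
print(evaluate(a=68, b=2, c=32))  # ~ (0.971, 0.680, 0.800)
```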
Table 5 The classification results using the precision, recall, and F1 measures for the PCA method.

Class No. | Precision (%) | Recall (%) | F1 (%)
1         | 89.82         | 100.00     | 94.64
2         | 84.75         | 100.00     | 91.74
3         | 71.60         | 72.50      | 72.05
4         | 97.06         | 99.00      | 98.02
…         | …             | …          | …
12        | 69.44         | 75.00      | 72.12
Average   | 87.79         | 92.04      | 89.89

Table 6 The classification results using the precision, recall, and F1 measures for the WPCM method.

Class No. | Precision (%) | Recall (%) | F1 (%)
1         | 97.14         | 68.00      | 80.00
2         | 86.96         | 100.00     | 93.02
…         | …             | …          | …
6         | 94.29         | 99.00      | 96.59
…         | …             | …          | …
Average   | 90.35         | 93.81      | 91.65

Table 7 The classification results using the F1 measure for the TF-IDF and Bayesian methods.

Class No. | TF-IDF (%) | Bayesian (%)
…         | …          | …
6         | 68.03      | 86.89
…         | …          | …
Average   | 78.06      | 81.07

5.1. Results

The classification results using the PCA and the WPCM methods are shown in Tables 5 and 6. The averages for the precision, recall, and F1 measures using the PCA approach are 87.79%, 92.04%, and 89.89%, respectively. In comparison, with the WPCM approach the precision, recall, and F1 measures are 90.35%, 93.81%, and 91.65%, respectively. Furthermore, we have compared the WPCM with the TF-IDF and Bayesian methods using the F1 measure. The averages are 78.06% and 81.07% for the TF-IDF and Bayesian methods, as shown in Table 7. This indicates that if the feature vectors are selected carefully, the combination of the PCA and CPBF in the WPCM approach will increase the web sports news classification accuracy.

The rugby and cycling classes (class numbers 8 and 3) contain a small number of documents in the dataset. The F1 measures using the PCA approach for the rugby and cycling classes are 100% and 73%, respectively, as shown in Table 5. But for the WPCM approach, the classification accuracies for these classes are 98.80% and 96.15%, respectively. These results indicate that the WPCM approach has been able to classify the documents correctly although the number of documents representing the class is small. Also, we have overcome the limitation of the PCA on supervised data, where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction, by using the WPCM approach. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components.

6. Conclusions

We have presented a new approach to web news classification using the WPCM. The feature selection for the WPCM is based on the combination of the PCA and CPBF. The comparison of the classification accuracy using the PCA, WPCM, TF-IDF, and Bayesian approaches has been presented in this paper. The WPCM approach has been applied to classify the Yahoo sports news web pages. The experimental evaluation with different classification algorithms demonstrates that this method provides acceptable classification accuracy with the sports news datasets. Also, we have overcome the limitation of the PCA on supervised data, where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction, by using the WPCM approach. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components. Although the classification accuracy using the WPCM approach is high in comparison with the other approaches, the time taken for training is relatively long compared with the other methods.

7. References

[1] S. Wermter, Neural network agents for learning semantic text classification, Information Retrieval, Vol. 3, No. 2, pp. 87-103, 2000.
[2] S. L. Y. Lam and D. L. Lee, Feature reduction for neural network based text categorization, Proceedings of the 6th International Conference on Database Systems for Advanced Applications, Hsinchu, Taiwan, 19-22 April 1999.
[3] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 34(1), 2002.
[4] G. Karypis and E. H. Han, Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval & categorization, CIKM 2000, 2000.
[5] S. T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments and Computers, 23(2), pp. 229-236, 1991.
[6] Yahoo Web Pages, (http://www.yahoo.com), 2001.
[7] H. Mase and H. Tsuji, Experiments on automatic web page categorization for information retrieval system, Journal of Information Processing, IPSJ Journal, pp. 334-347, Feb. 2001.
[8] R. Calvo, M. Partridge, and M. Jabri, A comparative study of principal components analysis techniques, In Proc. Ninth Australian Conf. on Neural Networks, Brisbane, QLD, pp. 276-281, 1998.
[9] A. Selamat, S. Omatu, and H. Yanagimoto, Information retrieval from the internet using mobile agent search system, International Conference of Information Technology and Multimedia 2001 (ICIMU), Universiti Tenaga Nasional (UNITEN), Malaysia, August 13-15, 2001.
[10] G. Salton and M. J. McGill, Introduction to modern information retrieval, McGraw-Hill, New York, USA, 1983.
[11] K. Sparck Jones and P. Willett, Readings in information retrieval, Morgan Kaufmann, San Francisco, USA, 1997.
[12] T. Joachims, Probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Proceedings of the International Conference on Machine Learning (ICML), 1997.
[13] A. K. McCallum, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, (http://www.cs.cmu.edu/~mccallum/bow), 1996.
