Keywords: Automatic Text Categorization, Telugu Corpus, Naive Bayes Classifier, Machine Learning

1 Introduction

Over the past decade, there has been an explosion in the availability of electronic information.* Not much work has been done on Text Categorization in Indian languages [17, 4, 6, 5]. In this paper we describe the experiments we have conducted on the Telugu News Article Corpus developed by us at the University of Hyderabad. The corpus has 9870 files totalling 27.5 million words (tokens) arising from about 16 lakh (1.6 million) types (distinct word forms). The documents are classified into various categories such as Politics, Sports, Business, Cinema, Health, Editorials, District News, Letters, etc. In this work, documents from the Politics, Sports, Business and Cinema categories are selected, giving a set of 794 documents. The purpose is to build a system that can automatically classify a given document into one of these four categories with a high degree of accuracy. We follow a machine learning approach known as the Naive Bayes Classifier. The details of the experiments conducted and the results obtained are included.

* The research reported in this paper was supported by the University Grants Commission under the research project entitled "Language Engineering Research" under the UPE scheme.
2 Text Categorization Defined

The aim of Automatic Text Categorization is to classify documents, typically based on their subject matter or topic, without any manual effort. Automated text categorization can be defined as assigning pre-defined category labels to new documents based on the likelihood suggested by a training set of labeled documents. It is the task of assigning a value to each pair (dj, ci) ∈ D × C, where D is a domain of documents and C is a set of predefined categories.

There are several variations to this theme:

• Constraints may be imposed on the number of categories that may be assigned to each document - exactly k, at least k, at most k, and so on. In the single label case, k = 1 and a single category is to be assigned to each document. If k is more than 1, we have multi-label categorization.

• The text categorization problem can be reduced to a set of binary classification problems, one for each category, where each document is categorized into either ci or its complement c̄i.
• In Hard Categorization, the classifier is required to firmly assign categories to documents (or the other way around), whereas in Ranking Categorization the system ranks the various possible assignments and the final decision about class assignments is left to the user. This leads us to the possibility of semi-automatic or interactive classifiers where human users take the final decisions to ensure the highest levels of accuracy.

• In Document Pivoted Categorization a given document is to be assigned category label(s), whereas in Category Pivoted Categorization all documents that belong to a given category must be identified. This distinction is more pragmatic than conceptual. Thus, if all the documents are not available to start with, document pivoted categorization may be more appropriate, while category pivoted categorization may be the preferred choice if new categories get added and already classified documents need to be reclassified.

• If only unlabeled training data is available, we may have to use unsupervised learning techniques to perform Text Clustering instead of classification into known classes.

In this work, document pivoted, single label, hard categorization based on labeled training data has been carried out on Telugu documents.
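The following is a minimal sketch, in Python, of this setting: one document at a time is mapped to exactly one of the four categories used in this work. The names and the score(document, category) interface are illustrative placeholders only (for instance, for the Naive Bayes scores described in Section 3); they are not taken from the paper.

```python
# Document pivoted, single label, hard categorization: each document
# receives exactly one category label (k = 1).

CATEGORIES = ["Politics", "Sports", "Business", "Cinema"]

def hard_single_label_categorize(document, score):
    """Given one document and a scoring function score(document, category),
    return the single best category label."""
    return max(CATEGORIES, key=lambda c: score(document, c))
```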
3 Techniques for Text Categorization

Before the 1990s, the predominant approach to text classification was the knowledge based approach. With the increasing availability of large scale data in electronic form and advances in machine learning and statistical inference, there has been a clear shift over the last decade or so towards automatic learning from large scale data.
In the Machine Learning approach, a general inductive process (also called the learner) automatically builds a classifier for a category ci by observing the characteristics of a set of training documents already classified under ci or c̄i. The inductive process gleans from these labeled training data the characteristics that a new unseen document should have in order to be classified under ci. The classification problem is thus an activity of supervised learning.

An increasing number of learning approaches have been applied, including Regression Models [18, 24], Nearest Neighbor Classification [13, 23, 27, 25, 21], Bayesian Probabilistic Approaches [20, 11, 15, 10, 9, 14, 3], Decision Trees [18, 11, 15, 9, 1], Inductive Rule Learning [2, 7, 8, 16], Neural Networks [22, 19], On-line Learning [8, 12], and Support Vector Machines [9]. Yang and Liu [26] have made a systematic comparative study of several of these approaches and concluded that all methods perform comparably when the distribution of documents across categories is more or less uniform.

It has been shown that the Naive Bayes Classifier can be used effectively for text categorization in Indian languages [4]. It was observed that better training data, where documents are properly classified subject-wise, would be highly desirable for further work in this area. In this paper we apply the Naive Bayes classifier to the Telugu News Article Corpus developed by us. The performance obtained is comparable to published results for other languages.

3.1 Bayesian Learning Methods

Bayesian Learning is a probabilistic approach to inference based on the assumption that the quantities of interest are governed by probability distributions and that the optimal decision can be made by reasoning about these probabilities together with the observed data.

In Bayesian Learning methods a maximum a posteriori (MAP) probability is computed using the Bayes Theorem. The basic idea is to use the joint probabilities of document terms and categories to estimate the probabilities of categories given a document.

In some cases, the prior probabilities of all the hypotheses are assumed to be uniform and hence bracketed out. This assumption of uniform priors is questionable and has led to criticism of Bayesian approaches.

The Bayesian method requires the estimation of the joint probabilities of all the features for each category. In order to simplify this, independence is often assumed. That is, the conditional probability of a feature given a category is assumed to be independent of the conditional probabilities of the other features given that category. A Bayesian classifier that makes this independence assumption is termed a Naive Bayes Classifier. The independence assumption is rarely valid in the real world. Yet the method works quite well and is used in practice [11, 15, 10, 3, 14].
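Under this independence assumption, the likelihood of a document dj given a category Ci factorizes over its terms as P(dj | Ci) = Π_{w ∈ dj} P(w | Ci); taking logarithms turns this product into the sum of log probabilities used in equation (1) below.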
To categorize a test document dj as belonging to a category Ci, the maximum likelihood is estimated over all categories:

P(dj | Ci) = Σ_{w ∈ dj} log(P(w | Ci))        (1)

The prior probabilities of each category, Prior(Ci), are evaluated as the ratios of the number of documents in category Ci to the number of documents in the total collection.

Finally, the posterior probabilities of each category are calculated by adding the log likelihoods to the log priors:

P(Ci | dj) = log(P(dj | Ci)) + log(Prior(Ci))        (2)
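The sketch below shows, in Python, how the quantities in equations (1) and (2) can be computed. The class name, the whitespace tokenization and the add-one (Laplace) smoothing are assumptions made for illustration only; the actual implementation details are not specified in this section.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesTextClassifier:
    """Minimal sketch of the Naive Bayes computation in equations (1) and (2).
    Assumptions of this sketch: whitespace tokenization and add-one
    (Laplace) smoothing so that unseen words do not yield log(0)."""

    def fit(self, documents, labels):
        # Prior(Ci): fraction of training documents that belong to category Ci.
        label_counts = Counter(labels)
        total_docs = len(labels)
        self.log_prior = {c: math.log(n / total_docs)
                          for c, n in label_counts.items()}

        # Word occurrence counts per category.
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.split())

        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        self.categories = list(label_counts)
        return self

    def log_likelihood(self, document, category):
        # Equation (1): sum of log P(w | Ci) over the words of the document.
        counts = self.word_counts[category]
        total = sum(counts.values())
        vocab_size = len(self.vocabulary)
        score = 0.0
        for w in document.split():
            # Add-one smoothing is an assumption of this sketch.
            score += math.log((counts[w] + 1) / (total + vocab_size))
        return score

    def predict(self, document):
        # Equation (2): posterior = log likelihood + log prior; pick the maximum.
        posteriors = {c: self.log_likelihood(document, c) + self.log_prior[c]
                      for c in self.categories}
        return max(posteriors, key=posteriors.get)
```

A classifier of this kind would be trained and applied as, for example, NaiveBayesTextClassifier().fit(train_docs, train_labels).predict(test_doc).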
A test document is assigned the category with the maximum posterior probability. To minimize misclassification errors due to narrow differences, a threshold value can be used to include a reject option. Performance can then be measured in terms of Precision and Recall. In order to capture the Precision-Recall trade-off in a single quantity, a combined measure such as the F-measure can be used.
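For a given category, Precision is the fraction of the documents assigned to that category that actually belong to it, and Recall is the fraction of the documents belonging to that category that are correctly assigned to it. A commonly used balanced form of the F-measure is

F = 2 · Precision · Recall / (Precision + Recall)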
4 Text Representation

The number of potential features (distinct words) in a text collection is typically very large (often running into tens of thousands). The curse of dimensionality states that the number of training data samples required grows exponentially with the number of features. Choice of the right subset of potential features is a major concern. A variety of dimensionality reduction techniques are used in pattern recognition, but applying them to text categorization requires care. Stop word removal, morphology or stemming, and identification of phrases and collocations are some of the steps commonly employed in text categorization to obtain more discriminative features and/or to reduce the number of features. It may be noted that these methods are to a large extent language specific.

In this work all the distinct words are taken as features and documents are represented as vectors of these features. No morphology, stemming or stop word removal is employed. This gives us a baseline performance. Analysis of the results will help in the selection of better features and in the judicious application of various dimensionality reduction techniques, either by statistical methods or through the careful application of linguistic analyses.
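As an illustration of this baseline representation, the sketch below (Python) builds the feature set from all distinct word forms in the collection and represents each document as a vector of word counts. Whitespace tokenization is an assumption of the sketch, since the tokenization used for the Telugu corpus is not described in this section.

```python
from collections import Counter

def build_vocabulary(documents):
    """Every distinct word form in the collection becomes a feature;
    no stop word removal, stemming or morphological analysis is applied."""
    vocabulary = sorted({w for doc in documents for w in doc.split()})
    return {w: i for i, w in enumerate(vocabulary)}

def document_vector(document, vocab_index):
    """Represent a document as a vector of word counts over the feature set."""
    vector = [0] * len(vocab_index)
    for w, n in Counter(document.split()).items():
        if w in vocab_index:          # word forms unseen in training are ignored
            vector[vocab_index[w]] = n
    return vector
```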