AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN

J.Sreemathy
Research Scholar
Karpagam University
Coimbatore, India
P. S. Balamurugan
Research Scholar
ANNA UNIVERSITY
Coimbatore, India

Abstract – The main objective is to propose a text classification method based on feature selection and
preprocessing, thereby reducing the dimensionality of the feature vector and increasing the classification
accuracy. Text classification is the process of assigning a document to one or more target categories based
on its contents. In the proposed method, machine learning methods for text classification apply text
preprocessing to different datasets and then extract a feature vector for each new document using various
feature weighting methods, enhancing the text classification accuracy. The classifier is then trained with
the Naive Bayesian (NB) and K-nearest neighbor (KNN) algorithms; for KNN, the prediction is made
according to the category distribution among the k nearest neighbors. Experimental results show that the
methods are favorable in terms of their effectiveness and efficiency when compared with other classifiers
such as SVM.
Keywords– Text classification; Feature selection; K-Nearest Neighbor; Naïve Bayesian
I. INTRODUCTION
The dimensionality of the feature vector is the major problem in text classification. The dimensionality of the
features is reduced by preprocessing, which helps to identify the informative features and thereby improve the
accuracy of the classifier. An unstructured database is preprocessed by stop-word removal and stemming, which
remove obstructive words in order to improve the accuracy of the classifier. There are also noise reduction
techniques that work only for k-NN and can be effective in improving its accuracy. Naive Bayes can be used for
both binary and multiclass classification problems.
II. FEATURE SELECTION
Feature selection is an important task in classifying text; it is defined as the process of selecting a subset of
relevant features for building robust learning models. Feature selection serves two main purposes: it makes
training and applying a classifier more efficient by decreasing the dimensionality of the feature vector, and it
improves classification accuracy by eliminating noise features from the corpus.
III. FEATURE WEIGHTING
Feature weighting is the process of assigning a score to each feature, i.e., a term or single word, using a
score-computing function. The features with the highest scores are selected; scores are assigned to each feature
in the selected subset of relevant features. The mathematical definitions of the score-computing functions are
usually expressed in terms of probabilities that are estimated from statistics of the documents across different
categories. Seven different weighting functions are implemented.

A. Document Frequency (DF)
DF is the number of documents in which a term occurs. It is defined as
J.Sreemathy et al. / International Journal on Computer Science and Engineering (IJCSE)
ISSN : 0975-3397 Vol. 4 No. 03 March 2012 392
DF(F) = Σ_{i=1}^{m} A_i

where A_i = 1 if the term occurs in document i and 0 otherwise, and m is the number of documents.
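As a sketch, document frequency can be computed directly from a collection; representing each document as a set of terms is an illustrative assumption.

```python
def document_frequency(term, documents):
    """DF(F) = sum over documents of A_i, where A_i is 1 if the
    term occurs in document i and 0 otherwise."""
    return sum(1 for doc in documents if term in doc)

# Toy collection (illustrative data).
docs = [
    {"stock", "market", "trade"},
    {"match", "score", "team"},
    {"stock", "price"},
]
print(document_frequency("stock", docs))  # 2
```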

B. Mutual Information (MI)
The mutual information of two random variables is a quantity that measures the mutual dependence of the
two random variables.  MI measures how much information the presence/absence of a term contributes to
making the correct classification decision.
MI(F, Ck) = Σ_{vf∈{1,0}} Σ_{vc∈{1,0}} P(F = vf, Ck = vc) ln [ P(F = vf, Ck = vc) / (P(F = vf) P(Ck = vc)) ]
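A sketch of MI computed from a 2×2 contingency table of document counts; the count layout (term presence × class membership) is an illustrative assumption.

```python
from math import log

def mutual_information(n11, n10, n01, n00):
    """MI(F, Ck) from a 2x2 contingency table of document counts:
    n11: term present & in class,  n10: present & not in class,
    n01: absent & in class,        n00: absent & not in class."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each tuple: (joint count, marginal count of F-value, marginal of C-value)
    for nf_c, nf, nc in [
        (n11, n11 + n10, n11 + n01),  # F=1, Ck=1
        (n10, n11 + n10, n10 + n00),  # F=1, Ck=0
        (n01, n01 + n00, n11 + n01),  # F=0, Ck=1
        (n00, n01 + n00, n10 + n00),  # F=0, Ck=0
    ]:
        if nf_c > 0:
            mi += (nf_c / n) * log((nf_c * n) / (nf * nc))
    return mi

print(round(mutual_information(2, 0, 0, 2), 4))  # ln 2 ≈ 0.6931
```

An independent term and class give MI = 0; perfect co-occurrence gives the maximum.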
 
C. Information Gain (IG)
Here both class membership and the presence/absence of a particular term are seen as random variables, and
one computes how much information about the class membership is gained by knowing the presence/absence
statistics (as used in decision tree induction).
IG(t) = H(C) – H(C|T) = Σ_{τ,c} P(C=c, T=τ) ln [P(C=c, T=τ) / (P(C=c) P(T=τ))]
Here, τ ranges over {present, absent} and c ranges over {c+, c–}.
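The identity IG(t) = H(C) − H(C|T) can be checked numerically. The sketch below computes both entropies from a 2×2 term/class contingency table of document counts; the table layout is an illustrative assumption.

```python
from math import log

def entropy(probs):
    """Shannon entropy in nats, skipping zero probabilities."""
    return -sum(p * log(p) for p in probs if p > 0)

def information_gain(n11, n10, n01, n00):
    """IG(t) = H(C) - H(C|T) over document counts:
    n11: term present & c+,  n10: present & c-,
    n01: absent & c+,        n00: absent & c-."""
    n = n11 + n10 + n01 + n00
    p_c = [(n11 + n01) / n, (n10 + n00) / n]  # class prior
    h_c_given_t = 0.0
    for nt, (a, b) in [(n11 + n10, (n11, n10)),   # T = present
                       (n01 + n00, (n01, n00))]:  # T = absent
        if nt > 0:
            h_c_given_t += (nt / n) * entropy([a / nt, b / nt])
    return entropy(p_c) - h_c_given_t

print(round(information_gain(2, 0, 0, 2), 4))  # ln 2 ≈ 0.6931
```

When the term determines the class completely, IG equals the class entropy H(C); when they are independent, IG = 0.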
D. χ² Statistic (CHI)
Feature selection by chi-square testing is based on Pearson's χ² (chi-square) test. The chi-square test
of independence helps to find out whether the variables X and Y are related to or independent of each other. In
feature selection, the chi-square test measures the independence of a feature and a category. The null hypothesis
here is that the feature and the category are completely independent. It is defined by
χ²(F, Ck) = N · (N_{F,Ck} · N_{¬F,¬Ck} − N_{F,¬Ck} · N_{¬F,Ck})² / (N_F · N_{¬F} · N_{Ck} · N_{¬Ck})

where N is the total number of documents, N_{F,Ck} is the number of documents containing feature F and belonging to category Ck, and ¬ denotes absence of the feature or category.
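A sketch of the χ² score computed from document counts; the 2×2 count layout is an illustrative assumption.

```python
def chi_square(n_fc, n_fnc, n_nfc, n_nfnc):
    """chi^2(F, Ck) = N * (N_{F,Ck}*N_{~F,~Ck} - N_{F,~Ck}*N_{~F,Ck})^2
                      / (N_F * N_{~F} * N_{Ck} * N_{~Ck})
    Arguments are document counts for (feature, category) presence/absence."""
    n = n_fc + n_fnc + n_nfc + n_nfnc
    n_f, n_nf = n_fc + n_fnc, n_nfc + n_nfnc      # feature marginals
    n_c, n_nc = n_fc + n_nfc, n_fnc + n_nfnc      # category marginals
    num = n * (n_fc * n_nfnc - n_fnc * n_nfc) ** 2
    return num / (n_f * n_nf * n_c * n_nc)

print(chi_square(2, 0, 0, 2))  # 4.0 (perfect dependence: chi^2 = N)
```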

E. NGL coefficient
The NGL coefficient presented in [NGL97] is a variant of the Chi square metric. It was originally named a
`correlation coefficient', but we follow Sebastiani [Seb02] and name it `NGL coefficient' after the last names of
the inventors Ng, Goh, and Low. The NGL coefficient looks only for evidence of positive class membership,
while the chi square metric also selects evidence of negative class membership.
NGL(F, Ck) = √N · (N_{F,Ck} · N_{¬F,¬Ck} − N_{F,¬Ck} · N_{¬F,Ck}) / √(N_F · N_{¬F} · N_{Ck} · N_{¬Ck})
F. Term Frequency Document Frequency (TFDF)
The TFDF weight is based on the term frequency combined with a document frequency
threshold; it is defined as
TFDF(F) = n1 × n2 + c × (n1 × n3 + n2 × n3)
G. GSS Coefficient
The GSS coefficient was originally presented in [GSS00] as a `simplified chi square function'. We follow
[Seb02] and name it GSS after the names of the inventors Galavotti, Sebastiani, and Simi.
Gss(F, Ck) = N_{F,Ck} · N_{¬F,¬Ck} − N_{F,¬Ck} · N_{¬F,Ck}
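Both chi-square variants can be sketched from the same 2×2 document counts. Using raw counts in place of probabilities for GSS is an illustrative assumption.

```python
from math import sqrt

def ngl(n_fc, n_fnc, n_nfc, n_nfnc):
    """NGL coefficient: a signed square-root form of chi-square, so only
    positively correlated (feature, category) pairs score high."""
    n = n_fc + n_fnc + n_nfc + n_nfnc
    num = sqrt(n) * (n_fc * n_nfnc - n_fnc * n_nfc)
    den = sqrt((n_fc + n_fnc) * (n_nfc + n_nfnc)    # N_F * N_~F
               * (n_fc + n_nfc) * (n_fnc + n_nfnc))  # N_Ck * N_~Ck
    return num / den

def gss(n_fc, n_fnc, n_nfc, n_nfnc):
    """GSS ('simplified chi-square'): drops the normalizing denominator."""
    return n_fc * n_nfnc - n_fnc * n_nfc

print(ngl(2, 0, 0, 2), ngl(0, 2, 2, 0))  # positive vs. negative evidence
```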



IV. TEXT CLASSIFICATION PROCESS

[Figure: a test document passes through text preprocessing and a feature selection method that reduces the feature vector; the reduced vector is classified by KNN or Naive Bayesian, trained on labeled documents, to produce a score.]

Figure 1. Block diagram for text classification
V. CLASSIFIERS AND DOCUMENT COLLECTIONS
We selected two classifiers for text classification:
- K-Nearest Neighbors
- Naive Bayesian
A. K-Nearest Neighbor Classifier
K-Nearest Neighbor is one of the most popular algorithms for text categorization. The k-nearest neighbor
algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space.
It works as follows: when a test document has to be classified, the KNN algorithm searches for the nearest
neighbors among the pre-classified training documents. The k nearest neighbors are ranked by similarity scores
calculated with a similarity measure such as the Euclidean distance. The distance between two neighbors is
found using the formula


Dist(X, Y) = √( Σ_{i=1}^{D} (Xi − Yi)² )

The categories of the test document are then predicted using the ranked scores: the classification for the input
pattern is the class with the highest confidence. The performance of each learning model is tracked using
cross-validation, which is used to validate predetermined metrics such as performance and accuracy.
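The k-NN procedure above can be sketched as follows; the numeric feature vectors and sample training data are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def euclidean(x, y):
    """Dist(X, Y) = sqrt(sum_i (X_i - Y_i)^2)"""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(test_vec, training, k=3):
    """training: list of (feature_vector, label) pairs.
    Rank neighbors by distance and vote among the k nearest."""
    neighbors = sorted(training, key=lambda item: euclidean(test_vec, item[0]))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Toy pre-classified training documents as 2-D feature vectors.
train = [((1.0, 1.0), "sports"), ((1.2, 0.9), "sports"),
         ((8.0, 9.0), "business"), ((7.5, 8.5), "business")]
print(knn_classify((1.1, 1.0), train, k=3))  # sports
```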
 
B. Naive Bayesian Classifier
The naive Bayesian classifier is a probability-based classifier: assuming the features are independent, a
probability value is calculated for each model. The naive Bayesian classifier constructs a set of classes, i.e.,
category labels, each with a probability for that category. Scoring is done by ranking the category probabilities
for every test document rather than for a training document. The classification for an instance is the class with
the highest probability value.
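A minimal multinomial Naive Bayes sketch. Bag-of-words token counts and add-one (Laplace) smoothing are assumptions here; the paper does not specify its smoothing or event model.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(documents):
    """documents: list of (token_list, label). Returns class priors,
    per-class token counts, and the vocabulary."""
    priors = Counter(label for _, label in documents)
    counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in documents:
        counts[label].update(tokens)
        vocab.update(tokens)
    return priors, counts, vocab

def classify_nb(tokens, priors, counts, vocab):
    """Score each class by log P(c) + sum_t log P(t|c), with add-one
    smoothing, and return the highest-scoring class."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for label, prior in priors.items():
        denom = sum(counts[label].values()) + len(vocab)
        score = log(prior / total)
        score += sum(log((counts[label][t] + 1) / denom) for t in tokens)
        if score > best_score:
            best, best_score = label, score
    return best

# Toy labeled documents (illustrative data).
docs = [(["goal", "match", "team"], "sports"),
        (["team", "score"], "sports"),
        (["stock", "market"], "business"),
        (["market", "trade"], "business")]
priors, counts, vocab = train_nb(docs)
print(classify_nb(["match", "team"], priors, counts, vocab))  # sports
```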
VI. PERFORMANCE METRIC
The evaluation of a classifier is done using the precision and recall measures. To derive a robust measure of
the effectiveness of the classifier, we calculate the breakeven point, the 11-point precision, and the "average
precision". The classification is evaluated for a threshold ranging from 0 (recall = 1) up to a value where the
precision equals 1 and the recall equals 0, incrementing the threshold with a given step size. The breakeven
point is the point where recall meets precision, and the eleven-point precision is the average of the precision at
the points where recall equals the eleven values 0.0, 0.1, 0.2, ..., 0.9, 1.0. "Average precision" refines the
eleven-point precision, as it approximates the area below the precision/recall curve.
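The measures above can be sketched as follows. The interpolation rule (taking the maximum precision at recall ≥ each level) is a standard convention and an assumption here, as the paper does not spell it out.

```python
def precision_recall(predicted, relevant):
    """predicted and relevant are sets of document ids."""
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(relevant) if relevant else 1.0
    return precision, recall

def eleven_point_precision(ranked, relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0 over a
    list of document ids ranked by classifier score (best first)."""
    points = []
    tp = 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            tp += 1
        points.append((tp / len(relevant), tp / i))  # (recall, precision)
    interp = []
    for level in [i / 10 for i in range(11)]:
        candidates = [p for r, p in points if r >= level]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / 11

print(eleven_point_precision(["a", "b"], {"a", "b"}))  # 1.0
```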

Table 1. Precision Recall measure across different classifiers


Figure 2. Performance chart of three different classifiers

Corpus           Self made              Reuters 21578
Terms            Recall    Precision   Recall    Precision
SVM              58.8%     75.4%       88.6%     57.9%
KNN              88.8%     80.4%       95.9%     85.5%
Naive Bayesian   90.2%     93.1%       94.6%     95.5%
VII. EXPERIMENTAL RESULTS

A. Data Set 1: Self Made
For development we used a small self-made corpus containing standard categories such as "Science",
"Business", "Sports", "Health", "Education", "Travel", and "Movies". It contains around 150 documents across
these categories.
B. Data Set 2: The Reuters 21578 corpus
The second corpus used for development is the Reuters 21578 corpus, which is freely available on the
internet (Lewis 1997). Since our system uses an XML parser, it was necessary to convert the 22 SGML
documents to XML with the freely available tool SX (Clark 2001). After the conversion we deleted some single
characters that were rejected by the validating XML parser because they had decimal values below 30. This
does not affect the results, since those characters would have been treated as whitespace anyway.
VIII. CONCLUSION
We analyzed text classification using the Naive Bayesian and K-Nearest Neighbor classifiers; the methods
are favorable in terms of their effectiveness and efficiency when compared with other classifiers such as SVM.
The advantage of the proposed approach is that the classification algorithm learns the importance of attributes
and utilizes them in the similarity measure. In the future, a classification model can be built that analyzes terms
at the level of concept sentences in a document.
REFERENCES
[1] J.-Y. Jiang, R.-J. Liou, and S.-J. Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification," IEEE Trans.
Knowledge and Data Eng., vol. 23, no. 3, Mar. 2011.
[2] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, and Z. Chen, "Effective and Efficient Dimensionality Reduction
for Large-Scale and Streaming Data Preprocessing," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 3, pp. 320-333, Mar. 2006.
[3] Li, T. Jiang, and K. Zang, "Efficient and Robust Feature Extraction by Maximum Margin Criterion," in T. Sebastian, S. Lawrence, and
S. Bernhard, eds., Advances in Neural Information Processing Systems, pp. 97-104, Springer, 2004.
[4] D.D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop on Speech and Natural Language,
pp. 212-217, 1992.
[5] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, 2010.
[6] Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning
Research, vol. 6, pp. 37-53, 2005.
[7] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[8] H. Park, M. Jeon, and J. Rosen, "Lower Dimensional Representation of Text Data Based on Centroids and Least Squares," BIT
Numerical Math, vol. 43, pp. 427-448, 2003.

AUTHORS PROFILE
Sreemathy J received her BE degree in Computer Science & Engineering in 2007 from Park College of
Engineering and Technology, Coimbatore, and is currently pursuing an M.E. in Computer Science & Engineering
at Karpagam University, Coimbatore.

Prof. P. S. Balamurugan received his BE degree in 2003 from Bharathiyar University, Coimbatore, and his ME in
2005 from Anna University. He is currently pursuing his doctoral degree at Anna University, Coimbatore.
