
# AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN

J.Sreemathy

Research Scholar

Karpagam University

Coimbatore, India

P. S. Balamurugan

Research Scholar

ANNA UNIVERSITY

Coimbatore, India

Abstract – The main objective is to propose a text classification method based on feature selection and preprocessing, thereby reducing the dimensionality of the feature vector and increasing the classification accuracy. Text classification is the process of assigning a document to one or more target categories based on its contents. In the proposed method, machine learning methods for text classification are used to apply text preprocessing to different datasets and then extract feature vectors for each new document using various feature weighting methods, enhancing the text classification accuracy. After training the classifier with the Naive Bayesian (NB) and K-nearest neighbor (KNN) algorithms, the prediction can be made according to the category distribution among the k nearest neighbors. Experimental results show that the methods are favorable in terms of their effectiveness and efficiency when compared with other classifiers such as SVM.

Keywords– Text classification; Feature selection; K-Nearest Neighbor; Naïve Bayesian

I. INTRODUCTION

The dimensionality of the feature vector is the major problem in text classification. The dimensionality of the features is reduced by a technique called preprocessing. Preprocessing helps to identify the informative features in order to improve the accuracy of the classifier. An unstructured database is preprocessed by stop-word removal and stemming, which remove the obstacle words in order to improve the accuracy of the classifier. There are also noise reduction techniques that work only for k-NN and that can be effective in improving the accuracy of the classifier. Naive Bayes can be used for both binary and multiclass classification problems.
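As an illustration of the preprocessing step, the sketch below removes stop words and applies a crude suffix-stripping stemmer. The stop-word list and suffix rules are hypothetical stand-ins for a real stop-word lexicon and a Porter-style stemmer.

```python
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def preprocess(text):
    """Lowercase, tokenize on whitespace, drop stop words, and apply a
    crude suffix-stripping stemmer (a stand-in for a real Porter stemmer)."""
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "es", "s"):
            # Only strip when a reasonable stem remains.
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed
```

A real system would use a published stop-word list and a proper stemmer, but this captures the shape of the step: fewer, more informative features per document.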

II. FEATURE SELECTION

Feature selection is an important task in classifying text; it improves the classification accuracy by eliminating noise features from the various corpora. Feature selection is defined as the process of selecting a subset of relevant features for building robust learning models. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the feature vector. Second, by eliminating noise features it improves the classification accuracy.

III. FEATURE WEIGHTING

Feature weighting is the process of assigning a score to each feature (i.e., a term or single word) using a score-computing function. The features with the highest scores are selected. Scores are assigned to each feature in the selected subset of relevant features. The mathematical definitions of the score-computing functions are often expressed through probabilities, which are estimated from statistical information about the documents across different categories. Seven different weighting functions are implemented.

A. Document Frequency (DF)

DF is the number of documents in which a term occurs. It is defined as

J.Sreemathy et al. / International Journal on Computer Science and Engineering (IJCSE)

ISSN : 0975-3397 Vol. 4 No. 03 March 2012 392

DF(F) = Σ_{i=1}^{m} A_i

where A_i = 1 if the term F occurs in document i and 0 otherwise, and m is the number of documents.
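A minimal sketch of computing DF, assuming each document is represented as a collection of its terms:

```python
def document_frequency(term, documents):
    """DF = number of documents in which the term occurs at least once.
    Each document is any container of terms (a set or list of tokens)."""
    return sum(1 for doc in documents if term in doc)
```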

B. Mutual Information (MI)

The mutual information of two random variables is a quantity that measures the mutual dependence of the

two random variables. MI measures how much information the presence/absence of a term contributes to

making the correct classification decision.

MI(F, C_k) = Σ_{v_f ∈ {1,0}} Σ_{v_c ∈ {1,0}} P(F = v_f, C_k = v_c) ln [ P(F = v_f, C_k = v_c) / ( P(F = v_f) P(C_k = v_c) ) ]
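The MI score can be estimated from the 2×2 contingency counts of a term against a category. The sketch below assumes maximum-likelihood probability estimates from raw counts (no smoothing):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI(F, C_k) from 2x2 contingency counts:
    n11 = docs containing F in class C_k, n10 = containing F outside C_k,
    n01 = lacking F in C_k, n00 = lacking F outside C_k.
    Sums p(F=v_f, C=v_c) * ln(p(F,C) / (p(F) * p(C))) over the four cells."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    for nij, nf, nc in (
        (n11, n11 + n10, n11 + n01),  # F present, C_k member
        (n10, n11 + n10, n10 + n00),  # F present, non-member
        (n01, n01 + n00, n11 + n01),  # F absent, member
        (n00, n01 + n00, n10 + n00),  # F absent, non-member
    ):
        if nij > 0:
            total += (nij / n) * math.log(nij * n / (nf * nc))
    return total
```

When term and category are independent the score is zero; perfect association yields ln 2 for a balanced split.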

C. Information Gain (IG)

Here both class membership and the presence/absence of a particular term are seen as random variables, and one computes how much information about the class membership is gained by knowing the presence/absence statistics (as is done in decision tree induction).

IG(t) = H(C) − H(C|T) = Σ_{τ,c} P(C = c, T = τ) ln [ P(C = c, T = τ) / ( P(C = c) P(T = τ) ) ]

Here, τ ranges over {present, absent} and c ranges over {c+, c−}.
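Information gain can likewise be estimated from the 2×2 contingency counts as H(C) − H(C|T). A minimal sketch using natural-log entropies and maximum-likelihood estimates:

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_gain(n11, n10, n01, n00):
    """IG(t) = H(C) - H(C|T), where n11 = term present & class c+,
    n10 = present & c-, n01 = absent & c+, n00 = absent & c-."""
    n = n11 + n10 + n01 + n00
    h_c = entropy([(n11 + n01) / n, (n10 + n00) / n])
    h_c_given_t = 0.0
    for present, absent in ((n11, n10), (n01, n00)):
        group = present + absent
        if group:
            h_c_given_t += (group / n) * entropy([present / group, absent / group])
    return h_c - h_c_given_t
```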

D. χ² Statistic (CHI)

Feature selection by chi-square testing is based on Pearson's χ² (chi-square) test. The chi-square test of independence helps to find out whether two variables X and Y are related to or independent of each other. In feature selection, the chi-square test measures the independence of a feature and a category. The null hypothesis here is that the feature and the category are completely independent. It is defined by,

χ²(F, C_k) = N × ( N_{F,C_k} × N_{F̄,C̄_k} − N_{F,C̄_k} × N_{F̄,C_k} )² / ( N_F × N_{F̄} × N_{C_k} × N_{C̄_k} )

where N is the total number of documents, N_{F,C_k} is the number of documents in category C_k containing feature F, and barred subscripts denote the complements (documents lacking F, or outside C_k).
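The χ² score can be computed from the same 2×2 contingency counts as the other weighting functions; a score near zero supports the null hypothesis of independence:

```python
def chi_square(n11, n10, n01, n00):
    """Pearson chi-square score of a feature against a category, where
    n11 = docs in the category containing the feature, n10 = docs outside
    the category containing it, and n01/n00 are the same without it."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A degenerate marginal (feature or category never/always occurs).
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom
```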

E. NGL coefficient

The NGL coefficient presented in [NGL97] is a variant of the Chi square metric. It was originally named a

`correlation coefficient', but we follow Sebastiani [Seb02] and name it `NGL coefficient' after the last names of

the inventors Ng, Goh, and Low. The NGL coefficient looks only for evidence of positive class membership,

while the chi square metric also selects evidence of negative class membership.

NGL(F, C_k) = √N × ( N_{F,C_k} × N_{F̄,C̄_k} − N_{F,C̄_k} × N_{F̄,C_k} ) / √( N_F × N_{F̄} × N_{C_k} × N_{C̄_k} )

F. Term Frequency Document Frequency (TFDF)

The TFDF weight is a method based on the term frequency combined with a document frequency threshold. It is defined as,

TFDF(F) = n_1 × n_2 + c × ( n_1 × n_3 + n_2 × n_3 )

G. Gss coefficient

The GSS coefficient was originally presented in [GSS00] as a `simplified chi square function'. We follow [Seb02] and name it GSS after the names of the inventors Galavotti, Sebastiani, and Simi.

GSS(F, C_k) = N_{F,C_k} × N_{F̄,C̄_k} − N_{F,C̄_k} × N_{F̄,C_k}


IV. TEXT CLASSIFICATION PROCESS

Figure 1. Block diagram for text classification

V. CLASSIFIERS AND DOCUMENT COLLECTIONS

We selected two classifiers for text classification:

- K-Nearest Neighbors

- Naive Bayesian

A. K-Nearest Neighbor Classifier

K-Nearest Neighbor is one of the most popular algorithms for text categorization. The k-nearest neighbor algorithm (k-NN) classifies objects based on the closest training examples in the feature space. The working of KNN can be detailed as follows: when a test document has to be classified, the KNN algorithm searches for its nearest neighbors among the pre-classified training documents. The K nearest neighbors are ranked by similarity scores calculated using a similarity measure such as Euclidean distance. The distance between two documents X and Y under the Euclidean measure is

Dist(X, Y) = √( Σ_{i=1}^{D} (X_i − Y_i)² )

The categories of the test document are then predicted using the ranked scores. The classification for the input pattern is the class with the highest confidence. The performance of each learning model is tracked using cross validation, which is used to validate predetermined metrics such as performance and accuracy.
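The steps described above — Euclidean distance, ranking the k nearest pre-classified documents, and majority voting over their categories — can be sketched as follows; the feature vectors in the example are illustrative only:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Dist(X, Y) = sqrt(sum_i (X_i - Y_i)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(test_vec, training, k=3):
    """Rank the k nearest pre-classified training vectors by Euclidean
    distance and predict the majority category among them.
    `training` is a list of (feature_vector, category) pairs."""
    neighbors = sorted(training, key=lambda tc: euclidean(test_vec, tc[0]))[:k]
    votes = Counter(cat for _, cat in neighbors)
    return votes.most_common(1)[0][0]
```

In a full system the vectors would be the weighted feature vectors produced by the selection and weighting stages, and the vote could be weighted by similarity score.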

[Figure 1: the test document undergoes text preprocessing, a feature selection method reduces the feature vector, and the KNN and Naive Bayesian classifiers score it against the trained documents.]


B. Naive Bayesian Classifier

The naive Bayesian classifier is a probability-based classifier: assuming the features are independent, a probability value is calculated for each model. The naive Bayesian classifier constructs a set of classes, i.e., category labels together with a probability for each category. Scoring is done by ranking the category probabilities for every test document. The classification for an instance is the class with the highest probability value.
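A minimal multinomial naive Bayes sketch consistent with the description above. Laplace (add-one) smoothing is an assumption added here so that a term unseen in a category does not zero out that category's probability:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes: scores each category by
    log P(c) + sum over terms of log P(term | c), with add-one smoothing,
    and predicts the highest-scoring category."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for term in doc:
                self.term_counts[label][term] += 1
                self.vocab.add(term)
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        n_docs = sum(self.class_counts.values())
        for c, count in self.class_counts.items():
            score = math.log(count / n_docs)  # log prior
            total = sum(self.term_counts[c].values()) + len(self.vocab)
            for term in doc:
                score += math.log((self.term_counts[c][term] + 1) / total)
            if score > best_score:
                best, best_score = c, score
        return best
```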

VI. PERFORMANCE METRIC

The evaluation of a classifier is done using the precision and recall measures. To derive a robust measure of the effectiveness of the classifier, we calculate the breakeven point, the eleven-point precision, and the "average precision". The classification is evaluated for a threshold ranging from 0 (recall = 1) up to a value where the precision equals 1 and the recall equals 0, incrementing the threshold by a given step size. The breakeven point is the point where recall meets precision, and the eleven-point precision is the averaged value of the precision at the points where recall equals the eleven values 0.0, 0.1, 0.2, ..., 0.9, 1.0. "Average precision" refines the eleven-point precision, as it approximates the area below the precision/recall curve.
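The precision/recall machinery can be sketched as below. The eleven-point precision here uses the standard interpolation (best precision at any recall ≥ the level), which is one common reading of the averaging described above:

```python
def precision_recall(relevant, retrieved):
    """Precision = |relevant ∩ retrieved| / |retrieved|,
    recall = |relevant ∩ retrieved| / |relevant|."""
    hits = len(set(relevant) & set(retrieved))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_precision(curve):
    """Average the interpolated precision at recall = 0.0, 0.1, ..., 1.0.
    `curve` is a list of (recall, precision) points; interpolated precision
    at level r is the best precision at any recall >= r."""
    levels = [i / 10 for i in range(11)]
    avg = 0.0
    for r in levels:
        candidates = [p for rec, p in curve if rec >= r]
        avg += max(candidates) if candidates else 0.0
    return avg / len(levels)
```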

Table 1. Precision/recall measures across different classifiers

| Classifier | Recall (Self made) | Precision (Self made) | Recall (Reuters 21578) | Precision (Reuters 21578) |
|---|---|---|---|---|
| SVM | 58.8% | 75.4% | 88.6% | 57.9% |
| KNN | 88.8% | 80.4% | 95.9% | 85.5% |
| Naive Bayesian | 90.2% | 93.1% | 94.6% | 95.5% |

Figure 2. Performance chart of three different classifiers


VII. EXPERIMENTAL RESULTS

A. Data Set 1: Self Made

For development we used a small self-made corpus that contains standard categories such as "Science", "Business", "Sports", "Health", "Education", "Travel", and "Movies". It contains around 150 documents in the above-mentioned categories.

B. Data Set 2: The Reuters 21578 corpus

The second corpus included for development is the Reuters 21578 corpus, which is freely available on the internet (Lewis 1997). Since the system uses an XML parser, it was necessary to convert the 22 SGML documents to XML using the freely available tool SX (Clark 2001). After the conversion, some single characters with decimal values below 30 that were rejected by the validating XML parser were deleted. This does not affect the results, since those characters would have been treated as whitespace anyway.

VIII. CONCLUSION

We analyzed text classification using the Naive Bayesian and K-Nearest Neighbor classifiers; the methods are favorable in terms of their effectiveness and efficiency when compared with other classifiers such as SVM. The advantage of the proposed approach is that the classification algorithm learns the importance of attributes and utilizes them in the similarity measure. In future work, a classification model can be built that analyzes terms at the concept-sentence level within a document.

REFERENCES

[1] J.-Y. Jiang, R.-J. Liou, and S.-J. Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification," IEEE Trans. Knowledge and Data Eng., vol. 23, no. 3, Mar. 2011.

[2] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, and Z. Chen, "Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 3, pp. 320-333, Mar. 2006.

[3] Li, T. Jiang, and K. Zang, "Efficient and Robust Feature Extraction by Maximum Margin Criterion," in T. Sebastian, S. Lawrence, and S. Bernhard, eds., Advances in Neural Information Processing Systems, pp. 97-104, Springer, 2004.

[4] D.D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop on Speech and Natural Language, pp. 212-217, 1992.

[5] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, 2010.

[6] Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.

[7] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[8] H. Park, M. Jeon, and J. Rosen, "Lower Dimensional Representation of Text Data Based on Centroids and Least Squares," BIT Numerical Math, vol. 43, pp. 427-448, 2003.

AUTHORS PROFILE

Sreemathy J received the BE degree in 2007 in Computer Science & Engineering from Park College of Engineering and Technology, Coimbatore, and is currently pursuing the M.E. in Computer Science & Engineering at Karpagam University, Coimbatore.

Prof. P. S. Balamurugan received the BE degree in 2003 from Bharathiyar University, Coimbatore, and the ME in 2005 from Anna University. Currently he is pursuing his doctoral degree at Anna University, Coimbatore.

