
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 10, No. 9, September 2012

Comparison of Supervised Learning Techniques for Binary Text Classification


Hetal Doshi
Dept of Electronics and Telecommunication, KJSCE, Vidyavihar, Mumbai - 400077, India
hsdoshi@gmail.com

Maruti Zalte
Dept of Electronics and Telecommunication, KJSCE, Vidyavihar, Mumbai - 400077, India
mbzalte@rediffmail.com

Abstract- An automated text classifier is a useful aid in information management. In this paper, supervised learning techniques, namely Naive Bayes, Support Vector Machine (SVM) and K Nearest Neighbour (KNN), are implemented for classifying certain categories from the 20 Newsgroup and WebKB datasets. Two weighting schemes to represent documents are employed and compared. Results show that the effectiveness of the weighting scheme depends on the nature of the dataset and the modeling approach adopted. Accuracy of the classifiers can be improved by using more training documents. Naive Bayes mostly performs better than SVM and KNN when the number of training documents is small. The average improvement in SVM with more training documents is better than that of Naive Bayes and KNN. Accuracy of KNN is lower than that of Naive Bayes and SVM. A procedure to evaluate the optimum classifier for a given dataset using cross-validation is verified. A procedure for identifying the probable misclassified documents is developed.

Keywords- Naive Bayes, SVM, KNN, supervised learning, text classification.

I. INTRODUCTION

Manually organizing a large set of electronic documents into required categories/classes can be extremely taxing, time consuming and expensive, and is often not feasible. Text classification, also known as text categorization, deals with the assignment of text to a particular category from a predefined set of categories based on the words of the text. Text classification combines the concepts of Machine Learning and Information Retrieval. Machine Learning is a field of Artificial Intelligence (AI) that deals with the development of techniques or algorithms that let computers understand and extract patterns in given data. Various applications of machine learning in speech recognition, computer vision, robot control etc. are discussed in [1]. Text classification finds applications in various domains like Knowledge Management, Human Resource Management, sorting of online information, emails, information technology and the internet [2]. Text classification can be implemented using various supervised and unsupervised machine learning techniques [3]. Various performance parameters for evaluating binary text classification are discussed in [4]. Accuracy is the evaluation parameter for the classifiers implemented in this paper.

A binary text classifier is a function that maps input feature vectors x to output class/category labels y ∈ {1, 0}. The aim is to learn the function f from an available labeled training set of N input-output pairs (xi, yi), i = 1, ..., N [5]. This is called supervised learning, as opposed to unsupervised learning, which does not use a labeled training set. There are two ways of implementing a classifier model. In a discriminative model, the aim is to learn a function that computes the class posterior p(y|x) directly; thus it discriminates between the different classes given the input. In a generative model, the aim is to learn the class conditional density p(x|y) for each value of y, and also the class priors p(y), and then to compute the class posterior by applying Bayes rule, as shown below [5],

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')} \qquad (1)$$

This is known as a generative model because it specifies a way to generate the feature vector x for each possible class y. The Naive Bayes classifier is an example of a generative model, while SVM is an example of a discriminative model. KNN adopts a different approach from both: calculations are deferred until actual classification, and no model is built from the training examples.

In this paper, section II explains Naive Bayes and the TF (Term Frequency) and TF*IDF (Term Frequency * Inverse Document Frequency) weighting schemes. Section III explains the SVM and its quadratic programming optimization problem. Section IV describes the KNN classifier and the distance computation method. Section V provides the implementation steps for the binary text classifier. Section VI discusses the result analysis, followed by conclusions in section VII.

II. NAIVE BAYES

Naive Bayes (NB) is based on a probabilistic model that uses a collection of labeled training documents to estimate the parameters of the model; every new document is then classified using Bayes rule by selecting the category that is most likely to have generated it [8]. The principles of Bayes theorem and their application in developing a Naive Bayes classifier are discussed in [6]. Naive Bayes is simple in both its training and its classification phase [7]. The Naive Bayes model assumes that all the attributes of the training documents are independent of each other given the class. Reference [8]


http://sites.google.com/site/ijcsis/ ISSN 1947-5500


describes the differences and details of the two Naive Bayes models, the multivariate Bernoulli model and the multinomial model. Its results show that, for large vocabularies, the multinomial model achieves better accuracy than the multivariate Bernoulli model. The multinomial model of Naive Bayes is selected for implementation in this paper; its procedure is described in [9] and [10].

If document Di (representing the input feature vector x), i = 1, ..., N, is to be classified, the learning algorithm should assign it to the required category Ck (representing the output class y). The category can be either C1, the category with label 1, or C0, the category with label 0. In the multinomial model, the document feature vector captures word frequency information and not just presence or absence. In this model, a biased V-sided die is considered, and each side of the die represents a word Wt with probability p(Wt|Ck), t = 1, ..., V. At each position in the document the die is rolled and a word is inserted. A document is thus generated as a bag of words, which records which words are present in the document and their frequency of occurrence.

Mathematically, Mi is defined as the multinomial model feature vector for the ith document Di. Mit is the frequency with which word Wt occurs in document Di, and ni is the total number of words in Di. The vocabulary size V is the number of unique words found in the documents. Training documents are scanned to obtain the following counts:
N: number of documents
Nk: number of documents of class Ck, for both classes
From these counts the likelihoods p(Wt|Ck) and the priors p(Ck) are estimated. Let Zik = 1 when Di has class Ck and Zik = 0 otherwise, and let N be the total number of documents. Then [9],

$$\hat{p}(W_t \mid C_k) = \frac{\sum_{i=1}^{N} M_{it} Z_{ik}}{\sum_{s=1}^{V} \sum_{i=1}^{N} M_{is} Z_{ik}} \qquad (2)$$

If a particular word does not appear in a category, the probability calculated by (2) becomes zero. To avoid this problem, Laplace smoothing is applied [9],

$$\hat{p}(W_t \mid C_k) = \frac{1 + \sum_{i=1}^{N} M_{it} Z_{ik}}{V + \sum_{s=1}^{V} \sum_{i=1}^{N} M_{is} Z_{ik}} \qquad (3)$$

The priors are estimated as,

$$\hat{p}(C_k) = \frac{N_k}{N} \qquad (4)$$

After training is performed and the parameters are ready, for every new unlabelled document Dj the posterior probability for each category is estimated as [9]

$$p(C_k \mid D_j) \propto p(C_k) \prod_{t=1}^{V} p(W_t \mid C_k)^{M_{jt}} \qquad (5)$$

The calculation above is done for both categories and the results are compared. The label of the category with the greater value is assigned to the testing document. The assigned labels are compared with the true labels of the testing documents to evaluate accuracy.

Mit corresponds to the Term Frequency (TF) representation, in which the frequency of occurrence of a particular word Wt in a given document Di is captured (local information). The TF representation has the problem that it scales up frequent terms and scales down rare terms, although rare terms are mostly more informative than high-frequency terms. The basic intuition is that a word that occurs frequently in many documents is not a good discriminator. A weighting scheme can help solve this problem. TF*IDF provides information about how important a word is to a document in a collection. It does this by incorporating both local and global information, because it takes into consideration not only the isolated term but also the term within the document collection [4].
NFt = document frequency, i.e. the number of documents containing term t, and N = number of documents.
NFt/N = probability of selecting a document containing a queried term from a collection of documents.
log(N/NFt) = Inverse Document Frequency, IDFt, which represents global information. 1 is added to NFt to avoid division by zero in some cases.
Multiplying TF values by IDF values combines local and global information. Therefore the weight of a term = TF * IDF. This is commonly referred to as TF*IDF weighting.

Longer documents, with more terms and higher term frequencies, tend to get larger dot products than shorter documents, which can skew and bias similarity measures, so normalization is recommended. A very common normalization method is to divide the weights by the L2 norm of the document. The L2 norm of the vector representing a document is simply the square root of the dot product of the document vector with itself.

III. SUPPORT VECTOR MACHINE

An SVM model represents training documents as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on. Certain properties of text, such as high-dimensional feature spaces, few irrelevant features (i.e. a dense concept vector) and sparse document vectors, are handled well by SVM, making it suitable for document classification [11]. The SVM classification algorithm is based on the maximum margin training algorithm. It finds a decision function D(x) for pattern vectors x of dimension V belonging to either of the two categories 1 and 0 (-1). The input to the training algorithm is a set of N examples xi with labels yi, i.e. (x1, y1), (x2, y2), (x3, y3), ..., (xN, yN).
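To make the idea of learning a maximum-margin decision function D(x) from labeled examples concrete, here is a minimal sketch in Python. It is not the quadratic-programming route or the LIBSVM implementation used in this paper: it swaps in plain full-batch subgradient descent on the regularized hinge loss (the soft-margin objective with regularization parameter C). The function names, the learning rate, the number of epochs and the toy data are all assumptions made for illustration, and labels are taken as +1 and -1.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Full-batch subgradient descent on 0.5*||w||^2 + C * sum of hinge losses.

    Illustrative substitute for a QP solver; labels y must be +1 or -1.
    """
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)      # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:    # example violates the margin
                for j in range(dim):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision(w, b, x):
    """Linear decision function D(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy, linearly separable 2-D data (hypothetical, not from the paper's datasets).
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [1 if decision(w, b, x) >= 0 else -1 for x in X]
```

A new example is assigned to a category by the sign of `decision(w, b, x)`, which corresponds to the side of the separating hyperplane on which it falls.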


Fig. 1 shows a two-dimensional feature space with vectors belonging to one of the two categories. A two-dimensional space is considered for simplicity in understanding the maximum margin training algorithm. The objective is to make the margin M that separates the two categories as wide as possible. SVM maximizes the margin around the separating hyperplane. The decision function is fully specified by a subset of the training samples (the support vectors).

Figure 1: Maximum margin solution in two dimensional space

To obtain the classifier boundary in terms of the weights w and the bias b, two hyperplanes are defined: the plus hyperplane $w \cdot x + b = +1$ and the minus hyperplane $w \cdot x + b = -1$, which are the borders of the maximum margin. The distance between the plus and minus hyperplanes is called the margin M, which is to be maximized,

$$M = \frac{2}{\lVert w \rVert} \qquad (6)$$

The margin $2 / \lVert w \rVert$ is to be maximized, given the fact that, with the label of category C0 taken as -1,

$$y_i \, (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, N \qquad (7)$$

Sometimes the vectors are not linearly separable, as indicated in Fig. 2 below.

Figure 2: Non-linearly separable vector points

Hence there is a need to soften the constraint that these data points lie on the correct side of the plus and minus hyperplanes, i.e. some data points are allowed to violate these constraints, preferably by a small amount ξi. This approach also helps to improve generalization and is called soft margin SVM. In this approach, slack variables are introduced as shown in the quadratic programming model,

$$\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i$$

Subject to,

$$y_i \, (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N \qquad (8)$$

C is a regularization parameter that trades off how large a margin is preferred against the number of training set examples that violate this margin, and by what amount [12]. The optimum value of C is obtained using cross-validation. The mathematical analysis proceeds by converting the soft margin SVM problem (8) into an equivalent Lagrangian dual problem, which is to be maximized [12], [15], using Lagrange multipliers αi,

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)$$

Subject to constraints,

$$0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \qquad (9)$$

With $w = \sum_{i} \alpha_i y_i x_i$, the bias b is obtained by applying the decision function to two arbitrary supporting patterns, x1 belonging to C1 and x2 belonging to C0 [13],

$$b = -\frac{1}{2} \, (w \cdot x_1 + w \cdot x_2) \qquad (10)$$

LIBSVM, a library for Support Vector Machines, is used to implement SVM for text classification in this paper [14]. Its objective is to help users apply SVM easily to their applications. Reference [17] covers the practical aspects of implementing the SVM algorithm and of using a linear kernel for text classification, which involves a large number of features.

IV. K NEAREST NEIGHBOUR

K nearest neighbour is a pattern recognition technique that classifies by evaluating the closest training examples in the multidimensional feature space. It is a type of instance-based or lazy learning, in which the decision function is approximated locally and all computations are deferred until classification. A document is classified by a majority vote of its neighbours, with the document being assigned to the category most common amongst its K nearest neighbours. Reference [16] explains how K Nearest Neighbour can be applied to text classification and the importance of the value of K for a given problem. KNN is based on calculating the distance of the query document from the training documents


[18]. Cosine similarity is selected for distance measurement, as documents are represented as vectors in a multidimensional feature space. For two document vectors d1 and d2 it is calculated as follows,

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert} \qquad (11)$$

To improve the accuracy of the KNN classifier, choosing the optimum value of K is important. K is the parameter that indicates the number of nearest neighbours to be considered for label computation. The accuracy of the KNN classifier is severely affected by the presence of noisy and irrelevant features. The best value of K is data dependent. A larger value of K reduces the effect of noise and results in a smoother, less locally sensitive decision function, but it makes the boundaries between the classes less distinct, which can result in misclassification. The optimum value of K is evaluated by cross-validation.

V. IMPLEMENTATION STEPS OF BINARY TEXT CLASSIFIER

There are primarily three steps in implementing a binary text classifier using MATLAB(TM) as a tool:
1. Feature extraction: In this step, a text document, consisting of the words on the basis of which classification is performed, is converted into a matrix format capturing the properties of the words found in the documents. This can be done in two ways, TF or TF*IDF. A matrix is created in which the number of rows equals the number of documents, the number of columns equals the number of words in the dictionary defined for the given classification task, and each element holds the TF or TF*IDF weight of a word in the respective document.
2. Training the classifier: During the training phase the classifier is provided with the training documents along with their labels. The classifier develops a model representing the pattern by which training documents are related to their labels on the basis of the words appearing in the documents. Parameter tuning is performed using cross-validation.
3. Testing the classifier: On the basis of the model developed, the classifier predicts labels for the testing documents. Accuracy of classification is assessed by comparing the predicted labels with the true labels. Accuracy = number of correctly classified testing documents / total number of testing documents.

VI. RESULT ANALYSIS

For implementation and evaluation of the text classifiers, the 20 Newsgroup and WebKB datasets, available at [20], are used. The difference in the nature of the datasets is described in [19]. Within each dataset there are two groups. For 20 Newsgroup, group 1 is rec.sport.baseball & rec.sport.hockey and group 2 is sci.electronics & sci.med. For WebKB, group 1 is Faculty and Course and group 2 is Student and Course. Accuracy evaluation is performed for all four groups using TF and TF*IDF document representation.

A. Comparison of TF and TF*IDF weighting scheme
Total accuracy for TF and TF*IDF representation is obtained after performing a 10-fold cross-validation process for different values of C for the SVM classifier and of K for the KNN classifier. In 10-fold cross-validation, the entire training set for a given group is divided into 10 subsets of almost equal size. Sequentially, one subset is tested using the classifier trained on the remaining 9 subsets, so the process is repeated 10 times, and the total accuracy is evaluated. As the division of the training set into 10 almost equal parts is performed randomly, three iterations are performed and the average is calculated to obtain the optimum C value for SVM and K value for KNN, i.e. the values that give the highest cross-validation accuracy. SVM and KNN are implemented using these optimum values of C and K. The comparison of TF and TF*IDF representation is shown in Table I for 20 Newsgroup group 1, in Table II for 20 Newsgroup group 2, in Table III for WebKB group 1 and in Table IV for WebKB group 2.

TABLE I. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR 20 NEWSGROUP GROUP 1
Training docs: 1197   Testing docs: 796

Document representation   Accuracy %
                          NB        SVM               KNN
TF                        97.236    93.844 (C=1)      85.302 (K=1)
TF*IDF                    97.613    97.362 (C=1)      96.231 (K=30)

TABLE II. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR 20 NEWSGROUP GROUP 2
Training docs: 1185   Testing docs: 789

Document representation   Accuracy %
                          NB        SVM               KNN
TF                        96.831    92.269 (C=1)      79.214 (K=1)
TF*IDF                    96.451    96.198 (C=1)      93.156 (K=10)

TABLE III. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR WEBKB GROUP 1
Training docs: 1358   Testing docs: 684

Document representation   Accuracy %
                          NB        SVM               KNN
TF                        97.368    97.807 (C=0.01)   94.298 (K=10)
TF*IDF                    97.222    98.538 (C=1)      94.152 (K=45)

TABLE IV. COMPARISON OF TF AND TF*IDF REPRESENTATION FOR WEBKB GROUP 2
Training docs: 1697   Testing docs: 854

Document representation   Accuracy %
                          NB        SVM               KNN
TF                        97.658    97.19 (C=0.1)     94.262 (K=10)
TF*IDF                    96.956    98.009 (C=1)      94.496 (K=40)
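The two document representations compared in Tables I to IV can be sketched in a few lines of Python. This is only an illustration of the weighting described in Section II, not the MATLAB code used for the experiments: the helper names and the toy documents are hypothetical, the natural logarithm is an assumption (the base is not stated), and 1 is added to the document frequency NFt as the text describes.

```python
import math


def tf_matrix(docs, vocab):
    """Rows are documents, columns are vocabulary words, entries are raw counts."""
    return [[doc.count(word) for word in vocab] for doc in docs]


def l2_normalize(row):
    """Divide a weight vector by its L2 norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)


def tfidf_matrix(tf):
    """TF*IDF weights with IDF_t = log(N / (1 + NF_t)), then L2 normalization."""
    n_docs = len(tf)
    n_terms = len(tf[0])
    # NF_t: number of documents that contain term t
    doc_freq = [sum(1 for row in tf if row[t] > 0) for t in range(n_terms)]
    idf = [math.log(n_docs / (1 + doc_freq[t])) for t in range(n_terms)]
    weighted = [[row[t] * idf[t] for t in range(n_terms)] for row in tf]
    return [l2_normalize(row) for row in weighted]


# Hypothetical tokenized documents and dictionary.
vocab = ["goal", "puck", "pitch", "bat"]
docs = [
    ["goal", "goal", "puck"],
    ["pitch", "bat"],
    ["goal", "bat"],
    ["puck"],
]

tf = tf_matrix(docs, vocab)
tfidf = tfidf_matrix(tf)
```

In the second toy document, "pitch" and "bat" have the same term frequency, but "pitch" occurs in fewer documents and therefore receives the larger TF*IDF weight, which is the effect the weighting scheme is meant to produce.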

TF*IDF emphasizes the weight of low-frequency terms. For SVM and KNN, the difference in accuracy between TF*IDF and TF representation is more pronounced in the 20 Newsgroup dataset than in the WebKB dataset, as seen from Tables I-IV. This results from the fact that the contribution of low-frequency words to text categorization is significant in the 20 Newsgroup dataset compared to the WebKB dataset [19]. SVM and KNN rely on the spatial distribution of documents in the multidimensional feature space and attempt to solve the classification problem by spatial means. SVM tries to find a hyperplane in that space separating the categories, and its classification model depends on the support vectors. KNN tries to compute which K training examples are closest to the testing document. TF*IDF weighting largely affects the spatial domain, helping SVM and KNN to perform better.

For the Naive Bayes classifier, accuracy does not vary much between TF and TF*IDF representation. Naive Bayes builds a classifier based on the probability of words and their relative occurrences in the different categories. Hence all the training documents are used for building the model for both TF and TF*IDF representation, so its performance is high for the TF representation and does not change much for the TF*IDF representation.

The contribution of low-frequency words to text categorization is not significant in WebKB compared to 20 Newsgroup [19]. Hence, for the WebKB dataset, the performance difference between TF and TF*IDF representation is not significant for any of the three classifiers.

B. Comparison over increasing training documents
The comparison of all the classifiers (TF*IDF representation) for an increasing number of training documents is shown in Fig. 3 for 20 Newsgroup group 1, Fig. 4 for 20 Newsgroup group 2, Fig. 5 for WebKB group 1 and Fig. 6 for WebKB group 2.
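A comparison over increasing training documents amounts to retraining a classifier on growing prefixes of the training set and scoring each model on the same testing set. The sketch below does this for a small multinomial Naive Bayes with Laplace smoothing, following the estimates of Section II. It is a simplified illustration on raw word counts: the function names, the training sizes and the toy count vectors are assumptions and do not reproduce the experiments behind Figs. 3 to 6.

```python
import math


def train_nb(docs, labels, vocab_size):
    """Multinomial Naive Bayes with Laplace smoothing; docs are word-count vectors."""
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        rows = [d for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log(len(rows) / len(docs))           # prior N_k / N
        word_totals = [sum(r[t] for r in rows) for t in range(vocab_size)]
        denom = vocab_size + sum(word_totals)                    # smoothed denominator
        log_lik[c] = [math.log((1 + wt) / denom) for wt in word_totals]
    return log_prior, log_lik


def predict_nb(model, doc):
    """Pick the class with the larger log posterior (log of prior times likelihoods)."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(m * lp for m, lp in zip(doc, log_lik[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    hits = sum(predict_nb(model, d) == l for d, l in zip(docs, labels))
    return hits / len(docs)


def learning_curve(train_docs, train_labels, test_docs, test_labels, sizes, vocab_size):
    """Accuracy on a fixed test set after training on the first n documents."""
    curve = []
    for n in sizes:
        model = train_nb(train_docs[:n], train_labels[:n], vocab_size)
        curve.append((n, accuracy(model, test_docs, test_labels)))
    return curve


# Hypothetical count vectors over a 3-word vocabulary; label 1 favours word 0.
train_docs = [[3, 0, 1], [0, 3, 1], [2, 1, 0], [1, 2, 0], [4, 0, 0], [0, 4, 1]]
train_labels = [1, 0, 1, 0, 1, 0]
test_docs = [[2, 0, 1], [0, 2, 1], [3, 1, 0], [1, 3, 0]]
test_labels = [1, 0, 1, 0]

curve = learning_curve(train_docs, train_labels, test_docs, test_labels, [2, 4, 6], 3)
```

Each training prefix must contain documents of both classes, which the alternating labels of the toy data guarantee. Plotting the `(n, accuracy)` pairs for each classifier gives curves of the kind compared in this subsection.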

Figure 3: Comparison of Naive Bayes, SVM and KNN for 20 Newsgroup Group 1

Figure 4: Comparison of Naive Bayes, SVM and KNN for 20 Newsgroup Group 2

Figure 5: Comparison of Naive Bayes, SVM and KNN for WebKB Group 1


Figure 6: Comparison of Naive Bayes, SVM and KNN for WebKB Group 2

Results show that the accuracy of a classifier depends on the number of training documents, and that a larger number of training documents can increase the accuracy of the classification task. In the case of Naive Bayes, the increase in classification accuracy with larger training size results from more accurate probability estimates, as more possibilities are covered. In the case of SVM, it results from obtaining a hyperplane that provides a more generalized solution, thus avoiding an overfitted solution. In the case of KNN, it results from the fact that, with a large number of training documents, the effect of noisy training examples on classification accuracy is reduced, as is the locally sensitive nature of the KNN classifier. Results show that the accuracy of the Naive Bayes classifier is usually better than that of SVM and KNN when the number of training documents is small. But as the training set size increases, SVM classification accuracy becomes comparable to Naive Bayes and in certain cases becomes better. KNN is observed to have lower accuracy than Naive Bayes and SVM. Also, the average improvement in SVM with more training documents is better than that of KNN and Naive Bayes, as seen from Fig. 3, Fig. 4, Fig. 5 and Fig. 6.

Naive Bayes builds a classifier based on the probability of words and their relative occurrences in the different categories. This inherently requires less data than SVM. SVM builds a hyperplane to separate the two categories with maximum margin. Thus it needs more training documents, especially training documents close to the hyperplane, to develop its accuracy, and NB performs better for datasets with few training documents. With more training documents, SVM acquires more data around the supporting hyperplanes and can provide a better solution. SVM therefore learns the data better as more training documents are provided, especially training data close to the hyperplane. These training documents are the support vectors.

Naive Bayes uses word counts/frequencies as features to distinguish between classes. Each word provides a probability that the document is in a particular class, and the individual probabilities are combined to arrive at a final decision. It is expected that adding more words (as a result of adding more training documents) would not drastically change the performance level of Naive Bayes, and this is what is observed. As a result, with more training documents the average performance of SVM improves more than that of Naive Bayes.

KNN, on the other hand, considers the entire multidimensional feature space as a whole and obtains the labels for testing documents on the basis of the nearest neighbour concept. Classification is therefore not based on model building and depends on local information, as emphasis is given to the K nearest neighbours for label computation. Thus its accuracy is mediocre compared to Naive Bayes and SVM.

C. Selection of classifier
The performance of a classifier can be predicted using cross-validation results. Cross-validation (CV) results for the Naive Bayes, SVM and KNN classifiers, together with the classification accuracy of all three classifiers, are provided below in Table V for TF representation and in Table VI for TF*IDF representation.
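The cross-validation accuracies reported in Tables V and VI come from a 10-fold procedure run on the training set only. A minimal sketch of such a procedure is given below for a cosine-similarity KNN, where it is used to pick K. The function names, the fixed random seed, the tie-breaking rule for even K and the toy vectors are assumptions for illustration. Unlike the procedure in this paper, the sketch runs the random split once rather than averaging three repetitions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k most similar training documents (labels 0 or 1)."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: cosine(pair[0], query), reverse=True)
    votes_for_1 = sum(label for _, label in ranked[:k])
    return 1 if 2 * votes_for_1 > k else 0   # ties go to label 0 (assumption)


def cross_val_accuracy(X, y, k, folds=10, seed=0):
    """Total accuracy over all folds: each document is tested exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]             # every folds-th shuffled index
        held_out = set(test_idx)
        train_X = [X[i] for i in idx if i not in held_out]
        train_y = [y[i] for i in idx if i not in held_out]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                       for i in test_idx)
    return correct / len(X)


def select_k(X, y, candidates, folds=10):
    """Return the candidate K with the highest cross-validation accuracy."""
    scores = {k: cross_val_accuracy(X, y, k, folds) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical 2-D document vectors: class 1 points along one axis, class 0 the other.
X = [[5, j % 3] for j in range(10)] + [[j % 3, 5] for j in range(10)]
y = [1] * 10 + [0] * 10

best_k, scores = select_k(X, y, [1, 3, 5])
```

The same `cross_val_accuracy` pattern can be wrapped around any of the three classifiers, and the classifier with the highest cross-validation accuracy on the training set is then the one chosen for the testing documents.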

TABLE V. ACCURACY AND CROSS VALIDATION RELATIONSHIP FOR TF REPRESENTATION

Group (TF representation)   Naive Bayes              SVM                      KNN
                            Accuracy %   CV %        Accuracy %   CV %        Accuracy %   CV %
20 Newsgroup group 1        97.236       98.747      93.844       96.825      85.302       94.627
20 Newsgroup group 2        96.831       98.819      92.269       97.258      79.214       91.702
WebKB group 1               97.368       96.533      97.807       97.619      94.298       93.813
WebKB group 2               97.658       97.346      97.19        96.778      94.262       95.129


TABLE VI. ACCURACY AND CROSS VALIDATION RELATIONSHIP FOR TF*IDF REPRESENTATION

Group (TF*IDF representation)   Naive Bayes              SVM                      KNN
                                Accuracy %   CV %        Accuracy %   CV %        Accuracy %   CV %
20 Newsgroup group 1            97.613       98.218      97.362       99.249      96.231       97.882
20 Newsgroup group 2            96.451       98.677      96.198       98.790      93.156       97.720
WebKB group 1                   97.222       96.392      98.538       97.913      94.152       94.035
WebKB group 2                   96.956       97.584      98.009       98.035      94.496       96.719
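As a quick check on how well cross-validation predicts the better classifier, the values of Tables V and VI can be compared programmatically. The sketch below copies the numbers from the two tables, assuming the column order Naive Bayes, SVM, KNN with accuracy followed by CV, and counts for how many of the eight group and representation combinations the classifier with the highest CV accuracy is also the one with the highest testing accuracy. The variable and function names are illustrative only.

```python
# (testing accuracy %, cross-validation accuracy %) per classifier, from Tables V and VI
tables = {
    "TF": {
        "20 Newsgroup group 1": {"NB": (97.236, 98.747), "SVM": (93.844, 96.825), "KNN": (85.302, 94.627)},
        "20 Newsgroup group 2": {"NB": (96.831, 98.819), "SVM": (92.269, 97.258), "KNN": (79.214, 91.702)},
        "WebKB group 1": {"NB": (97.368, 96.533), "SVM": (97.807, 97.619), "KNN": (94.298, 93.813)},
        "WebKB group 2": {"NB": (97.658, 97.346), "SVM": (97.19, 96.778), "KNN": (94.262, 95.129)},
    },
    "TF*IDF": {
        "20 Newsgroup group 1": {"NB": (97.613, 98.218), "SVM": (97.362, 99.249), "KNN": (96.231, 97.882)},
        "20 Newsgroup group 2": {"NB": (96.451, 98.677), "SVM": (96.198, 98.790), "KNN": (93.156, 97.720)},
        "WebKB group 1": {"NB": (97.222, 96.392), "SVM": (98.538, 97.913), "KNN": (94.152, 94.035)},
        "WebKB group 2": {"NB": (96.956, 97.584), "SVM": (98.009, 98.035), "KNN": (94.496, 96.719)},
    },
}

ACCURACY, CV = 0, 1


def best(results, column):
    """Name of the classifier with the highest value in the given column."""
    return max(results, key=lambda name: results[name][column])


comparison = []
for representation, groups in tables.items():
    for group, results in groups.items():
        comparison.append((representation, group, best(results, CV), best(results, ACCURACY)))

matches = sum(by_cv == by_accuracy for _, _, by_cv, by_accuracy in comparison)
```

On these numbers the two choices agree in 6 of the 8 cases. The two exceptions are the 20 Newsgroup groups under TF*IDF, where SVM has the highest CV accuracy while Naive Bayes is ahead on the testing documents by about a quarter of a percentage point.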

In real-life applications, the true labels of the testing documents are not available. To ensure that the classifier selection is optimum for a given dataset, it is advisable to perform cross-validation on the training dataset using different classifiers. Based on the results of the cross-validation process, the classifier that provides the best results should be selected for the text classification task. It is seen from Table V and Table VI that, in most cases, a classifier that gives relatively better cross-validation performance also gives better accuracy on the testing data. Cross-validation helps in estimating how accurately the predictive model of a classifier will generalize to a testing dataset compared to other classifiers.

D. Identification of misclassified documents
In real-life applications, labels for query documents are not available. SVM may give the best performance, but identifying the misclassified documents is not possible with one classifier. After classification is performed by all three classifiers, their results are added and ranks are given to the testing documents. Thus every testing document is assigned a rank of 0, 1, 2 or 3. Rank 0 means that all three classifiers have assigned label 0 to that testing document. Rank 3 means that all three classifiers have assigned label 1 to it. Rank 1 means that two out of the three classifiers have assigned label 0 to the testing document. Rank 2 means that two out of the three classifiers have assigned label 1 to it. Hence a discrepancy between classifier results is observed when the testing document is assigned rank 1 or rank 2. The documents with rank 1 and rank 2 are indicated to the user. This indication alerts the user to the documents which may be misclassified. For the labels, the SVM predicted labels are used. Table VII summarizes this procedure for the identification of misclassified documents.

TABLE VII. IDENTIFICATION OF MISCLASSIFIED DOCUMENTS

Group                   Percentage of documents    Number of documents      Number of misclassified
                        flagged as probably        misclassified by SVM     documents identified
                        misclassified
20 Newsgroup group 1    3.89                       21                       11
20 Newsgroup group 2    4.94                       30                       9
WebKB group 1           5.11                       10                       2
WebKB group 2           4.91                       17                       5
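The ranking procedure summarized in Table VII reduces to summing the 0/1 labels predicted by the three classifiers and flagging every document on which they disagree. A minimal sketch follows. The function names and the five example label vectors are hypothetical and are not taken from the experiments.

```python
def rank_documents(nb_labels, svm_labels, knn_labels):
    """Rank of a document = number of classifiers that predicted label 1 (0 to 3)."""
    return [nb + svm + knn for nb, svm, knn in zip(nb_labels, svm_labels, knn_labels)]


def flag_probable_misclassifications(ranks):
    """Indices of documents with rank 1 or 2, i.e. where the classifiers disagree."""
    return [i for i, rank in enumerate(ranks) if rank in (1, 2)]


# Hypothetical predicted labels for five testing documents.
nb_labels = [1, 0, 1, 0, 1]
svm_labels = [1, 0, 0, 0, 1]
knn_labels = [1, 0, 1, 1, 0]

ranks = rank_documents(nb_labels, svm_labels, knn_labels)
flagged = flag_probable_misclassifications(ranks)
final_labels = svm_labels                      # SVM predictions are kept as the labels
percent_flagged = 100.0 * len(flagged) / len(ranks)
```

Here the ranks are 3, 0, 2, 1 and 2, so the last three documents are flagged for the user while the SVM labels are still reported for every document.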

It is seen from Table VII that it is possible to identify a few of the misclassified documents by combining the results of the three classifiers. Around 5% of the documents are flagged as probably misclassified. As the actual labels are available for the testing documents, it is seen that some of the documents misclassified by SVM are identified. Not all are identified, because the remaining ones were ranked 0 or 3, indicating that none of the classifiers assigned the correct label to those documents.

VII. CONCLUSIONS
Implementation and evaluation of Naive Bayes, SVM and KNN on categories from the 20 Newsgroup and WebKB datasets using two different weighting schemes resulted in several conclusions. The effectiveness of the weighting scheme used to represent documents depends on the nature of the dataset and also on the modeling approach adopted by the classifier. The classification accuracy of a classifier can be improved by using more training documents, which leads to a more generalized solution covering more possibilities. Naive Bayes mostly performs better than SVM and KNN when the number of training documents is small. The average improvement in SVM with more training documents is better than that of KNN and Naive Bayes. Parameter tuning of SVM and KNN using cross-validation assists in achieving a generalized solution suitable for a given dataset. Classification in KNN is not based on model building and depends only on local information. Thus its classification accuracy is lower


than Naive Bayes and SVM. The procedure to evaluate a suitable classifier for a given dataset using cross-validation is verified. A procedure for identifying the probable misclassified documents is developed by combining the results of the three classifiers, as they adopt different approaches to the text classification problem.
REFERENCES
[1] T. M. Mitchell, "The Discipline of Machine Learning", Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Technical Report CMU-ML-06-108.
[2] Vishal Gupta, Gurpreet S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1, August 2009, pp. 60-76.
[3] George Tzanis, Ioannis Katakis, Ioannis Partalas, Ioannis Vlahavas, "Modern Applications of Machine Learning", in Proceedings of the 1st Annual SEERC Doctoral Student Conference DSC 2006, pp. 1-10.
[4] Fabrizio Sebastiani, "Text categorization", in Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 109-129.
[5] Kevin P. Murphy, "Naive Bayes classifier", Technical report, Department of Computer Science, University of British Columbia, 2006.
[6] Haiyi Zhang, Di Li, "Naive Bayes Text Classifier", in 2007 IEEE International Conference on Granular Computing, pp. 708-711.
[7] S.L. Ting, W.H. Ip, Albert H.C. Tsang, "Is Naive Bayes a Good Classifier for Document Classification?", International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011, pp. 37-46.
[8] Andrew McCallum and Kamal Nigam, "A Comparison of Event Models for Naive Bayes Text Classification", in Learning for Text Categorization: Papers from the AAAI Workshop, AAAI Press, 1998, pp. 41-48, Technical Report WS-98-05.
[9] Steve Renals, "Text Classification using Naive Bayes", Learning and Data lecture 7, Informatics 2B. Available online: http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b11learnlec07-nup.pdf
[10] "Generative learning algorithm", lecture notes 2 for CS229, Department of Computer Science, Stanford University. Available online: http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf
[11] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", in Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
[12] Brian C. Lovell and Christian J. Walder, "Support Vector Machines for Business Applications", Business Applications and Computational Intelligence, Idea Group Publishers, 2006.
[13] Bernhard E. Boser, Isabelle M. Guyon, Vladimir N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM Press, 1992, pp. 144-152.
[14] Chih Chung Chang and Chih Jen Lin, "LIBSVM: A Library for Support Vector Machines", Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2001.
[15] Nello Cristianini, John Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods", Cambridge University Press, 2000.
[16] KNN classification details, available online at http://www.mathworks.in/help/toolbox/bioinfo/ref/knnclassify.html
[17] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A Practical Guide to Support Vector Classification", Technical report, Department of Computer Science, National Taiwan University, 2003.
[18] Zhijie Liu, Xueqiang Lv, Kun Liu, Shuicai Shi, "Study on SVM Compared with the other Text Classification Methods", in 2010 Second International Workshop on Education Technology and Computer Science, pp. 219-222.
[19] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional Word Clusters vs. Words for Text Categorization", Journal of Machine Learning Research, Volume 3, 2003, pp. 1183-1208.
[20] Dataset used in this paper, available online at http://web.ist.utl.pt/~acardoso/datasets/

AUTHORS PROFILE
Hetal Doshi received the B.E. (Electronics) degree in 2003 from the University of Mumbai and is currently pursuing her M.E. (Electronics and Telecommunication) at K. J. Somaiya College of Engineering (KJSCE), Vidyavihar, Mumbai, India. She has been in the teaching profession for the last 8 years and is working as Assistant Professor at KJSCE. Her areas of interest are Education Technology, Text Mining and Signal Processing.

Maruti Zalte received the M.E. (Electronics and Telecommunication) degree in 2006 from Govt. College of Engineering, Pune. He has been in the teaching profession for the last 9 years and is working as Associate Professor at KJSCE. His areas of interest are Digital Signal Processing and VLSI technology. He currently holds the post of Dean, Students Affairs, KJSCE.
