
IJCSIT International Journal of Computer Science and Information Technology, Vol. 4, No. 2, December 2011, pp. 85-90

Comparison of Classification Algorithms using WEKA on Various Datasets
Bharat Deshmukh1, Ajay S. Patil2 & B.V. Pawar2
1 Sinhgad Institute of Management and Computer Application (SIMCA), Pune-411041 (M.S.), India
E-mail: bdeshmukh2000@gmail.com
2 Department of Computer Science, North Maharashtra University, Jalgaon-425001 (M.S.), India
E-mail: aspatil@nmu.ac.in, bvpawar@nmu.ac.in

ABSTRACT: Data mining is a step in the knowledge discovery process in which data mining algorithms are used to find
patterns or models in data. Data mining can also be defined as an analytic process designed to explore large amounts
of data in search of consistent patterns and systematic relationships between variables, and then to validate the findings by
applying the detected patterns to new subsets of data. Classification is the most commonly applied data mining technique;
it employs a set of pre-classified examples to develop a model that can classify the population of records at large. In
classification, a model is built from training data and applied to test data. WEKA is an open source data
mining tool that includes implementations of data mining algorithms. Using WEKA we have compared the ADTree, Bayes
Network, Decision Table, J48, Logistic, Naive Bayes, NBTree, PART, RBFNetwork and SMO algorithms on five datasets.
Keywords: Algorithms, Data Mining, Classification

1. INTRODUCTION

Data mining [1] is a rapidly growing interdisciplinary field that merges database management, statistics,
machine learning and related areas, with the aim of extracting useful knowledge from large collections
of data. The data mining process consists of three basic stages: exploration, model building or pattern
definition, and validation/verification. Ideally, if the nature of the available data allows, the process is
repeated iteratively until a robust model is identified. However, in business practice the options for
validating the model at the analysis stage are typically limited and, thus, the initial results often have
the status of heuristics that could influence the decision process.

Data mining can be carried out with a large number of algorithms and techniques, including classification,
clustering, regression, association rules, artificial intelligence, neural networks, genetic algorithms and
the nearest neighbor method. WEKA includes implementations of various classification algorithms such as
decision trees, Naive Bayes and ZeroR. In this paper we have studied and compared the ADTree, Bayes Network,
Decision Table, J48, Logistic, Naive Bayes, NBTree, PART, RBF Network and SMO algorithms using four datasets
available from the UCI dataset repository and the bank dataset from DePaul University.

An ADTree (Alternating Decision Tree) [6] uses a new semantic representation of the decision tree, with
prediction nodes at the leaves and at the root as well. A Bayes network [7] is based on Bayes' theorem and
is a directed acyclic graphical model used to represent the conditional dependencies among a set of
random variables. The Bayes network has two main limitations: first, the computational difficulty of
exploring a previously unknown network and, second, the quality of the prior beliefs used in calculating
the network. Decision tables [2] lay out in tabular form all possible situations that a business decision
may encounter and specify which action to take in each of these situations. The decision table is one of
the simplest machine learning algorithms; its tabular format lists the table entries for input and output.
J48, a decision tree [2], is a predictive machine-learning model that decides the target value (dependent
variable) of a new sample based on the various attribute values of the available data. Logistic [8] is a
linear classifier for supervised learning with properties such as feature selection and robustness to noise.

The Naive Bayes classifier [9] works on a simple, but comparatively intuitive concept, and in some cases
it outperforms many comparatively complex algorithms. It is based on the Bayes rule of conditional
probability. It makes use of all the attributes contained in the data and analyses them individually, as
though they were equally important and independent of each other; each attribute is considered separately
when classifying a new instance. NBTree [10] is a hybrid approach that combines the capabilities of the
decision tree and the Naive Bayes classifier: the decision-tree nodes contain splits as in regular decision
trees, but the leaves contain Naive Bayes classifiers. PART [11] builds partial decision trees and is an
extension of C4.5 for generating rule sets. PART builds a rule, removes the instances it covers, and
continues creating rules recursively for the remaining instances until none are left. An RBF (radial basis
function) network [12] is a variant of the neural network organised in two layers. To use a radial basis
function network we need to specify the hidden unit activation function, the number of processing units,
a criterion for modeling the given task and a training algorithm for finding the parameters of the network.
SMO (sequential minimal optimization) [13] is an algorithm for training Support Vector Machines (SVM),
designed to solve the problem of handling large datasets in SVM.

2. RELATED WORK

In [2], Xindong Wu, Vipin Kumar et al. give a descriptive study of 10 data mining algorithms: C4.5,
k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART. Andrew Secker, Matthew N. Davies
et al. [3] compared different classification algorithms for the hierarchical prediction of protein function
based on the predictive accuracy of the classifiers. In [4], the authors performed experiments on weed and
crop images to test classification algorithms. In [5], Ryan Potter compared classification algorithms on a
breast cancer dataset for the diagnosis of patients.

3. EXPERIMENTAL DETAILS

We have used the datasets available from DePaul University and the UCI Machine Learning Repository. From
the WEKA GUI we have tested the ADTree (ADT), Bayes Network (BayesNet), DecisionTable (DT), J48, Logistic,
Naive Bayes (NB), NBTree (NBT), PART, RBFNetwork (RBFN) and SMO algorithms on five datasets (bank, car,
breast cancer, credit-g and diabetes) and observed the following results. Table 1 gives a brief description
of each dataset. We have compared the algorithms on the following measures: Correctly Classified Instances
(CCI), Incorrectly Classified Instances (ICI), Kappa Statistic (KS), Mean Absolute Error (MAE) and Root
Mean Squared Error (RMSE). The Kappa statistic measures inter-rater agreement for categorical items, i.e.
it is an index that compares the observed agreement against that which might be expected by chance. Kappa
can be thought of as the chance-corrected proportional agreement, and possible values range from +1
(perfect agreement) via 0 (no agreement above that expected by chance) to -1 (complete
Table 1
Description of Datasets available from DePaul University and UCI repository

Data Set       Number of   Number of   Number of  Attributes
               Attributes  data items  Classes
Bank           11          600         2          age, sex, region, income, married, children, car,
                                                  save-account, current-account, mortgage, pep
Car            6           1728        4          buying capacity, maintenance, number of doors, person
                                                  seating capacity, luggage boot space, safety and class
Breast Cancer  10          286         2          class, age, menopause, tumor size, inv-nodes, nodes-cap,
                                                  deg-malig, breast, breast-quad, irradiat
Credit-g       20          1000        2          checking account, month, credit history, purpose, amount,
                                                  saving account, present employment, rate, sex, residence,
                                                  property, age, other installments, housing, dependent,
                                                  telephone, foreign worker
Diabetes       9           768         2          number of times pregnant, plasma glucose concentration,
                                                  blood pressure, triceps skin fold thickness, serum insulin,
                                                  body mass index, diabetes pedigree function, age

disagreement). Mean absolute error measures how close the predictions are to the eventual outcomes; it is
the average of the absolute errors in the predictions. The root mean squared error reflects the variance
of the prediction errors: it is a frequently used measure of the differences between the values predicted
by a model or an estimator and the values actually observed from the thing being modeled or estimated.

3.1. Bank Dataset

The bank dataset has 11 attributes (age, sex, region, income, married, children, car, save-account,
current-account, mortgage and pep) and 600 data items, classified into two classes according to whether
or not the person will go for a Personal Equity Plan (PEP).

Table 2
Result from WEKA for Bank Dataset

Algorithm    CCI (%)  ICI (%)  KS      MAE     RMSE
ADT          84.67    15.33    0.6853  0.3350  0.3728
BayesNet     70.00    30.00    0.3862  0.3968  0.4487
DT           80.83    19.17    0.6123  0.2988  0.3750
J48          91.00     9.00    0.8178  0.1559  0.2903
Logistic     73.00    27.00    0.4518  0.3607  0.4303
Naive Bayes  69.00    31.00    0.3724  0.3773  0.4397
NBT          88.67    11.33    0.7710  0.1766  0.3194
PART         85.17    14.83    0.7003  0.1803  0.3573
RBFN         73.33    26.67    0.4585  0.3590  0.4317
SMO          70.80    29.20    0.4062  0.2917  0.5401

Figure 1: Graph of KS, MAE and RMSE for Bank Dataset

For the bank dataset the J48 decision tree performs best, followed by NBTree and PART. J48 provides the
highest percentage of correctly classified instances for the bank dataset. The Kappa statistic for J48
(0.8178) is the closest to 1, which indicates strong agreement beyond chance in the classification of data
items. J48 also has the lowest mean absolute error and root mean squared error, i.e. its predictions are
the closest to the observed values and show the least variance.

3.2. Car Dataset

The car dataset has 6 attributes (buying capacity, maintenance, number of doors, person seating capacity,
luggage boot space and safety) plus the class, and 1728 data items. The data items are classified into
four classes (unacc, acc, good and very good) that give the acceptance level of the car by people based on
the six attributes.

Table 3
Result from WEKA for Car Dataset

Algorithm    CCI (%)  ICI (%)  KS      MAE     RMSE
ADT          -        -        -       -       -
BayesNet     85.71    14.29    0.6713  0.1114  0.2254
DT           91.03     8.97    0.7987  0.2748  0.3220
J48          92.36     7.64    0.8343  0.0421  0.1718
Logistic     93.11     6.89    0.8504  0.0428  0.1520
Naive Bayes  85.53    14.47    0.6665  0.1137  0.2262
NBT          94.21     5.79    0.8752  0.0676  0.1571
PART         95.78     4.22    0.9091  0.0241  0.1276
RBFN         94.21     5.79    0.8752  0.0676  0.1571
SMO          93.75     6.25    0.8649  0.2559  0.3202

For the car dataset PART performs best, followed by RBFNetwork and NBTree. ADTree is disabled for the car
dataset in WEKA because it provides predictions only for datasets with two classes. The Kappa statistic
for PART (0.9091) is the closest to perfect agreement. PART has the highest percentage of correctly
classified instances and the lowest mean absolute error and root mean squared error.

Figure 2: Graph of KS, MAE and RMSE for Car Dataset
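The evaluation measures defined in Section 3 (Kappa statistic, MAE and RMSE) can be illustrated with a minimal sketch. The function names and the toy inputs below are assumptions made for illustration only; they are not taken from WEKA, and WEKA's own computation (which works on predicted class probability distributions) may differ in detail.

```python
import math

def kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: fraction of instances on the diagonal.
    p_o = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: sum over classes of (row marginal * column marginal).
    p_e = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(k)
    )
    return (p_o - p_e) / (1 - p_e)

def mae(predicted, actual):
    """Mean absolute error: average of |prediction - outcome|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical two-class example (not data from this paper).
toy_confusion = [[45, 5], [10, 40]]
toy_predicted = [0.9, 0.2, 0.6, 0.4]   # predicted probability of the positive class
toy_actual = [1, 0, 1, 0]              # observed outcome (1 = positive)

print(round(kappa(toy_confusion), 4))
print(round(mae(toy_predicted, toy_actual), 4))
print(round(rmse(toy_predicted, toy_actual), 4))
```

Because RMSE squares each error before averaging, a few large errors raise it more than they raise MAE, which is why an algorithm can have a low MAE and a comparatively high RMSE on the same dataset.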

3.3. Breast Cancer Dataset

The breast cancer dataset has 10 attributes (class, age, menopause, tumor size, inv-nodes, nodes-cap,
deg-malig, breast, breast-quad and irradiat) and 286 data items. The data items are classified into
no-recurrence and recurrence events based on the attributes.

Table 4
Result from WEKA for Breast Cancer Dataset

Algorithm    CCI (%)  ICI (%)  KS      MAE     RMSE
ADT          73.78    26.22    0.3290  0.3919  0.4333
BayesNet     72.03    27.97    0.2919  0.3297  0.4566
DT           73.43    26.57    0.2462  0.3748  0.4407
J48          75.52    24.48    0.2826  0.3676  0.4324
Logistic     68.88    31.12    0.1979  0.3700  0.4631
Naive Bayes  71.68    28.32    0.2857  0.3272  0.4534
NBT          70.98    29.02    0.2465  0.3265  0.4753
PART         71.33    28.67    0.1995  0.3650  0.4762
RBFN         70.98    29.02    0.2177  0.3574  0.4443
SMO          69.58    30.42    0.1983  0.3042  0.5515

For the breast cancer dataset the J48 decision tree performs best, followed by ADTree and Decision Table.
J48 provides the highest percentage of correctly classified instances for the breast cancer dataset. The
Kappa statistic for J48 is much closer to 0 (0.2826) and lies below that of ADTree (0.3290), which
indicates little agreement beyond that expected by chance. Nevertheless, J48 has a higher percentage of
correctly classified instances than ADTree and the other algorithms. The mean absolute error and root mean
squared error of J48 are close to those of ADTree and Logistic.

Figure 3: Graph of KS, MAE and RMSE for Breast Cancer Dataset

3.4. Credit-g Dataset

The German credit dataset (credit-g) [16] has 20 attributes (checking account, month, credit history,
purpose, amount, saving account, present employment, rate, sex, residence, property, age, other
installments, housing, dependent, telephone and foreign worker) and 1000 data instances. Based on the 20
attributes, the credit rating of a person is classified into two classes, good and bad.

Table 5
Results from WEKA for Credit-g Dataset

Algorithm    CCI (%)  ICI (%)  KS      MAE     RMSE
ADT          72.40    27.60    0.2988  0.3895  0.4315
BayesNet     75.50    24.50    0.3893  0.3101  0.4187
DT           71.00    29.00    0.2033  0.3677  0.4321
J48          70.50    29.50    0.2467  0.3467  0.4796
Logistic     75.20    24.80    0.3750  0.3098  0.4087
Naive Bayes  75.40    24.60    0.3813  0.2936  0.4201
NBT          75.50    24.50    0.3918  0.3102  0.4221
PART         70.20    29.80    0.2767  0.3245  0.4974
RBFN         74.00    26.00    0.3340  0.3388  0.4204
SMO          75.10    24.90    0.3654  0.2490  0.4990

For the credit-g dataset Bayes network and NBTree perform best, followed by Naive Bayes (75.40%) and
Logistic (75.20%). Bayes network and NBTree have the same percentage of correctly classified instances
(75.50%), and both algorithms have almost the same values for the Kappa statistic, mean absolute error
and root mean squared error. NBTree is a hybrid approach based on Naive Bayes and decision trees, whereas
the Bayes network is based on Bayes' theorem only.

Figure 4: Graph of KS, MAE and RMSE for Credit-g Dataset
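The per-dataset rankings discussed in Sections 3.3 and 3.4 can be checked directly against the CCI columns of Tables 4 and 5. The following is a minimal sketch; the helper name `top_algorithms` and the alphabetical tie-breaking rule are assumptions introduced here for illustration and are not part of the original experiments.

```python
# CCI (%) values copied from Table 4 (breast cancer) and Table 5 (credit-g).
CCI = {
    "breast cancer": {
        "ADT": 73.78, "BayesNet": 72.03, "DT": 73.43, "J48": 75.52,
        "Logistic": 68.88, "Naive Bayes": 71.68, "NBT": 70.98,
        "PART": 71.33, "RBFN": 70.98, "SMO": 69.58,
    },
    "credit-g": {
        "ADT": 72.40, "BayesNet": 75.50, "DT": 71.00, "J48": 70.50,
        "Logistic": 75.20, "Naive Bayes": 75.40, "NBT": 75.50,
        "PART": 70.20, "RBFN": 74.00, "SMO": 75.10,
    },
}

def top_algorithms(scores, n=3):
    """Return the n best (algorithm, CCI) pairs, highest CCI first.

    Ties are broken alphabetically by algorithm name (an arbitrary choice).
    """
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

for dataset, scores in CCI.items():
    print(dataset, top_algorithms(scores))
```

Ranking on CCI alone ignores the Kappa statistic and the error measures, so it should be read together with the remaining columns of each table.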

3.5. Diabetes Dataset

The diabetes dataset has 9 attributes (number of times pregnant, plasma glucose concentration, blood
pressure, triceps skin fold thickness, serum insulin, body mass index, diabetes pedigree function and age)
and 768 data instances. The instances are classified into two classes according to whether or not the
woman tested positive for diabetes.

Table 6
Results from WEKA for Diabetes Dataset

Algorithm    CCI (%)  ICI (%)  KS      MAE     RMSE
ADT          72.92    27.08    0.3736  0.3613  0.4195
BayesNet     74.35    25.65    0.4290  0.2987  0.4208
DT           71.22    28.78    0.3492  0.3448  0.4277
J48          73.83    26.17    0.4164  0.3158  0.4463
Logistic     77.21    22.79    0.4734  0.3094  0.3954
Naive Bayes  76.30    23.70    0.4664  0.2841  0.4168
NBT          74.35    25.65    0.4260  0.3099  0.4280
PART         75.26    24.74    0.4390  0.3101  0.4149
RBFN         75.39    24.61    0.4303  0.3448  0.4191
SMO          77.34    22.66    0.4682  0.2266  0.4760

For the diabetes dataset SMO performs best, followed by Logistic and Naive Bayes. SMO has the highest
percentage of correctly classified instances (77.34%), and its Kappa statistic (0.4682) is second only to
that of Logistic (0.4734). Its mean absolute error is the lowest (0.2266), as SMO gives the closest
predictions for the diabetes dataset, but its root mean squared error (0.4760) is not lower than that of
the other algorithms (it is in fact the highest), which indicates that the variance in its predictions is
at least comparable to theirs.

Figure 5: Graph of KS, MAE and RMSE for Diabetes Dataset

4. CONCLUSION

In this paper we have studied and compared the Bayes network, Naive Bayes, SMO, RBFNetwork, Logistic,
decision tree (J48), ADTree, NBTree, PART and Decision Table algorithms on five datasets in WEKA. The
overall observation is that no algorithm performs best for every dataset. For the bank and breast cancer
datasets J48 has more correctly classified instances than the remaining algorithms. For the credit-g and
diabetes datasets Naive Bayes is among the top five algorithms. This shows that there is no single
classification algorithm that provides the best predictive model for all datasets. The accuracy of a
predictive model is affected by the selection of attributes. We therefore conclude that different
classification algorithms are suited to perform better on certain types of dataset.

REFERENCES

[1] Daniel T. Larose, Data Mining Methods and Models, John Wiley & Sons, Inc., Hoboken, New Jersey (2006).
[2] Xindong Wu, Vipin Kumar et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems,
14(1), 1-37, (2008).
[3] Andrew Secker, Matthew N. Davies et al., An Experimental Comparison of Classification Algorithms for
the Hierarchical Prediction of Protein Function, Expert Update (the BCS-SGAI Magazine), 9(3), 17-22, (2007).
[4] Martin Weis, Till Rumpf, Roland Gerhards, Lutz Plümer, Comparison of Different Classification Algorithms
for Weed Detection from Images based on Shape Parameters, ATB Publication Volume 69, ISSN 00947-7314,
Page No. 53-64, (2007).
[5] Ryan Potter, Comparison of Classification Algorithms Applied to Breast Cancer Diagnosis and Prognosis,
Wiley Expert Systems, 24(1), 17-31, (2007).
[6] Yoav Freund and Llew Mason, The Alternating Decision Tree Learning Algorithm, International Conference
on Machine Learning, 124-133, (1999).
[7] Daryle Niedermayer, An Introduction to Bayesian Networks and their Contemporary Applications, Springer
Studies in Computational Intelligence, 56, 117-130, (2008).
[8] Jianing Shi, Wotao Yin et al., Fast Hybrid Algorithm for Large-Scale ℓ1-Regularized Logistic Regression,
Journal of Machine Learning Research, 11, 713-741, (2010).
[9] Kim Larsen, Generalized Naive Bayes Classifiers, SIGKDD Explorations, 7(1), 76-81, (2005).
[10] Manuel J. Fonseca, Joaquim A. Jorge, NB-Tree: An Indexing Structure for Content-Based Retrieval in
Large Databases, Proceedings of the Eighth International Conference on Database Systems for Advanced
Applications, 267-276, (2003).
[11] Eibe Frank, Ian H. Witten, Generating Accurate Rule Sets without Global Optimization, Proceedings of
the Fifteenth International Conference on Machine Learning, 144-151, (1998).

[12] Adrian G. Bors, I. Pitas, Introduction to RBF Network, Online Symposium for Electronics Engineers,
1(1), 1-7, (2001).
[13] Jingmin Wang, Kanzhang Wu, Study of the SMO Algorithm Applied in Power System Load Forecasting,
Springer LNCS, pp. 1022-1026, (2006).