
Comparison of Classification Algorithms using WEKA on Various Datasets

Bharat Deshmukh1, Ajay S. Patil2 & B.V. Pawar2

1 Sinhgad Institute of Management and Computer Application (SIMCA), Pune-411041 (M.S.), India
E-mail: bdeshmukh2000@gmail.com
2 Department of Computer Science, North Maharashtra University, Jalgaon-425001 (M.S.), India
E-mail: aspatil@nmu.ac.in, bvpawar@nmu.ac.in

ABSTRACT: Data mining is a step in the knowledge discovery process in which data mining algorithms are used to find patterns or models in data. Data mining can also be defined as an analytic process designed to explore large amounts of data in search of consistent patterns and systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. Classification is the most commonly applied data mining technique; it employs a set of pre-classified examples to develop a model that can classify the population of records at large. In classification techniques a model is built from training data and applied to test data. WEKA is an open source data mining tool which includes implementations of data mining algorithms. Using WEKA we have compared the ADTree, Bayes Network, Decision Table, J48, Logistic, Naive Bayes, NBTree, PART, RBFNetwork and SMO algorithms. To compare these algorithms we have used five datasets.

Keywords: Algorithms, Data Mining, Classification

1. INTRODUCTION

Data mining [1] is a rapidly growing interdisciplinary field which merges database management, statistics, machine learning and related areas, aiming at extracting useful knowledge from large collections of data. The data mining process consists of three basic stages: exploration, model building or pattern definition, and validation/verification. Ideally, if the nature of the available data allows, the process is repeated iteratively until a robust model is identified. However, in business practice the options for validating the model at the analysis stage are typically limited, and thus the initial results often have the status of heuristics that could influence the decision process. Data mining can be done with a large number of algorithms and techniques, including classification, clustering, regression, association rules, artificial intelligence, neural networks, genetic algorithms, the nearest neighbor method etc. WEKA includes implementations of various classification algorithms such as decision trees, Naïve Bayes, ZeroR etc. In this paper we have studied and compared the ADTree, Bayes Network, Decision Table, J48, Logistic, Naive Bayes, NBTree, PART, RBFNetwork and SMO algorithms using the five datasets available from Depaul University and the UCI repository.

An ADTree (Alternating Decision Tree) [6] is a newer semantic representation of a decision tree which has prediction nodes at the leaves as well as at the root. A Bayes network algorithm [7] is based on Bayes' theorem and is a directed acyclic graphical model used to represent the conditional dependencies between a set of random variables. There are two main limitations of Bayes networks: first, the computational difficulty of exploring a previously unknown network, and second, the quality of the prior beliefs used in calculating the network. Decision tables [2] are used to lay out in tabular form all possible situations which a business decision may encounter, and to specify which action to take in each of these situations; the decision table is one of the simplest machine learning algorithms, using a tabular format whose entries map inputs to outputs. J48 [2] is a decision tree, a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on the attribute values of the available data. Logistic [8] is a linear classifier for supervised learning with properties such as feature selection and robustness to noise.

The Naïve Bayes classifier [9] works on a simple but comparatively intuitive concept, and in some cases it outperforms many considerably more complex algorithms. It is based on the Bayes rule of conditional probability: it makes use of all the attributes contained in the data and analyses them individually, as though they were equally important and independent of each other, considering each attribute separately when classifying a new instance. NBTree [10] is a hybrid approach which combines the capabilities of decision trees and the Naïve Bayes classifier: the decision-tree nodes contain splits as in regular decision trees, but the leaves contain Naive-Bayesian classifiers. PART [11] is a partial decision tree learner, an extension of C4.5 for generating rule sets; PART builds a rule, removes the instances it covers, and continues creating rules recursively for the remaining instances until none are left. An RBF (radial basis function) network [12] is a variant of neural network organized in two layers; to use one we need to specify the hidden unit activation function, the number of processing units, a criterion for modeling the given task and a training algorithm for finding the parameters of the network. SMO (sequential minimal optimization) [13] is an extension of Support Vector Machines (SVM) that addresses the problem of handling large datasets in SVM training.
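For reference, the ten compared algorithms map onto WEKA classes as sketched below. This listing is not from the paper; the package paths follow the WEKA 3.x Java API of the period (in WEKA 3.8+, ADTree, NBTree and RBFNetwork have moved into separately installed packages).

import weka.classifiers.Classifier;

// The ten classifiers compared in this paper, as WEKA Java classes.
public class ComparedClassifiers {
    public static Classifier[] all() {
        return new Classifier[] {
            new weka.classifiers.trees.ADTree(),         // alternating decision tree [6]
            new weka.classifiers.bayes.BayesNet(),       // Bayesian network [7]
            new weka.classifiers.rules.DecisionTable(),  // decision table [2]
            new weka.classifiers.trees.J48(),            // C4.5-style decision tree [2]
            new weka.classifiers.functions.Logistic(),   // logistic regression [8]
            new weka.classifiers.bayes.NaiveBayes(),     // naive Bayes [9]
            new weka.classifiers.trees.NBTree(),         // decision tree with naive Bayes leaves [10]
            new weka.classifiers.rules.PART(),           // partial decision tree rule learner [11]
            new weka.classifiers.functions.RBFNetwork(), // radial basis function network [12]
            new weka.classifiers.functions.SMO()         // SVM via sequential minimal optimization [13]
        };
    }
}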

2. RELATED WORK

In [2], Xindong Wu, Vipin Kumar et al. give a descriptive study of the top 10 data mining algorithms, which include C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART. Andrew Secker, Matthew N. Davies et al. [3] compared different classification algorithms for the hierarchical prediction of protein function based on the predictive accuracy of the classifiers. In [4], the authors performed experiments on weed and crop images to test classification algorithms. In [5], Ryan Potter compared classification algorithms on a breast cancer dataset to perform diagnosis of the patients.

3. EXPERIMENTAL DETAILS

We have used datasets available from Depaul University and the UCI machine learning repository. From the WEKA GUI we have tested the ADTree (ADT), Bayes Network (BayesNet), Decision Table (DT), J48, Logistic, Naive Bayes (NB), NBTree (NBT), PART, RBFNetwork (RBFN) and SMO algorithms on five datasets and observed the results reported below. We have used the bank, car, breast cancer, credit-g and diabetes datasets; Table 1 gives a brief description of each.

Table 1: Description of datasets available from Depaul University and the UCI repository

Dataset        Attributes  Attribute description                                       Data items  Classes
Bank           11          age, sex, region, income, married, children, car,           600         2
                           save-account, current-account, mortgage, pep
Car            6           buying capacity, maintenance, number of doors, person       1728        4
                           seating capacity, luggage boot space, safety and class
Breast Cancer  10          class, age, menopause, tumor size, inv-nodes, nodes-cap,    286         2
                           deg-malig, breast, breast-quad, irradiat
Credit-g       20          checking account, month, credit history, purpose, amount,   1000        2
                           saving account, present employment, rate, sex, residence,
                           property, age, other installments, housing, dependent,
                           telephone, foreign worker
Diabetes       9           number of times pregnant, plasma glucose concentration,     768         2
                           blood pressure, triceps skin fold thickness, serum insulin,
                           body mass index, diabetes pedigree function, age
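As a minimal sketch, the counts in Table 1 can be checked programmatically. The file names below are assumptions (the paper names the datasets but not the files); each ARFF file is assumed to have the class as its last attribute.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Prints attribute, instance and class counts for each dataset in Table 1.
public class DescribeDatasets {
    public static void main(String[] args) throws Exception {
        String[] files = {"bank.arff", "car.arff", "breast-cancer.arff",
                          "credit-g.arff", "diabetes.arff"};   // assumed file names
        for (String f : files) {
            Instances data = DataSource.read(f);
            data.setClassIndex(data.numAttributes() - 1);      // class = last attribute
            System.out.printf("%s: %d attributes, %d instances, %d classes%n",
                    data.relationName(), data.numAttributes(),
                    data.numInstances(), data.classAttribute().numValues());
        }
    }
}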

We have studied and compared these algorithms on the parameters Correctly Classified Instances (CCI), Incorrectly Classified Instances (ICI), Kappa Statistic (KS), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The Kappa statistic measures inter-rater agreement for categorical items, i.e. it is an index which compares the observed agreement against that which might be expected by chance. Kappa can be thought of as the chance-corrected proportional agreement; possible values range from +1 (perfect agreement) via 0 (no agreement above that expected by chance) to -1 (complete disagreement). Mean absolute error measures how close predictions are to the eventual outcomes; it is the average of the absolute errors of the predictions. Root mean squared error is a measure of the variance of the predictions; it is a frequently used measure of the differences between the values predicted by a model or estimator and the values actually observed from the thing being modeled or estimated.
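In standard notation (the paper describes these measures but does not write them out), with p_o the observed agreement, p_e the agreement expected by chance, and y_i, \hat{y}_i the actual and predicted values over n instances:

\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}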

3.1. Bank Dataset

In the bank dataset there are 11 attributes (age, sex, region, income, married, children, car, save-account, current-account, mortgage and pep) and 600 data items classified into two classes, according to whether the person will go for a Pension Equity Plan (PEP) or not.

Table 2: Result from WEKA for Bank Dataset

Algorithm    CCI(%)  ICI(%)  KS      MAE     RMSE
ADT          84.67   15.33   0.6853  0.3350  0.3728
BayesNet     70.00   30.00   0.3862  0.3968  0.4487
DT           80.83   19.17   0.6123  0.2988  0.3750
J48          91.00   09.00   0.8178  0.1559  0.2903
Logistic     73.00   27.00   0.4518  0.3607  0.4303
Naive Bayes  69.00   31.00   0.3724  0.3773  0.4397
NBT          88.67   11.33   0.7710  0.1766  0.3194
PART         85.17   14.83   0.7003  0.1803  0.3573
RBFN         73.33   26.67   0.4585  0.3590  0.4317
SMO          70.80   29.20   0.4062  0.2917  0.5401

For the bank dataset J48 performs the best, followed by NBTree and PART. J48 provides the highest percentage of correctly classified instances for the bank dataset. The Kappa statistic value for J48 is closest to 1 (i.e. 0.8178), which indicates that J48 comes closest to perfect agreement in classifying the data items. J48 also has the lowest mean absolute error and root mean squared error, as it gives the most accurate predictions with the least variance.
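The following sketch shows how a row of Table 2 could be reproduced with WEKA's Java API instead of the GUI. The file name bank.arff is an assumption, and since the paper does not state its evaluation protocol, the Explorer's default 10-fold cross-validation is assumed here.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Evaluates J48 on the bank data and prints the five measures of Table 2.
public class BankEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");       // assumed file name
        data.setClassIndex(data.numAttributes() - 1);        // class = pep (last attribute)

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));  // assumed 10-fold CV

        System.out.printf("CCI(%%)=%.2f  ICI(%%)=%.2f  KS=%.4f  MAE=%.4f  RMSE=%.4f%n",
                eval.pctCorrect(), eval.pctIncorrect(), eval.kappa(),
                eval.meanAbsoluteError(), eval.rootMeanSquaredError());
    }
}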

3.2. Car Dataset

There are 6 attributes (buying capacity, maintenance, number of doors, person seating capacity, luggage boot space, safety and class) and 1728 data items in the car dataset. The data items are classified into four classes (unacc, acc, good and very good), the acceptance level of the car by people, based on the six attributes.

Table 3: Result from WEKA for Car Dataset

Algorithm    CCI(%)  ICI(%)  KS      MAE     RMSE
ADT          -       -       -       -       -
BayesNet     85.71   14.29   0.6713  0.1114  0.2254
DT           91.03   08.97   0.7987  0.2748  0.3220
J48          92.36   07.64   0.8343  0.0421  0.1718
Logistic     93.11   06.89   0.8504  0.0428  0.1520
Naive Bayes  85.53   14.47   0.6665  0.1137  0.2262
NBT          94.21   05.79   0.8752  0.0676  0.1571
PART         95.78   04.22   0.9091  0.0241  0.1276
RBFN         94.21   05.79   0.8752  0.0676  0.1571
SMO          93.75   06.25   0.8649  0.2559  0.3202

For the car dataset PART performs the best, followed by RBFNetwork and NBTree. ADTree is disabled for the car dataset in WEKA because it only provides predictions for datasets with two classes. The Kappa statistic for PART is closest to perfect agreement (i.e. 0.9091). PART has the highest percentage of correctly classified instances and the lowest mean absolute error and root mean squared error.

Figure 2: Graph of KS, MAE and RMSE for Car Dataset
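The two-class restriction that disables ADTree here can be worked around in WEKA with the one-vs-rest meta-classifier, sketched below. This is an alternative the paper does not use, included only to show that the restriction is not fundamental.

import weka.classifiers.Classifier;
import weka.classifiers.meta.MultiClassClassifier;
import weka.classifiers.trees.ADTree;

// Wraps the binary-only ADTree so it can be applied to the 4-class car data:
// MultiClassClassifier trains one binary problem per class (one-vs-rest by default).
public class MultiClassADTree {
    public static Classifier build() {
        MultiClassClassifier wrapper = new MultiClassClassifier();
        wrapper.setClassifier(new ADTree());
        return wrapper;   // usable with Evaluation.crossValidateModel() like any classifier
    }
}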

3.3. Breast Cancer Dataset

The breast cancer dataset has 10 attributes (class, age, menopause, tumor size, inv-nodes, nodes-cap, deg-malig, breast, breast-quad and irradiat) and 286 data items. Data items are classified into no-recurrence and recurrence events based on the attributes.

Table 4: Result from WEKA for Breast Cancer Dataset

Algorithm    CCI(%)  ICI(%)  KS      MAE     RMSE
ADT          73.78   26.22   0.3290  0.3919  0.4333
BayesNet     72.03   27.97   0.2919  0.3297  0.4566
DT           73.43   26.57   0.2462  0.3748  0.4407
J48          75.52   24.48   0.2826  0.3676  0.4324
Logistic     68.88   31.12   0.1979  0.3700  0.4631
Naive Bayes  71.68   28.32   0.2857  0.3272  0.4534
NBT          70.98   29.02   0.2465  0.3265  0.4753
PART         71.33   28.67   0.1995  0.3650  0.4762
RBFN         70.98   29.02   0.2177  0.3574  0.4443
SMO          69.58   30.42   0.1983  0.3042  0.5515

For the breast cancer dataset J48 performs the best, followed by ADTree and Decision Table. J48 provides the highest percentage of correctly classified instances for this dataset. However, the Kappa statistic value for J48 is much closer to 0 (i.e. 0.2826), close to the Kappa statistic value of ADTree, which indicates little agreement beyond that expected by chance. Still, J48 has a higher percentage of correctly classified instances than ADTree and the other algorithms. The mean absolute error and root mean squared error of J48 are close to those of ADTree and Logistic.

Figure 3: Graph of KS, MAE and RMSE for Breast Cancer Dataset

3.4. Credit-g Dataset

In the credit German dataset [16] there are 20 attributes (checking account, month, credit history, purpose, amount, saving account, present employment, rate, sex, residence, property, age, other installments, housing, dependent, telephone and foreign worker) and 1000 data instances. Based on the 20 attributes, the credit rating of a person is classified into two classes, good and bad.

Table 5: Results from WEKA for Credit-g Dataset

Algorithm    CCI(%)  ICI(%)  KS      MAE     RMSE
ADT          72.40   27.60   0.2988  0.3895  0.4315
BayesNet     75.50   24.50   0.3893  0.3101  0.4187
DT           71.00   29.00   0.2033  0.3677  0.4321
J48          70.50   29.50   0.2467  0.3467  0.4796
Logistic     75.20   24.80   0.3750  0.3098  0.4087
Naive Bayes  75.40   24.60   0.3813  0.2936  0.4201
NBT          75.50   24.50   0.3918  0.3102  0.4221
PART         70.20   29.80   0.2767  0.3245  0.4974
RBFN         74.00   26.00   0.3340  0.3388  0.4204
SMO          75.10   24.90   0.3654  0.2490  0.4990

For the credit-g dataset Bayes network and NBTree perform the best, followed by Naïve Bayes and RBFN. Bayes network and NBTree have the same percentage of correctly classified instances, and both algorithms have almost the same values for the Kappa statistic, mean absolute error and root mean squared error. NBTree is a hybrid approach based on Naïve Bayes and the decision tree, whereas the Bayes network is based on Naïve Bayes only.

Figure 4: Graph of KS, MAE and RMSE for Credit-g Dataset

3.5. Diabetes Dataset

The diabetes dataset has 9 attributes (number of times pregnant, plasma glucose concentration, blood pressure, triceps skin fold thickness, serum insulin, body mass index, diabetes pedigree function and age) and 768 data instances. The instances are classified into two classes according to whether the woman tested positive for diabetes or not.

Table 6: Results from WEKA for Diabetes Dataset

Algorithm    CCI(%)  ICI(%)  KS      MAE     RMSE
ADT          72.92   27.08   0.3736  0.3613  0.4195
BayesNet     74.35   25.65   0.4290  0.2987  0.4208
DT           71.22   28.78   0.3492  0.3448  0.4277
J48          73.83   26.17   0.4164  0.3158  0.4463
Logistic     77.21   22.79   0.4734  0.3094  0.3954
Naive Bayes  76.30   23.70   0.4664  0.2841  0.4168
NBT          74.35   25.65   0.4260  0.3099  0.4280
PART         75.26   24.74   0.4390  0.3101  0.4149
RBFN         75.39   24.61   0.4303  0.3448  0.4191
SMO          77.34   22.66   0.4682  0.2266  0.4760

For the diabetes dataset SMO performs the best, followed by Logistic and Naïve Bayes. SMO has the highest percentage of correctly classified instances and one of the highest values for the Kappa statistic (slightly below that of Logistic). SMO's mean absolute error is the lowest, as it gives the closest predictions for the diabetes dataset, but its root mean squared error is not low, which indicates a possibility of variance in the predictions similar to that of the other algorithms.

Figure 5: Graph of KS, MAE and RMSE for Diabetes Dataset

4. CONCLUSION

In this paper we have studied and compared the Bayes network, Naïve Bayes, SMO, RBFNetwork, Logistic, decision tree (J48), ADTree, NBTree and Decision Table algorithms on five datasets in WEKA. The overall observation is that no algorithm performs the best for every dataset. For the bank and breast cancer datasets J48 has more correctly classified instances than the remaining algorithms; for the credit-g and diabetes datasets Naïve Bayes is among the top five algorithms. This shows that there is no single classification algorithm which can provide the best predictive model for all datasets. The accuracy of a predictive model is also affected by the selection of attributes. From this we can conclude that different classification algorithms are designed to perform better for certain types of datasets.

REFERENCES

[1] Daniel T. Larose, Data Mining Methods and Models, John Wiley & Sons, Inc., Hoboken, New Jersey (2006).

[2] Xindong Wu, Vipin Kumar et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(1), 1-37 (2008).

[3] Andrew Secker, Matthew N. Davies et al., An Experimental Comparison of Classification Algorithms for the Hierarchical Prediction of Protein Function, Expert Update (the BCS-SGAI Magazine), 9(3), 17-22 (2007).

[4] Martin Weis, Till Rumpf, Roland Gerhards, Lutz Plümer, Comparison of Different Classification Algorithms for Weed Detection from Images based on Shape Parameters, ATB Publication, 69, ISSN 0947-7314, 53-64 (2007).

[5] Ryan Potter, Comparison of Classification Algorithms Applied to Breast Cancer Diagnosis and Prognosis, Wiley Expert Systems, 24(1), 17-31 (2007).

[6] Yoav Freund and Llew Mason, The Alternating Decision Tree Learning Algorithm, International Conference on Machine Learning, 124-133 (1999).

[7] Daryle Niedermayer, An Introduction to Bayesian Networks and their Contemporary Applications, Springer Studies in Computational Intelligence, 56, 117-130 (2008).

[8] Jianing Shi, Wotao Yin et al., Fast Hybrid Algorithm for Large-Scale l1-Regularized Logistic Regression, Journal of Machine Learning Research, 11, 713-741 (2010).

[9] Kim Larsen, Generalized Naive Bayes Classifiers, SIGKDD Explorations, 7(1), 76-81 (2005).

[10] Manuel J. Fonseca, Joaquim A. Jorge, NB-Tree: An Indexing Structure for Content-Based Retrieval in Large Databases, Proceedings of the Eighth International Conference on Database Systems for Advanced Applications, 267-276 (2003).

[11] Eibe Frank, Ian H. Witten, Generating Accurate Rule Sets without Global Optimization, Proceedings of the Fifteenth International Conference on Machine Learning, 144-151 (1998).

[12] Adrian G. Bors, I. Pitas, Introduction to the RBF Network, Online Symposium for Electronics Engineers, 1(1), 1-7 (2001).

[13] Jingmin Wang, Kanzhang Wu, Study of the SMO Algorithm Applied in Power System Load Forecasting, Springer LNCS, 1022-1026 (2006).
