
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 3, Issue 3, October-December (2012), pp. 155-167
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI), www.jifactor.com

IMPROVING THE PERFORMANCE OF K-NEAREST NEIGHBOR ALGORITHM FOR THE CLASSIFICATION OF DIABETES DATASET WITH MISSING VALUES
Y. Angeline Christobel¹, P. Sivaprakasam²
¹ Research Scholar, Department of Computer Science, Karpagam University, Coimbatore, India
² Professor, Department of Computer Science, Sri Vasavi College, Erode, India

ABSTRACT

In today's world, people are affected by many diseases that cannot be completely cured. Diabetes is one such disease and is now a large and growing health problem. It leads to the risk of heart attack, kidney failure and renal disease. Data mining techniques have been widely applied to extract knowledge from medical databases. In this paper, we evaluate the performance of the k-nearest neighbor (kNN) algorithm for the classification of diabetes data. We consider data imputation, scaling and normalization techniques to improve the accuracy of the classifier on diabetes data, which may contain many missing values. We chose to explore kNN because it is very simple and faster than most of the more complex classification algorithms. To measure performance, we use accuracy and error rate as the metrics. We found that data imputation alone does not lead to higher accuracy; rather, it yields a correct accuracy estimate in the presence of missing values, and imputation combined with a suitable data preprocessing method increases the accuracy.

Keywords: Data Mining, Classification, kNN, Data Imputation Methods, Data Normalization and Scaling

1. INTRODUCTION
Hospitals and medical institutions are finding it difficult to extract useful information for decision support because of the growth of medical data, so methods for efficient computer-based analysis are essential. It has been shown that introducing machine learning into medical analysis can increase diagnostic accuracy, reduce costs and reduce demands on human resources. Earlier detection of disease aids increased exposure to required patient care and improved cure rates [17]. Diabetes is one of the most deadly diseases observed in many nations. According to the World Health Organization (WHO) [18], millions of

people in this world are suffering from diabetes. Data classification is an important problem in engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, and artificial intelligence. The goal of data classification is to assign objects to a number of categories or classes.

A lot of research work has been done on the Pima Indian diabetes dataset. In [8], Jayalakshmi and Santhakumaran used an ANN method for diagnosing diabetes on the Pima Indian diabetes dataset without missing data and obtained 68.56% classification accuracy. Manaswini Pradhan et al. [12] suggested an Artificial Neural Network (ANN) based classification model for classifying diabetic patients, with a best accuracy of 72% and an average accuracy of 72.2%. Umar Sathic Ali and Dr. C. Jothi Ventakeswaran [13] proposed an improved evidence-theoretic kNN algorithm which combines the Dempster-Shafer theory of evidence and the k-nearest neighbor rule with a distance-metric-based neighborhood, achieving an average accuracy of 73.57%. Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang, and Albert Y. Zomaya [14] proposed a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining, in which multiple classification algorithms are used to guide the sampling process. The classification algorithms used in the hybrid system include Decision Tree (J48), k-Nearest Neighbor (kNN), Naive Bayes (NB), Random Forest (RF) and Logistic Regression (LOG). Four medical datasets (blood, survival, breast cancer and Pima Indian Diabetes) from the UCI Machine Learning Repository and a genome-wide association study (GWAS) dataset from the genotyping of Single Nucleotide Polymorphisms (SNPs) of Age-related Macular Degeneration (AMD) are used. The methods of the study include PSO (Particle Swarm based Hybrid System), RO (Random Oversampling), RU (Random Undersampling) and Clustering.
The metrics used are AUC (Area Under the ROC Curve), F-measure and Geometric mean. The Particle Swarm based Hybrid System (PSO) gives an average accuracy of 71.6%, RU 67.3%, RO 65.8% and Clustering 64.9% with the kNN classifier on Pima Indian Diabetes. With other classifiers and datasets as well, the study shows that PSO gives better accuracy. In [15], Biswadip Ghosh applies FCP (Fuzzy Composite Programming) to build a diabetes classifier using the Pima diabetes dataset. He evaluated the performance of the fuzzy classifier using Receiver Operating Characteristic (ROC) curves and compared it against a logistic regression classifier. The results show that the FCP classifier is better when the calculated AUC values are compared: the classifier based on logistic regression has an AUC of 64.8%, while the classifier based on FCP has an AUC of 72.2%. The performance of the fuzzy classifier was thus found to be better than that of a logistic regression classifier. Quinlan applied C4.5 and it was 71.1% accurate (Quinlan 1993). In [3], Wahba's group at the University of Wisconsin applied penalized log likelihood smoothing spline analysis of variance (PSA). They eliminated patients with glucose values and BMIs of zero, leaving n = 752, used 500 records as the training set and the remaining 252 as the evaluation set, and reported an accuracy of 72% for the PSA model and 74% for a GLIM model (Wahba, Gu et al. 1992). Michie et al. applied 22 algorithms with 12-fold cross validation and reported the following accuracy rates on the test set: Discrim 77.5%, Quaddisc 73.8%, Logdisc 77.7%, SMART 76.8%, ALLOC80 69.9%, k-NN 67.6%, CASTLE 74.2%, CART 74.5%, IndCART 72.9%, NewID 71.1%, AC2 72.4%, Baytree 72.9%, NaiveBay 73.8%, CN2 71.1%, C4.5 73%, Itrule 75.5%, Cal5 75%, Kohonen 72.7%, DIPOL92 77.6%, Backprop 75.2%, RBF 75.7%, and LVQ 72.8% (Michie, Spiegelhalter et al. 1994, p. 158).

J. Oates (1994) used the Multi-Stream Dependency Detection (MSDD) algorithm with two-thirds of the dataset for training; accuracy on the remaining one-third used for evaluation was 71.33%. Although the cited articles use different techniques, the average accuracy is 70.64% for the Pima Indian Diabetes dataset. The representation of the many missing values by zeros itself forms a characteristic of the data and hence affects the accuracy of the classifier. The main objective of this paper is to explore ways to improve the classification accuracy of a classification algorithm using suitable techniques such as data imputation, scaling and normalization for the classification of diabetes data, which may contain many missing values.

2. DATA PREPROCESSING
Data in the real world is incomplete, noisy, inconsistent and redundant. Knowledge discovery during the training phase is difficult if there is much irrelevant and redundant information, so data preprocessing is important for producing quality mining results. The main tasks in data preprocessing are data cleaning, data integration, data transformation, data reduction and data discretization. The final dataset used is the product of data preprocessing.

2.1. Missing Values and Imputation of Missing Values


Many existing industrial and research datasets contain missing values. They are introduced for reasons such as manual data entry procedures, equipment errors and incorrect measurements. The most important task in data cleaning is the processing of missing values, and many methods have been proposed in the literature to treat missing data [21, 22]. Missing data should be handled carefully; otherwise bias may be introduced into the induced knowledge, affecting the quality of the supervised learning process and the performance of classification algorithms [3, 4]. Missing value imputation is a challenging issue in machine learning and data mining [2]. Because of the difficulty with missing values, many learning algorithms are not well adapted to some application domains (for example, Web applications), since existing algorithms are designed under the assumption that there are no missing values in the dataset. This shows the necessity of dealing with missing values. The most satisfactory solution to missing data is good database design, but good analysis can help to reduce the problems [1]. Many techniques can be used to deal with missing data, and the right technique should be selected depending on the problem domain and the goal of the data mining process. In the literature, the different approaches to handling missing values in a dataset are given as:

1. Ignore the tuple
2. Fill in the missing values manually
3. Use a global constant to fill in the missing values
4. Use the attribute mean to fill in the missing values
5. Use the attribute mean for all samples belonging to the same class as the given tuple
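Strategies 1, 3 and 4 above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the NaN missing-value marker and the toy data are assumptions for demonstration:

```python
import numpy as np

# Toy dataset: rows are records, columns are attributes; np.nan marks a
# missing value (the marker is an assumption for illustration).
data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [5.0, np.nan],
                 [7.0, 8.0]])

# Strategy 1: ignore (drop) any tuple that contains a missing value.
complete_rows = data[~np.isnan(data).any(axis=1)]

# Strategy 3: fill in missing values with a global constant, e.g. 0.
global_fill = np.where(np.isnan(data), 0.0, data)

# Strategy 4: fill each missing value with its attribute (column) mean,
# computed over the observed values only.
col_means = np.nanmean(data, axis=0)
mean_fill = np.where(np.isnan(data), col_means, data)
```

Strategy 5 (class-conditional means) works the same way, with the column means computed separately per class label.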

2.2. Mean substitution


A popular imputation method for filling in missing data values is to use the variable's mean or median. This method is suitable only if the data are MCAR (Missing Completely At Random). It creates a spiked distribution at the mean in frequency distributions, lowers the correlations between the imputed variables and the other variables, and underestimates variance.


The Simple Data Scaling Algorithm

The following algorithm explains the data scaling method.

Let D = {A1, A2, A3, ..., An}, where
  D is the set of unnormalized data,
  Ai is the i-th attribute column of values of D,
  m is the number of rows (records),
  n is the number of attributes.

Function Normalize(D)
Begin
  For i = 1 to n
  {
    Maxi <- max(Ai)
    Mini <- min(Ai)
    For r = 1 to m
    {
      Air <- Air - Mini
      Air <- Air / Maxi
      (where Air is the element of Ai at row r)
    }
  }
  Finally we will have the scaled dataset.
End

The Mean Substitution Algorithm

The following algorithm explains the commonly used form of the mean substitution method [9].

Let D = {A1, A2, A3, ..., An}, where
  D is the set of data with missing values,
  Ai is the i-th attribute column of D, with missing values in some or all columns,
  n is the number of attributes.

Function MeanSubstitution(D)
Begin
  For i = 1 to n
  {
    ai <- Ai - mi
      (where ai is the column of attribute values without missing values and
       mi is the set of missing values in Ai, denoted by a symbol)
    Let ui be the mean of ai
    Replace all the missing elements of Ai with ui
  }
  Finally we will have the imputed dataset.
End
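As a concrete sketch, the two procedures above can be translated into Python with NumPy. This is our illustrative translation (the function names, the zero missing-value marker and the toy data are assumptions), following the scaling steps exactly as written: subtract the column minimum, then divide by the column maximum:

```python
import numpy as np

def normalize(D):
    """Scale each attribute column as in the algorithm above:
    subtract the column minimum, then divide by the column maximum."""
    D = D.astype(float).copy()
    for i in range(D.shape[1]):
        col_max, col_min = D[:, i].max(), D[:, i].min()
        D[:, i] = (D[:, i] - col_min) / col_max   # assumes col_max != 0
    return D

def mean_substitution(D, missing=0.0):
    """Replace values marked as missing with the mean of the observed
    values in the same column (the zero marker is an assumption)."""
    D = D.astype(float).copy()
    for i in range(D.shape[1]):
        observed = D[D[:, i] != missing, i]
        D[D[:, i] == missing, i] = observed.mean()
    return D

# Toy data: the 0.0 in the second column stands for a missing value.
data = np.array([[2.0, 0.0],
                 [4.0, 10.0],
                 [6.0, 20.0]])
imputed = mean_substitution(data)   # 0.0 replaced by mean of 10 and 20
scaled = normalize(imputed)
```

Note that dividing by the column maximum (rather than by max - min) follows the algorithm as stated, so the scaled values need not reach 1.0.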

2.3. k-Nearest Neighbor (kNN) Classification Algorithm


kNN classification classifies instances based on their similarity and is one of the most popular algorithms for pattern recognition. It is a type of lazy learning in which the function is only approximated locally and all computation is deferred until classification. An object is classified by a majority vote of its neighbors; k is always a positive integer. The neighbors are selected from a set of objects for which the correct classification is known. The kNN algorithm is as follows:

1. Determine k, i.e., the number of nearest neighbors.
2. Using the distance measure, calculate the distance between the query instance and all the training samples.
3. Sort the distances of all the training samples and determine the nearest neighbors based on the k minimum distances.
4. Since kNN is supervised learning, get the class labels of the k nearest training samples.
5. The prediction is obtained by taking the majority class among the nearest neighbors.

The Pseudo Code of the kNN Algorithm

Function kNN(train_patterns, train_targets, test_patterns)
  Uc := a set of unique labels of train_targets
  N := size of test_patterns
  for i = 1 to N
    dist := EuclideanDistance(train_patterns, test_patterns(i))
    idxs := sort(dist)
    topkClasses := train_targets(idxs(1:k))
    c := DominatingClass(topkClasses)
    test_targets(i) := c
  end
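A runnable Python sketch of this pseudo code follows; it is our illustration with a made-up toy dataset, not the implementation used in the paper:

```python
import numpy as np
from collections import Counter

def knn_classify(train_patterns, train_targets, test_patterns, k=3):
    """kNN following the pseudo code above: for each test instance,
    compute Euclidean distances to all training samples, sort them,
    and assign the dominating class among the k nearest neighbors."""
    predictions = []
    for x in test_patterns:
        dists = np.linalg.norm(train_patterns - x, axis=1)  # Euclidean
        idxs = np.argsort(dists)                            # nearest first
        top_k = [train_targets[j] for j in idxs[:k]]
        predictions.append(Counter(top_k).most_common(1)[0][0])
    return predictions

# Tiny illustrative training set (values are made up): two well-separated
# clusters labeled 0 and 1.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = [0, 0, 0, 1, 1, 1]
preds = knn_classify(X, y, np.array([[0.5, 0.5], [5.5, 5.5]]))  # -> [0, 1]
```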

2.4. Validating the Performance of the Classification Algorithm


Classifier performance depends on the characteristics of the data to be classified. The performance of the selected algorithm is measured in terms of accuracy and error rate. Various empirical tests can be performed to compare classifiers, such as holdout, random sub-sampling, k-fold cross validation and the bootstrap method. In this study, we selected k-fold cross validation for evaluating the classifiers. In k-fold cross validation, the initial data are randomly partitioned into k mutually exclusive subsets, or folds, d1, d2, ..., dk, each of approximately equal size. Training and testing are performed k times. In the first iteration, subsets d2, ..., dk collectively serve as the training set to obtain a first model, which is tested on d1; the second iteration is trained on subsets d1, d3, ..., dk and tested on d2; and so on [19]. The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data [19]. Accuracy and error rate are defined as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Error rate = (FP + FN) / (TP + FP + TN + FN)

where

TP is the number of True Positives,
TN is the number of True Negatives,
FP is the number of False Positives, and
FN is the number of False Negatives.
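The two metrics and the random fold partitioning can be sketched as follows. This is an illustrative Python sketch (the helper names and the tiny label vectors are ours), not the evaluation code used in the paper:

```python
import numpy as np

def accuracy_and_error(y_true, y_pred):
    """Accuracy = (TP+TN)/(TP+FP+TN+FN), Error = (FP+FN)/(TP+FP+TN+FN),
    treating class 1 as positive. The two always sum to 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    total = tp + fp + tn + fn
    return (tp + tn) / total, (fp + fn) / total

def k_fold_indices(n, k, seed=0):
    """Randomly partition n record indices into k near-equal folds;
    fold i serves as the test set in iteration i."""
    idxs = np.random.RandomState(seed).permutation(n)
    return np.array_split(idxs, k)

# Example: 3 of 4 toy predictions are correct.
acc, err = accuracy_and_error([1, 0, 1, 0], [1, 0, 0, 0])  # 0.75, 0.25
folds = k_fold_indices(10, 3)
```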

3. EXPERIMENTAL RESULTS

3.1. Pima Indians Diabetes Dataset


The Pima (or Akimel O'odham) are a group of American Indians living in southern Arizona. The name "Akimel O'odham" means "river people". The short name "Pima" is believed to have come from the phrase pi 'ai mac or pi mac, meaning "I don't know", used repeatedly in their initial meetings with Europeans [25]. According to the World Health Organization (WHO) [18], a population of women of Pima Indian heritage who were at least 21 years old was tested for diabetes. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.

Number of Instances: 768
Number of Attributes: 8 attributes plus 1 class label (all attributes are numeric-valued)

Sl No  Attribute  Explanation
1      Pregnant   Number of times pregnant
2      Glucose    Plasma glucose concentration (glucose tolerance test)
3      Pressure   Diastolic blood pressure (mm Hg)
4      Triceps    Triceps skin fold thickness (mm)
5      Insulin    2-Hour serum insulin (mu U/ml)
6      Mass       Body mass index (weight in kg/(height in m)^2)
7      Pedigree   Diabetes pedigree function
8      Age        Age (years)
9      Diabetes   Class variable (test for diabetes)

Class Distribution: class value 1 is interpreted as "tested positive for diabetes".
Class Value 0 - Number of instances: 500
Class Value 1 - Number of instances: 268

While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., a blood pressure or body mass index of 0. These values should be treated as missing to achieve better classification accuracy. The following 3D plot clearly shows the complex distribution of the positive and negative records in the original Pima Indians Diabetes Database, which complicates the classification task.

Fig 1. The 3D plot of original Pima Indians Diabetes Database
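The zeros-as-missing observation above can be checked programmatically. A hypothetical sketch in Python (the mini-sample and the choice of column indices follow the attribute table above, but are our illustrative assumptions):

```python
import numpy as np

# Made-up mini-sample in the Pima column order described above:
# pregnant, glucose, pressure, triceps, insulin, mass, pedigree, age.
rows = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1,  85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64,  0, 0, 23.3, 0.672, 32],
])

# Zero is physically impossible for glucose, pressure, triceps, insulin
# and mass, so a zero there is a disguised missing value.
impossible_zero_cols = [1, 2, 3, 4, 5]
missing_mask = (rows[:, impossible_zero_cols] == 0)
n_missing = int(missing_mask.sum())   # disguised missing values found
```

In this mini-sample, all three insulin values and one triceps value are zero, so four disguised missing values are flagged.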

3.2. Results of the 10 Fold Validation


The following table shows the results in terms of accuracy, error and run time for 10-fold validation.

Table 1. 10-Fold Validation Results

Metric     Actual Data   After Imputation   After Imputation and Scaling
Accuracy   71.71         71.32              71.84
Error      28.29         28.68              28.16
Time       0.2           0.28               0.23

As shown in the following graphs, the performance was considerably improved when using imputed and scaled data. The performance after imputation alone is almost equal to, or slightly lower than, that obtained with the unmodified data. The reason is that the original data contain many missing values, and the missing values represented by 0 are treated as a significant feature by the classifier. Even though the measured performance on the original data is higher than that on the imputed data, it should not be relied upon, because it is not accurate. If we train the system with the original data, which contain many missing values, and use it to classify real-world data that may not contain any missing values, the system will not classify such ideal data correctly.

161

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 6367(Print), ISSN 0976 6375(Online) Volume 3, Issue 3, October-December (2012), IAEME


Fig 2. Comparison of Accuracy - 10-Fold Validation

The accuracy of classification has been significantly improved after imputation and scaling.

Fig 3. Comparison of Error - 10-Fold Validation

The classification error has been significantly reduced after imputation and scaling.

3.3. Results of the Average of 2 to 10 Fold Validation


The following table shows the average of the results in terms of accuracy, error and run time for 2- to 10-fold validation.

Table 2. Average of 2- to 10-Fold Validation Results

Metric     Actual Data   After Imputation   After Imputation and Scaling
Accuracy   71.94         71.26              73.38
Error      28.06         28.74              26.62
Time       0.19          0.19               0.19


Fig 4. Comparison of Accuracy - Average of 2- to 10-Fold Validations

The accuracy of classification has been significantly improved after imputation and scaling. The average of the 2- to 10-fold validations clearly shows the difference in accuracy.

Fig 5. Comparison of Error - Average of 2- to 10-Fold Validations

The classification error has been significantly reduced after imputation and scaling. The average of the 2- to 10-fold validations clearly shows the difference in classification error. The graphs above show that imputation alone does not give higher accuracy, but imputation combined with a data preprocessing technique can improve accuracy on a dataset that has missing values. We also compared our improved kNN with Weka's implementation of kNN. The results show that the accuracy of the improved kNN with imputation and scaling is 3.2% higher than that of the standard Weka implementation of kNN.

4. CONCLUSION AND SCOPE FOR FURTHER ENHANCEMENTS


The results show that the characteristics and distribution of the data and the quality of the input data have a significant impact on the performance of the classifier. The performance of the kNN classifier was measured in terms of accuracy, error rate and run time using the 10-fold validation method and the average of 2- to 10-fold validations. The application of the imputation method and data normalization had an obvious

impact on classification performance and significantly improved the performance of kNN. The performance after imputation alone is almost equal to, or slightly lower than, that obtained with the original data. The reason is that the original data contain many missing values, and the missing values represented by 0 are treated as a significant feature by the classifier. Thus the data imputation method does not always lead to apparently better results; rather, it leads to correct results. In other words, imputation will not always yield higher accuracy for imputed missing values, but imputation together with a suitable data preprocessing method increases the accuracy. The accuracy of our improved kNN with imputation and scaling is 3.2% higher than that of the standard Weka implementation of kNN. To improve the overall accuracy further, we have to improve the performance of the classifier or use a good feature selection method. Future work may address a hybrid classification model using kNN together with other techniques. Even though the achieved accuracy of the kNN-based method is slightly lower than that of some previous methods, there is still scope for improving it. For example, we may use another distance metric instead of the standard Euclidean distance in the distance calculation part of kNN to improve accuracy. We may even use an evolutionary computing technique such as GA or PSO for feature selection or feature weight selection to attain the maximum possible classification accuracy with kNN. Future work may address ways to incorporate these ideas along with the standard kNN.

ANNEXURE: RESULTS OF K-FOLD VALIDATION

The following tables show the k-fold validation results where k is varied from 2 to 10. We used the average of these results, as well as the results corresponding to k = 10, to prepare the graphs in the previous sections.

Table 3.
Results With Actual Pima Indian Diabetes Dataset

k     Accuracy   Error   Time
2     72.53      27.47   0.13
3     74.35      25.65   0.20
4     71.61      28.39   0.16
5     70.98      29.02   0.17
6     72.40      27.60   0.19
7     70.90      29.10   0.25
8     72.14      27.86   0.20
9     70.85      29.15   0.19
10    71.71      28.29   0.20
avg   71.94      28.06   0.19


Table 4. Results With Imputed Data

k     Accuracy   Error   Time
2     71.22      28.78   0.16
3     70.44      29.56   0.19
4     70.83      29.17   0.16
5     71.24      28.76   0.14
6     71.88      28.13   0.19
7     71.56      28.44   0.20
8     71.61      28.39   0.17
9     71.24      28.76   0.19
10    71.32      28.68   0.28
avg   71.26      28.74   0.19

Table 5. Results With Imputed and Scaled/Normalized Data

k     Accuracy   Error   Time
2     74.74      25.26   0.22
3     74.09      25.91   0.13
4     71.88      28.13   0.16
5     73.59      26.41   0.17
6     72.92      27.08   0.17
7     74.31      25.69   0.16
8     74.09      25.91   0.22
9     72.94      27.06   0.22
10    71.84      28.16   0.23
avg   73.38      26.62   0.19

5. REFERENCES
[1] Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[2] Brian D. Ripley (1996), Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge.
[3] Grace Wahba, Chong Gu, Yuedong Wang, and Richard Chappell (1995), "Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance," in D. H. Wolpert (Ed.), The Mathematics of Generalization, 331-359, Addison-Wesley, Reading, MA.
[4] S. Sahan, K. Polat, H. Kodaz, and S. Gunes, "The medical applications of attribute weighted artificial immune system (AWAIS): Diagnosis of heart and diabetes diseases," in ICARIS, 2005, pp. 456-468.
[5] M. Salami, A. Shafie, and A. Aibinu, "Application of modeling techniques to diabetes diagnosis," in IEEE EMBS Conference on Biomedical Engineering & Sciences, 2010.
[6] T. Exarchos, D. Fotiadis, and M. Tsipouras, "Automated creation of transparent fuzzy models based on decision trees - application to diabetes diagnosis," in 30th Annual International IEEE EMBS Conference, 2008.
[7] M. Ganji and M. Abadeh, "Using fuzzy ant colony optimization for diagnosis of diabetes disease," IEEE, 2010.
[8] T. Jayalakshmi and A. Santhakumaran, "A novel classification method for diagnosis of diabetes mellitus using artificial neural networks," in International Conference on Data Storage and Data Engineering, 2010.
[9] R.S. Somasundaram and R. Nedunchezhian, "Evaluation of Three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values," International Journal of Computer Applications, ISSN 0975-8887, 2011.
[10] Peterson, H. B., Thacker, S. B., Corso, P. S., Marchbanks, P. A. & Koplan, J. P. (2004). "Hormone therapy: making decisions in the face of uncertainty," Archives of Internal Medicine, 164(21), 2308-12.
[11] Kietikul Jearanaitanakij, "Classifying Continuous Data Set by ID3 Algorithm," Proceedings of the Fifth International Conference on Information, Communication and Signal Processing, 2005.
[12] Manaswini Pradhan, Dr. Ranjit Kumar Sahu, "Predict the onset of diabetes disease using Artificial Neural Network (ANN)," International Journal of Computer Science & Emerging Technologies (E-ISSN: 2044-6004), Volume 2, Issue 2, April 2011.
[13] P. Umar Sathic Ali, Dr. C. Jothi Ventakeswaran, "Improved Evidence Theoretic kNN Classifier based on Theory of Evidence," International Journal of Computer Applications (0975-8887), Volume 15, No. 5, February 2011.
[14] Pengyi Yang, Liang Xu, Bing B. Zhou, Zili Zhang, and Albert Y. Zomaya, "A particle swarm based hybrid system for imbalanced medical data sampling," BMC Genomics, 2009; 10(Suppl 3): S34.
[15] Biswadip Ghosh, "Using Fuzzy Classification for Chronic Disease," Indian Journal of Economics & Business, March 2012.
[16] Angeline Christobel, SivaPrakasam, "An Empirical Comparison of Data Mining Classification Methods," International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[17] Roshawnna Scales, Mark Embrechts, "Computational intelligence techniques for medical diagnostics."
[18] World Health Organization. Available: http://www.who.int
[19] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA, 2006.
[20] Agrawal, R., Imielinski, T., Swami, A., "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
[21] E. Acuna and C. Rodriguez, "The treatment of Missing Values and its effect on the classifier accuracy," in D. Banks, L. House, F.R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering and Data Mining Applications, Springer-Verlag, Berlin-Heidelberg, 2004.
[22] G.E.A.P.A. Batista and M.C. Monard, "An analysis of four Missing Data treatment methods for supervised learning," Applied Artificial Intelligence 17 (2003).
[23] Chen, M., Han, J., Yu, P.S., "Data Mining: An Overview from Database Perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[24] Angeline Christobel, Usha Rani Sridhar, Kalaichelvi Chandrahasan, "Performance of Inductive Learning Algorithms on Medical Datasets," International Journal of Advances in Science and Technology, Vol. 3, No. 4, 2011.
[25] Awawtam, "Pima Stories of the Beginning of the World," The Norton Anthology of American Literature, 7th ed., Vol. A, New York: W. W. Norton, 2007, 22-31.
[26] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[27] S.B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Informatica 31 (2007) 249-268, 2007.
[28] J. R. Quinlan, "Improved use of continuous attributes in C4.5," Journal of Artificial Intelligence Research, 4:77-90, 1996.
[29] J.R. Quinlan, "Induction of decision trees," in Jude W. Shavlik, Thomas G. Dietterich (Eds.), Readings in Machine Learning, Morgan Kaufmann, 1990. Originally published in Machine Learning, Vol. 1, 1986, pp. 81-106.
[30] Salvatore Ruggieri, "Efficient C4.5," IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444, 2002.
[31] P. Domingos, M. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss," Machine Learning 29(2-3) (1997) 103-130.
[32] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[33] Michael J. Sorich, John O. Miners, Ross A. McKinnon, David A. Winkler, Frank R. Burden, and Paul A. Smith, "Comparison of linear and nonlinear classification algorithms for the prediction of drug and chemical metabolism by human UDP-Glucuronosyltransferase isoforms."
[34] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
[35] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg, "Top 10 algorithms in data mining," Springer, 2007.
[36] Niculescu-Mizil, A., & Caruana, R. (2005), "Predicting good probabilities with supervised learning," Proc. 22nd International Conference on Machine Learning (ICML'05).
[37] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining," MultiConference of Engineers and Computer Scientists, Vol. I, IMECS 2009, Hong Kong, 2009.
[38] Arbach, L., Reinhardt, J.M., Bennett, D.L., Fallouh, G. (Iowa Univ., Iowa City, IA, USA), "Mammographic masses classification: comparison between backpropagation neural network (BNN), K nearest neighbors (KNN), and human readers," 2003 IEEE CCECE.
[39] D. Xhemali, C.J. Hinde, and R.G. Stone, "Naive Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages," International Journal of Computer Science, 4(1):16-23, 2009.