International Journal of Pure and Applied Mathematics Special Issue

Abstract—Data mining techniques (DMTs) are very helpful for making predictions from medical datasets at an early stage in order to save human life. Large amounts of medical data are available from different data sources and are used in real-world applications. Machine learning is used for prediction on disease data. Currently, Diabetes Disease (DD) is one of the leading causes of death worldwide. To cluster and predict symptoms in medical data, various data mining techniques have been used by different researchers at different times. This study uses a total of 768 records from the Pima Indian Diabetes Data Set (PIDD), accessed from an online source. In the proposed system, well-known predictive algorithms are applied: SVM, Naïve Net, Decision Stump, and the Proposed Ensemble Method (PEM). PEM is an ensemble hybrid model made by combining the individual techniques into one. The proposed ensemble method provides a high accuracy of 90.36%.

Keywords—collaborative; Diabetes; classification; Machine learning; Data mining; SVM; Naïve Net; Decision Stump; PEM

I. INTRODUCTION
Many chronic diseases are currently distributed throughout the world, in both developing and developed countries. Among these serious diseases, diabetes mellitus is one of the chronic diseases that cut human life short at an early age. Diabetes mellitus (DM) is the name used by health professionals. Diabetes is currently increasing very rapidly, for example in India and some Saharan countries. It is not difficult to see how serious and chronic diabetes is. Different countries, organizations, and health sectors are concerned with controlling and preventing this chronic disease before a person dies, which calls for early prediction of diabetes in order to save human life. Diet is one factor in diabetes, and exercise supports health: even a person living with diabetes can recover from the disease by exercising. Diabetes can damage different parts of the human body, including the heart, eyes, kidneys, and nerves [39]. This indicates how chronic and dangerous the disease is and how it shortens human life.

Tao et al. [2] note that machine learning algorithms differ in their power for both classification and prediction. According to Saba et al. [12], no single technique gives the best performance and accuracy for all diseases: one classifier shows the highest performance on a given dataset, while another approach outperforms the others for other diseases. The proposed study concentrates on a novel combination, or hybridization, of different classifiers for diabetes mellitus (DM) classification and prediction, thus overcoming the problem of individual classifiers. It applies different machine learning techniques (MLTs) to predict DM at an early stage to save human life. These algorithms are SVM, Naïve Net, Decision Stump, and PEM, used to increase prediction accuracy and performance.

II. RELATED WORK
Song et al. [5] explained various algorithms using parameters such as glucose, blood pressure (BP), skin thickness (ST), insulin, body mass index (BMI), diabetes pedigree function (DPF), and age. Not all parameters were included, and only a small data sample was used. ANN, EM, GMM, logistic regression, and SVM were applied to a diabetes dataset; the artificial neural network (ANN) provided better accuracy and performance than the other algorithms. Xue-Hui Meng et al. [38] used different data mining techniques to predict diabetes from real-world data collected by distributing a questionnaire; SPSS and Weka were used for data analysis and prediction, respectively. Weifeng Xu et al. [3] applied different machine learning algorithms to the prediction of diabetes; of these, RF provided better accuracy than the other data mining techniques. Loannis et al. [1] used 10-fold cross-validation as the evaluation method for three algorithms: logistic regression, Naïve Bayes, and SVM. Of the three, SVM provided higher accuracy and performance than the other methods.

Tao et al. [2] applied KNN, Naïve Bayes, Random Forest, decision tree, SVM, and logistic regression to predict diabetes mellitus (DM) at an early stage, concentrating on filtering. Yunsheng et al. [4] used KNN and DISKR; storage space was reduced, and instances with a smaller factor were eliminated. Removing outliers increased both performance and accuracy. Swarupa et al. [7] reported that Naïve Bayes (NB) provided good accuracy, with a value of 77.01%. Sajida et al. [9] found that AdaBoost provided better performance and accuracy. Pradeep and Naveen [8] noted J48 as one of the most popular algorithms, with better accuracy as well as good performance in their study. Pradeep et al. [11] also found that the J48 machine learning algorithm provided better performance and accuracy.

Santhanam and Padmavathi [10] applied K-means, a genetic algorithm, and SVM, which increased accuracy. Xue-Hui Men et al. [13] found that J48 provided high performance. Krati et al. [14] applied k-nearest neighbour (KNN), which achieved an accuracy of 70%. Ramiro et al. [6] showed that wrong treatment can be reduced by applying a fuzzy rule mechanism, which also helps doctors as a recommender system to treat patients without making mistakes. Saba et al. [12] applied different data mining algorithms; of these, the meta classifier provided higher accuracy than a single classifier.
III. METHODOLOGY
Many studies have been carried out on diabetes. A summary of their common or major findings is given as follows.
| No. | Author(s) | Methods | Findings |
|---|---|---|---|
| 10 | Santhanam and Padmavathi [10] | K-means with genetic algorithm, and SVM | The integrated clustering and classification algorithm provided better performance. |
| 11 | Pradeep et al. [11] | KNN, J48, SVM, and Random Forest | J48 provided the best accuracy (73.82%) before pre-processing; KNN and RF provided good accuracy after pre-processing. |
| 12 | Saba et al. [12] | CART, C4.5, Bagging, and ID3 | The algorithms were applied to two diabetic datasets. |
| 13 | Xue-Hui Men et al. [13] | KNN, logistic regression, and J48 | 78.27% accuracy was measured using this method. |
| 14 | Krati et al. [14] | KNN | 70% and 57% accuracy measured on data test 1 and data test 2, respectively. |
| 15 | Saravananathan and Velmurugan [15] | SVM, CART, KNN, and J48 | J48, CART, SVM, and KNN showed accuracies of 67.15%, 62.28%, 65.04%, and 53.39%, respectively. |
| 16 | Yang et al. [16] | NB, Bayes network | 72.3% accuracy measured by the Bayes network. |
| 17 | Asma [17] | Decision tree | 78.1768% accuracy. |
| 18 | Anjli and Varun [18] | SVM | 72% accuracy measured. |
| 19 | Thirumal et al. [19] | C4.5, SVM, KNN, and Naïve Bayes | C4.5 showed higher accuracy than the others, with a value of 78.2552%. |
| 20 | Ayush and Divya [20] | CART | 75% accuracy. |
| 21 | Veena and Anjali [21] | SVM, Decision Stump, NB, and decision tree | Decision Stump performed best, with 80.72% accuracy. |
| 22 | Anuja and Chitra [22] | SVM | Better performance, with an accuracy of 78%. |
| 23 | Prajwala [23] | DT and RF | Random forest performed better than decision tree. |
| 24 | Bum et al. [24] | NB, logistic regression, and anthropometry | Focused on prediction of fasting glucose level; 74.1% performance and accuracy measured by anthropometry. |
| 25 | Aruna and Nazneen [25] | Fuzzy rule, GA, and KNN | Some rules were generated. |
| 26 | Sakorn [26] | Expert system and fuzzy rule | Focused on an expert system developed for treatment purposes. |
| 27 | Seokho et al. [28] | SVM, E2_SVM | E2_SVM performed better, with 80% accuracy. |
| 28 | Emrana et al. [11] | KNN, C4.5 | C4.5 provided the higher accuracy, 90.43%; KNN provided 76.96%. |
| 29 | Kamadi et al. [30] | DT, Gini index, Gaussian fuzzy function | The DT model showed better and more efficient accuracy than the other methods. |
| 30 | Munaza Ramzan [29] | J48, Naïve Bayes, and RF | RF provided better accuracy than J48 and Naïve Bayes under 10-fold cross-validation. |
| 31 | Patil et al. [32] | HPM | 92.38% accuracy recorded using HPM. |
| 32 | Abdullah et al. [31] | Support vector machine | Prediction was performed effectively using this technique. |
| 33 | Mounika et al. [32] | ZeroR, NB, and OneR | Effective treatment was applied to young and old patients; NB performed better than the other methods. |
| 34 | Nongyao and Rungruttikarn [34] | LR, Boosting, Naïve Bayes, ANN, Bagging, and decision tree | 85.558% accuracy was measured using Random Forest, recorded as the best accuracy. |
| 35 | Amit and Pragati [36] | RF, MLP, C4.5, and Bayes Net | The combination of MLP and BayesNet showed the best accuracy, 81.89%, better than the other classification algorithms. |
| 36 | Saba et al. [35] | NB, HMV, RF, AdaBoost, KNN, LR, and SVM | Studied various diseases including diabetes; 78.085% accuracy by the HMV algorithm, recorded as the best accuracy. |
| 37 | Rian and Irwansyah [37] | Fuzzy rule | Different rules were generated that help early detection of diabetes. |
Based on the existing literature, we employed four of the best-known prediction algorithms, such as Support Vector Machine (SVM), Naïve Net (NN), and Decision Stump (DS), as base learners and combined their predictions into one to increase prediction accuracy.

A. Support Vector Machine (SVM)
The support vector machine is one of the most popular and widely used machine learning techniques.
Step 1: Identify the right hyperplane.
Step 2: Maximize the distance between the hyperplane and the neighbouring data points.
Step 3: Add a feature, z = x^2 + y^2, which lets SVM handle data that a straight line cannot separate.
Step 4: Apply the SVM classifier to classify the class; the class is binary.

B. Naïve Net (NN)
The time complexity of this technique is low. It computes the prediction from likelihoods using the probability formula, and it is used to maximize the probability (C|F), where (C|F) means PR (class | feature).
Maximization = PR (class | feature)
Step 1: Convert the data into a frequency table.
Step 2: Find the likelihood.
Step 3: Use the naïve Bayes equation. Here the prediction is done.

C. Decision Stump (DS)
The decision stump is one of the most popular machine learning classification algorithms and makes its decision on a single-level input value. Most of the time it is appropriate for ensemble methods, especially boosting, which is one reason for using it.

D. Collaborative (Ensemble) model
For prediction, individual algorithms do not always provide good and efficient performance. The collaborative approach addresses the limitations of the distinct classifiers and improves accuracy by combining them into one [12, 32].
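The idea of combining several base classifiers into one prediction can be illustrated with a small sketch. The text does not state the exact combination rule, so majority voting is used here purely as an assumption. The three base classifiers, their feature names, and their thresholds are hypothetical stand-ins for trained models, not the authors' implementation.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most base classifiers."""
    # most_common(1) gives [(label, count)] for the most frequent label.
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, sample):
    """Collect one label per base classifier and combine them by majority vote."""
    return majority_vote([clf(sample) for clf in classifiers])

# Hypothetical stand-ins for three trained base classifiers.
# Each maps a sample (a dict of features) to 1 (diabetic) or 0 (non-diabetic).
# The thresholds are illustrative only and are not taken from the paper.
def base_a(sample):
    # Behaves like a decision stump: a single threshold on a single feature.
    return 1 if sample["glucose"] > 127 else 0

def base_b(sample):
    return 1 if sample["bmi"] > 30.0 else 0

def base_c(sample):
    return 1 if sample["age"] > 50 else 0

if __name__ == "__main__":
    patient = {"glucose": 150, "bmi": 33.0, "age": 30}
    print(ensemble_predict([base_a, base_b, base_c], patient))
```

With an odd number of binary base classifiers a tie cannot occur, which keeps the voting rule simple; other combination rules, such as weighted voting or boosting, would need a different sketch.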
[Figure: diagram of the proposed ensemble model. Elements shown: dataset; base classifiers SVM, Naïve Net, and Decision Stump; validation method (10-fold cross-validation); testing set; final prediction.]
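The 10-fold cross-validation used as the validation method can be sketched as follows. This is a minimal illustration that splits record indices into contiguous folds; the function names are hypothetical, and shuffling or stratification, which practical tools often apply, is omitted as a simplifying assumption.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional record.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_test_splits(n_samples, k=10):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

if __name__ == "__main__":
    # 768 records, as in the diabetes dataset described in the text.
    for train, test in train_test_splits(768, k=10):
        print(len(train), len(test))
```

Each record is used for testing exactly once and for training in the other nine rounds, so the reported accuracy is an average over all 768 records rather than over one fixed split.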
$$\text{Accuracy} = 100 \times \frac{\text{correctly classified}}{\text{correctly classified} + \text{incorrectly classified}}$$
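The accuracy formula can be checked with a short calculation. The counts below are illustrative assumptions chosen to sum to 768 records; they are not reported in the text.

```python
def accuracy(correct, incorrect):
    """Percentage of instances classified correctly."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("at least one classified instance is required")
    return 100 * correct / total

if __name__ == "__main__":
    # Hypothetical counts: 694 correct and 74 incorrect out of 768 records.
    print(round(accuracy(694, 74), 2))
```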
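[Figure: Performance of algorithms. Bar chart of accuracy (percentage, 0 to 100) by algorithm: SVM, Bayes Net, DecisionStump, AdaBoostM1, and proposed method. Bar labels recovered: 90.36, 88.8, 88.54, 85.68, and 83.72; the text attributes 90.36% to the proposed method and 83.72% to DecisionStump.]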
V. CONCLUSION
Various data mining methods and their applications were studied and reviewed. Machine learning algorithms have been applied to different medical datasets, including diabetes datasets, and their power differs from one dataset to another. We obtained a diabetes dataset of 768 records from UCI. This study compared the individual algorithms with the proposed method, using 10-fold cross-validation to evaluate the performance of these machine learning classification methods. The proposed method provided the highest accuracy, 90.36%, and DecisionStump provided the lowest, 83.72%. Therefore, the ensemble method provides better prediction performance and accuracy than a single classifier.

VI. FUTURE WORK
This study concentrated only on diabetes; in future, the method can be extended to other diseases. Only a small data sample was used, so the method can be applied to larger amounts of data. Only a single dataset was used, so multiple datasets can be used for prediction in future. Only a limited set of base classifiers was used; in future it is possible to use other base classifiers such as ANN, Naïve Bayes, KNN, Random Tree, and others.