
Classification Analysis and Mining for Unexpected Subtypes in Diabetes Database through WEKA tool
Gulzar Ahmad (SAP: 22388)
Department of Computer Science

Abstract--Online social networks have become very popular in recent years. People use social networking sites to connect with relatives, loved ones, friends and colleagues. As the volume of data increases rapidly, it creates security and privacy issues in online social networks, which makes it necessary to retrieve information about the trends and problems in these networks. Datasets from online social media networks need to be analyzed and visualized by the client or user. The WEKA tool is used for classification and visualization of data. A holistic approach is taken to classify and analyze the diabetes dataset. In this paper, the diabetes.arff dataset is used for preprocessing and prediction. WEKA is very useful for classifying and analyzing the diabetes.arff dataset. The results help predict whether or not an individual is affected by diabetes. The analysis finds that people affected by diabetes have an age greater than or equal to 40 and a mass greater than 35, whereas people who do not suffer from diabetes have an age and mass of less than 35. During analysis, no person without authority should be allowed to modify the data. Similarly, Dhawan and Ekta discussed numerous fake-profile detection techniques in social networks, which help keep users' data safe from damage.

Keywords--WEKA, analysis, classification, dataset, machine learning

I. INTRODUCTION
Many people share their ideas, media and feelings on social media networks by connecting with each other. When people connect on social networks, data is generated in very large amounts and at a very high rate. Data is also generated at large scale due to large-scale production and development in an organization. Social network sites need to mine past data to improve their products and services. Data must not be modifiable by any unauthorized user who analyzes it. Ekta and Dhawan discussed many fake-profile detection techniques in social media networks that protect users' data from damage. Objects in a dataset are classified according to their similarities. Classification is the best-known and most widely used method. It aims to accurately predict the target class of an object whose class label is unknown. WEKA provides classification algorithms in nine groups. The following algorithms are selected: Naïve Bayes, Logistic and J48.

II. LITERATURE REVIEW
Compared with other fields, medical data mining has few applications; the existing work describes experience in trying to repeatedly obtain medical data from clinical records. Experiments were carried out on three clinical databases, and the rules obtained were compared against a set of established clinical rules. Previous research on this problem follows these approaches:

• Determine all rules first and then let the user query and retrieve those he/she is interested in. A typical approach is that of patterns [3]. It lets the user specify the rules he/she is interested in as patterns. The system then uses the patterns to retrieve the matching rules from the set of discovered rules.
• Use constraints to restrict the mining process so that only relevant rules are produced. [2] offers an algorithm that can take item constraints specified by the user in the association rule mining process, so that only rules satisfying the user-specified item constraints are produced. This also does not work well, because doctors frequently do not have any specific rules to mine.
• Find unexpected rules. This approach first asks the user to specify his/her existing knowledge about the field. The system then finds the unexpected rules [5].

A good amount of data mining research exists in the field of medical diagnosis. It is worthwhile at the outset to take stock of recent developments related to the proposed research.

Researchers have applied incremental learning and decision trees to the observed symptoms of cardiac disease and diabetes at an early stage [1]. The dataset was collected from patient data logged in the clinical records of a hospital and analysed with classification algorithms, using models such as decision trees and classification rules.

Researchers have developed a decision support system for disease analysis that makes use of data mining techniques [2]. Three different classifier algorithms based on artificial intelligence, namely Naive Bayes,
Multilayer Perceptron and J48, were applied to the diabetes dataset. These classifiers are commonly used in biomedical engineering, data mining and medical diagnosis.

Researchers have collected diabetes patient data consisting of 768 instances with 9 attributes from a hospital repository [3] and performed classification on it. The instances in the dataset relate to two groups of tests: blood tests and urine tests. WEKA was used to classify the data, and the results were assessed by means of 10-fold cross-validation.

Authors have discussed data mining techniques for processing a dataset and assessing the significance of its classification [4]. The output presents how WEKA analyzes file changes, the selection of the attributes to be mined, and the assessment with knowledge extraction.

Researchers have presented an intelligent model for healthcare related to diabetic patients [5]. This web-based data mining tool brings many advantages to well-resourced hospitals through utilisation of resources and estimation of a patient's disease.

Authors have conducted a systematic review of applications of data mining techniques in diabetes research, using the MEDLINE database from PubMed for analysis [6]. They reviewed around 20 articles in the field. The information extracted from the articles is presented by purpose of the study, group/topic of research, diabetes type, dataset used, data mining methods applied, data mining software and technology used, and outcome of the data mining application.

III. RESEARCH METHODOLOGY
Data mining is a well-known computer-based methodology for extracting knowledge from data [12]. The proposed research framework is shown in Figure 1.

• The methodology adopted for the research problem begins with data collection and preparation. The Pima Indians Diabetes Database training dataset used for data mining was retrieved from the UCI Machine Learning Repository. The data preparation phase covers almost all activities needed to convert the initial raw data into the final dataset.
• Many different modelling techniques are selected and applied, and their parameters are calibrated to optimum values. Classification techniques are applied to the same data mining problem. Where required, a subset of the dataset is converted into the required form.
• The comparative behaviour of different algorithms is examined, and models are selected based on their efficiency. The evaluation phase measures the degree to which a model meets the required objectives.

Figure 1. Research Framework

The objective of classification is to assign a class to previously unseen data records as accurately as possible. The main objective is to construct a model that predicts the class of each record from its attributes. The accuracy of the constructed model is determined using test data. To check the performance of the model, the dataset is split into training sets and a test set. The training sets are used to construct the model and the test set is used to validate it.

Data mining (also known as data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software provides many different tools for analyzing data. It allows users to analyze data from many different angles and dimensions, categorize it, and summarize the relationships identified. Data mining is the process of finding correlations or patterns among dozens of fields in very large relational databases. In our research, various data mining processes were applied to the socio-demographic data of users. The main objective of mining the diabetes data is to improve the accuracy and correctness of the data.

WEKA TOOL
This section gives an overview of WEKA. WEKA stands for Waikato Environment for Knowledge Analysis. It is widely used for data mining. The tool is written in Java and was developed by the University of Waikato in New Zealand. WEKA is freely available on the internet under the General Public License. It accepts input files with different extensions, such as .CSV and .ARFF. After a file is opened and explored, classification, clustering, association, etc. can be performed. The latest version of Java should be
installed on the system before installing WEKA. WEKA can be downloaded from its official site [1].

IV. DATA PROCESSING
The diabetes.arff dataset is used to predict whether or not an individual is affected by diabetes. This dataset file contains different attributes that help the user predict whether a person is affected by diabetes. When the file is loaded in WEKA, the attributes contained in the dataset are displayed, as shown in Fig. 1.

Fig. 1: ARFF file processed in WEKA

Classification through NaiveBayes
The Naïve Bayes algorithm is based on Bayes' theorem and works with conditional probabilities. It is a very powerful algorithm for predictive modeling and works very well in real-world situations. For example, Naïve Bayes is well suited to spam filtering, which is a very popular problem. The algorithm as used here assumes that there is no missing data and that the variables are discrete. The dataset used contains no missing data.

Fig. 3: Classification through NaiveBayes
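The conditional-probability reasoning behind Naïve Bayes can be illustrated with a minimal Python sketch. It is not WEKA's NaiveBayes implementation: the six toy records, the binned attribute values (age>=40, mass>35), the class labels and the function names are hypothetical and chosen only for illustration, and add-one (Laplace) smoothing is an added assumption so that unseen values do not produce zero probabilities.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    # distinct values seen for each attribute (needed for smoothing)
    distinct = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct


def posterior(model, row):
    """Return P(class | row) for every class, assuming independent attributes."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # smoothed conditional P(value | class); add-one smoothing is an assumption
            score *= (count + 1) / (n_label + len(distinct[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


def predict(model, row):
    """Pick the class with the highest posterior probability."""
    probs = posterior(model, row)
    return max(probs, key=probs.get)


# Hypothetical toy records (age bin, mass bin); NOT taken from diabetes.arff.
TOY_ROWS = [
    ("age>=40", "mass>35"),
    ("age>=40", "mass>35"),
    ("age>=40", "mass<=35"),
    ("age<40", "mass<=35"),
    ("age<40", "mass<=35"),
    ("age<40", "mass>35"),
]
TOY_LABELS = [
    "tested_positive", "tested_positive", "tested_positive",
    "tested_negative", "tested_negative", "tested_negative",
]
MODEL = train_naive_bayes(TOY_ROWS, TOY_LABELS)
```

With these toy counts, `predict(MODEL, ("age>=40", "mass>35"))` returns `"tested_positive"`. Real attributes such as age and mass are numeric, so a sketch like this would first need them binned into discrete values, as assumed above.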
The diabetes.arff dataset contains multiple attributes of different types, such as preg, plas, mass, age and class. Data can be preprocessed and analyzed in WEKA using multiple data mining techniques, for example clustering, visualization, regression and classification. Fig. 2 shows the processed attributes graphically.

Fig. 2: Classification through NaiveBayes

Classification through tree J48
J48 is based on supervised learning. It is also known as a tree classifier and accepts nominal classes only. Prior knowledge, in the form of labelled instances, is required to classify instances. J48 is used to construct a decision tree. The tree is built by splitting the data into subsets, and the normalized information gain is calculated for each split. When all instances in a subset belong to the same class, the splitting process ends. J48 handles both discrete and continuous attributes, and it also handles missing (lost) attribute values. Because of its low cost and ease of implementation, J48 is a frequently used algorithm.

Fig. 4: Classification through tree J48
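J48 is WEKA's implementation of the C4.5 decision tree algorithm, which chooses each split by normalized information gain (gain ratio). The following Python sketch shows that calculation under the simplifying assumption of nominal attribute values only. The toy lists and function names are hypothetical and do not come from diabetes.arff or from WEKA's code.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of labels or values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on a nominal attribute, divided by the split information."""
    total = len(labels)
    # group the class labels by attribute value
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # weighted entropy that remains after the split
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder
    # split information is the entropy of the attribute values themselves
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data; NOT taken from diabetes.arff.
TOY_LABELS = ["pos", "pos", "pos", "neg", "neg", "neg"]
TOY_AGE = [">=40", ">=40", ">=40", "<40", "<40", "<40"]
TOY_MASS = [">35", ">35", "<=35", "<=35", "<=35", ">35"]

# the attribute with the highest gain ratio would be chosen for the split
BEST = max(
    {"age": TOY_AGE, "mass": TOY_MASS}.items(),
    key=lambda item: gain_ratio(item[1], TOY_LABELS),
)[0]
```

In this toy example the age attribute separates the two classes perfectly, so it has the higher gain ratio and `BEST` is `"age"`. Both resulting subsets then contain a single class, which is the stopping condition described in the text.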


Classification through function Logistic
In a logistic regression predictive model, the target variable is categorical. The variable has two categories, such as has disease/does not have disease, live/die, or purchase/does not purchase. A logistic regression model does not involve a decision tree. It is closer to nonlinear regression, for example fitting a polynomial to a set of data values.

Fig. 5: Classification through function Logistic

To obtain information regarding diabetes, the output of the Logistic classifier has to be analyzed. The result shows that this classification is not accurate enough. Therefore, to improve accuracy, the NaiveBayes algorithm is applied. NaiveBayes provides a refined ranking of the attributes. After all attributes are ranked, the lower-ranked attributes can be ignored to get a more accurate result. Fig 3 shows which attributes have lower and higher ranks.

Fig 4 shows the actual results, which are very helpful in predicting whether or not a person is affected by diabetes. In other words, it separates the diabetic and non-diabetic individuals. In WEKA, visualization can be performed on the result, as shown in fig 7. The result contains the classes test positive and test negative. The two attributes mass and age are useful for predicting diabetes: those who do not suffer from diabetes have a mass of less than 35 and an age of 30, whereas those who suffer from diabetes have an age of 40 and a mass greater than 35.

V. CONCLUSION
Data mining makes it possible to extract information from a dataset. The information obtained helps an organization improve its business and products. Data mining can be performed efficiently and precisely with the WEKA tool. This paper demonstrates WEKA's performance in analyzing the diabetes data. Three different algorithms are used to rank higher and lower attributes, which predict whether an individual may be affected by diabetes or has no symptoms of it.

REFERENCES
[1] Sanjeev Dhawan and Ekta, “Implication of Various Fake Profile Detection Techniques in Social Networks”, IOSR Journal of Computer Engineering (IOSR-JCE), AETM’16, 2016, pp. 49-55.
[2] M. Venkat Dass, Mohammed Abdul Rasheed and Mohammed Mahmood Ali, “Classification of lung cancer subtypes by data mining technique”, 2014.
[3] International Conference on Control, Instrumentation, Energy and Communication (CIEC), IEEE, pp. 558-562.
[4] Priyanka R Shah, Dinesh B Vaghela and Priyanka Sharma, “Faculty performance evaluation based on prediction in distributed data mining”, 2015 IEEE International Conference on Engineering and Technology (ICETECH), IEEE 2015, pp. 1-5.
[5] Hina Gulati, “Predictive analytics using data mining technique”, 2nd International Conference on Computing for Sustainable Global Development (INDIACom), IEEE 2015, pp. 713-716.
[6] Ashwinkumar U.M and Dr Anandakumar K.R, “Predicting Early Detection of Cardiac and Diabetes Symptoms using Data Mining Techniques”, 2012 2nd International Conference on Computer Design and Engineering (ICCDE 2012), IPCSIT vol. 49 (2012), IACSIT Press, Singapore.
[7] Murat Koklu, Yavuz Unal, “Analysis of a Population of Diabetic Patients Databases with Classifiers”, International Journal of Medical Science and Engineering, World Academy of Science, Engineering and Technology, vol. 7, no. 8, 2013, pp. 772-774.
[8] P. Yasodha, M. Kannan, “Analysis of a Population of Diabetic Patients Databases in Weka Tool”, International Journal of Scientific & Engineering Research, vol. 2, no. 5, May 2011.
[9] Trilok Chand Sharma, Manoj Jain, “WEKA Approach for Comparative Study Classification Algorithm”, International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 4, April 2013, pp. 1925-1931.
[10] MD. Ezaz Ahmed, Dr. Y.K. Mathur and Dr Varun Kumar, “Knowledge Discovery in Health Care Datasets Using Data Mining Tools”, (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 3, no. 4, 2012.
[11] Miroslav Marinov, Abu Saleh Mohammad Mosa, Illhoi Yoo, and Suzanne Austin Boren, “Data-Mining Technologies for Diabetes: A Systematic Review”, Journal of Diabetes Science and Technology, Diabetes Technology Society, vol. 5, no. 6, Nov. 2011.
[12] M. Vijayakamal, Mulugu Narendhar, “A Novel Approach for WEKA & Study On Data Mining Tools”, International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 2, Aug. 2012.
[13] Wynne Hsu, Mong Li Lee, Bing Liu and Tok Wang Ling, “Exploration Mining in Diabetic Patients Databases: Findings and Conclusions”.
