IPASJ International Journal of Computer Science (IIJCS)

Web Site:

A Publisher for Research Motivation ........
Volume 5, Issue 9, September 2017 ISSN 2321-5992


J. Jamila Yasmin Banu (1), Mr. S. Babu (2)
(1) M.Phil Research Scholar, SCSVMV University, Enathur, Kanchipuram, Tamil Nadu, India 631561
(2) Associate Professor, SCSVMV University, Enathur, Kanchipuram, Tamil Nadu, India 631561

Data mining techniques are widely used in medical diagnosis for pattern recognition, data processing and treatment.
Diabetes is a metabolic disease in which high blood sugar levels are sustained over a prolonged period. In this paper we
concentrate on Diabetes Personalized Healthcare Pathways, using a sample of 300 diabetes patient records. The data are
collected and then pre-processed for research purposes. K-means clustering and the knowledge discovery in databases
(KDD) process are then applied for prediction, classification and clustering in order to predict diabetes. Applying
data mining techniques to personalized diabetes health care will lead to the extraction of valuable knowledge and
generate new hypotheses for further research and experimentation in this field. The derived results can be used in both
scientific research and real-life practice to enhance the quality of care for diabetes patients. The research also
indicates that data mining can be used in many fields of medicine and can be developed to help doctors provide
effective treatment and early diagnosis of several diseases.

Keywords: Data mining, Data pre-processing, Classifier, Prediction, Clustering.

Data mining has extensive and intensive applications in many organizations, and its use is increasing, driven in part
by its popularity in the healthcare sector. Data mining applications are of great benefit to all parties in the
healthcare industry. They can also support the healthcare and fitness care industries, for example in detecting
mistreatment and fraud. Healthcare organizations can strengthen their customer relationships through efficient
management decisions, physicians can identify effective treatments, and patients can receive better and more affordable
healthcare services. Data mining provides the methods and knowledge needed to efficiently process and analyze huge
amounts of data and convert them into useful information for decision making. Other advances in technology, such as
wireless sensor networks, make remote monitoring possible and provide real-time data acquisition for timely decisions.
Remote monitoring is critical in healthcare services for elderly people who live alone and for people with diabetes who
cannot be hospitalized. Given the cost of traditional face-to-face healthcare services and the size of these groups,
certain countries insist on remote healthcare services. This study is an effort to foresee the outcome of data mining
on data acquired from a citizens and senior citizens research project. The main challenges in this study are the huge
volume of data and the diversity of its time bases, which is complicated by the different types of signals and data.
Besides that, the results obtained must serve as evidence that motivates patients to improve their self-care. They must
also be applicable as a decision support system for physicians and health centers in diabetes treatment and

Tahani Daghistani and Riyad Alshammari published a paper titled "Diagnosis of Diabetes by Applying Data Mining
Classification Techniques" (Vol. 7, No. 7, 2016). Health care data is enormous and complex, since it contains different
variable types and, at times, missing values. The knowledge contained in such data cannot be ignored, and extracting the
appropriate information from it is a necessity. Data mining is the most suitable technique for extracting knowledge from
big data by building models from health care data such as
diabetic patient data sets. Three data mining algorithms are discussed in that research, namely C4.5, Self-Organizing
Map (SOM) and Random Forest. These methods are applied to adult population data received from the Ministry of National
Guard Health Affairs (MNGHA), Saudi Arabia, to predict diabetic patients using 18 risk factors. Among the three
techniques, Random Forest achieved the best performance.

Sukhjinder Singh and Kamaljit Kaur published "A Review on Diagnosis of Diabetes in Data Mining" (Volume 4, Issue 6,
June 2015). The authors describe data mining as a tool used for several purposes, in medicine, industry and elsewhere,
to extract useful information from huge data sets. Health monitoring also uses data mining to predict and diagnose
disease. In health monitoring, diabetes is the most common problem and affects a large population. Several data mining
techniques and algorithms have been applied to analyze diabetes, among them the Artificial Neural Fuzzy Inference
System, Neural Networks, Genetic Algorithms, K-Nearest Neighbor (KNN) and the Back Propagation algorithm. These
techniques delivered better results in diagnosing diabetes.

Dr. B. L. Shivakumar and S. Alby published "A Survey on Data-Mining Technologies for Prediction and Diagnosis of
Diabetes" (2014 International Conference on Intelligent Computing Applications). WHO reports show a remarkable rise in
the number of diabetic patients, a trend that may continue in the coming decades. Early prediction of the disease
remains an important challenge. Based on earlier literature reviews, data mining plays an important role in diabetes
research. It is a valuable asset for diabetes researchers, since it can unearth hidden knowledge from a huge amount of
diabetes-related data. Several diabetes studies use data mining techniques to improve the quality of health care for
diabetes patients. The paper provides a survey of the data mining methods commonly applied to diabetes data analysis
and prediction of the disease.

Ning Wang and Guixia Kang propose a novel type 2 diabetes mellitus (T2DM) monitoring system. The system decides on the
status of diabetes control and predicts an individual's future blood glucose, relying on original medical data that is
either entered manually or generated automatically. When dealing with such data, the authors first clean and transform
the data and contexts and then build data mining models with the help of several mining algorithms. After the mining
process, they analyze and assess the accuracy and sensitivity of the resulting models to find the appropriate ones for
decision making and prediction in diabetes control.

Panigrahi Srikanth and Dharmaiah Deverapalli predict diabetes using classification algorithms from data mining.
Classification algorithms and tools help to reduce the heavy workload placed on doctors. In their paper, classification
algorithms are evaluated on several diabetes patient datasets. Classification is treated as one of the main data mining
techniques. The paper examines the decision tree algorithm, rule-based algorithms and the Bayes algorithm. These
algorithms are compared by their error rates, and patients are identified based on an evaluation function that measures
the accuracy of the results.



A multi-sensor integrated system is designed to collect data from patients such as blood pressure, blood glucose (BG),
daily physical activity, weight, calorie consumption and emotional state. Bluetooth-enabled glucose meters are used to
measure BG and send the data to the server. Diabetics must take BG measurements six times a day: BFB (Before Breakfast),
AFB (After Breakfast), BFL (Before Lunch), AFL (After Lunch), BFD (Before Dinner) and AFD (After Dinner).


Many consider data mining and knowledge discovery to be the same, whereas others view data mining as an essential step
in knowledge discovery. The steps involved in the knowledge discovery process are listed below.

Data Cleaning: Noise and other pollution in the data, such as inconsistent data, are removed.
Data Integration: Data from various sources are combined.
Data Selection: Data relevant to the subject are retrieved from the database.
Data Transformation: Data are converted or consolidated into forms appropriate for mining by performing summary or
aggregation operations.
Data Mining: Intelligent methods are applied to extract data patterns.
Pattern Evaluation: The extracted data patterns are evaluated.
Knowledge Presentation: The discovered knowledge is presented.
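
The knowledge discovery steps listed above can be sketched as a small pipeline. This is only an illustrative outline
under stated assumptions, not the tooling used in this study: the two record sources, the field names, the readings and
the threshold of 7.0 are all hypothetical, and the "mining" step is a trivial stand-in for a real algorithm.

```python
from statistics import mean

# Hypothetical records from two sources; every field name and value is invented.
clinic_a = [
    {"patient": "p1", "age": 45, "bg": 7.2},
    {"patient": "p2", "age": 61, "bg": None},  # missing reading
    {"patient": "p1", "age": 45, "bg": 9.0},
]
clinic_b = [
    {"patient": "p3", "age": 38, "bg": 5.4},
    {"patient": "p3", "age": 38, "bg": 6.0},
]


def clean(records):
    """Data cleaning: drop records whose BG reading is missing."""
    return [r for r in records if r["bg"] is not None]


def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the question."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate readings to one mean BG per patient."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["patient"], []).append(r["bg"])
    return {patient: mean(values) for patient, values in grouped.items()}


def mine(mean_bg, threshold=7.0):
    """Data mining stand-in: flag patients whose mean BG exceeds a made-up threshold."""
    return {patient: value > threshold for patient, value in mean_bg.items()}


records = integrate(clean(clinic_a), clean(clinic_b))
mean_bg = transform(select(records, ["patient", "bg"]))
flags = mine(mean_bg)

# Pattern evaluation: a simple summary of how many patients were flagged.
flagged_share = sum(flags.values()) / len(flags)

# Knowledge presentation: report the result in readable form.
for patient, flagged in sorted(flags.items()):
    print(f"{patient}: mean BG {mean_bg[patient]:.1f}, flagged={flagged}")
print(f"share flagged: {flagged_share:.0%}")
```

Keeping each step as its own function mirrors the list: any single stage can be replaced, for example by a real
clustering algorithm in the mining step, without touching the others.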

The first step in data preparation is importing the recorded data, in text format, into the GMDH Shell data mining and
forecasting tool. Since the main goal of this work concerns diabetes, BG data is treated as the primary data. The other
related data are then prepared as secondary data.

Data other than the primary and secondary data are disregarded in the analysis step. In diabetes care, blood samples
are examined before each meal and again one or two hours after the meal, so the measurement is repeated six times per
day. By default the measurements are assumed to be taken at 6, 9 and 11 AM and at 2, 5 and 8 PM, and each measured BG
level is assigned to the closest of these time stamps. This assignment was done in order to analyze and visualize
diabetes behavior. A similar preparation procedure is applied to the systolic and diastolic blood pressure
measurements, within the time domain bounded by the BG measurements.
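
As a minimal sketch of the time stamp assignment described above, the function below maps a measurement time to the
nearest default slot. The pairing of the six slot labels (BFB to AFD) with the six default hours in order is an
assumption made here for illustration, as are the function names; ties go to the earlier slot and wrap-around past
midnight is ignored.

```python
from datetime import time

# Default measurement hours from the text: 6, 9, 11 AM and 2, 5, 8 PM.
SLOT_HOURS = [6, 9, 11, 14, 17, 20]
# Assumed pairing of labels with hours, in the order both are listed.
SLOT_LABELS = ["BFB", "AFB", "BFL", "AFL", "BFD", "AFD"]


def minutes_since_midnight(t: time) -> int:
    return t.hour * 60 + t.minute


def closest_slot(measured_at: time) -> str:
    """Return the label of the default slot nearest to the measurement time."""
    m = minutes_since_midnight(measured_at)
    # min() keeps the first minimum, so an exact tie goes to the earlier slot.
    index = min(range(len(SLOT_HOURS)), key=lambda i: abs(SLOT_HOURS[i] * 60 - m))
    return SLOT_LABELS[index]


# Hypothetical measurement times.
for t in (time(6, 40), time(10, 15), time(15, 0), time(21, 30)):
    print(t.strftime("%H:%M"), "->", closest_slot(t))
```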

Daily physical activity at different levels has been averaged over each half hour and is used as the basis for patient
activity and calorie consumption. Each activity level has a constant coefficient in terms of calorie consumption.
Because the coefficients used by the application are unknown, an estimation procedure is used instead: the coefficients
related to calorie consumption are estimated with the Solver tool in Microsoft Excel.

Activity at time t = activity level 1 * 0.42 + activity level 2 * 1 + activity level 3 * 3.37
                     + activity level 4 * 5.97 + activity level 5 * 6.01

Here activity levels 1 to 5 are the passive, low, medium, high and very high activity measurements respectively. Daily
calorie consumption, feelings, weight and measurement data are prepared on the time domain of the BG measurements.
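
The weighted sum above is straightforward to compute. The sketch below uses the five coefficients given in the text;
the function name and the sample half-hour values are hypothetical.

```python
# Coefficients for activity levels 1 to 5 (passive, low, medium, high, very high),
# as given in the text.
COEFFICIENTS = (0.42, 1.0, 3.37, 5.97, 6.01)


def activity_at_t(levels):
    """Weighted sum of the five activity-level measurements for one time point."""
    if len(levels) != len(COEFFICIENTS):
        raise ValueError("expected one measurement per activity level (5 values)")
    return sum(c * x for c, x in zip(COEFFICIENTS, levels))


# Hypothetical averaged measurements for one half-hour window.
sample = (10, 5, 2, 0, 0)
print(round(activity_at_t(sample), 2))
```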


In diabetes control, daily diet and injected insulin doses are the important factors, while the remaining factors, such
as daily activity and feelings, are minor. In this study we did not obtain information on insulin injections or daily
diet. Hence we treat the patient's lifestyle as an important aspect affecting diabetes self-care. Frequency analysis
generates statistics and graphical displays that are useful for analyzing the measured data. Histograms, frequency
reports and bar charts help us visualize how the data are distributed across different categories.
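
A frequency report of this kind can be sketched in a few lines. The readings and the category boundaries below are
invented purely for illustration and are not clinical thresholds or data from this study.

```python
from collections import Counter

# Hypothetical BG readings.
readings = [4.8, 5.6, 7.9, 11.2, 6.3, 9.4, 5.1, 12.8, 7.0, 6.6]


def category(bg):
    """Assign a reading to an illustrative category (made-up bin edges)."""
    if bg < 6.0:
        return "low"
    if bg < 9.0:
        return "medium"
    return "high"


freq = Counter(category(value) for value in readings)

# Text bar chart: one '#' per reading, followed by the relative frequency.
for name in ("low", "medium", "high"):
    share = freq[name] / len(readings)
    print(f"{name:<7}{'#' * freq[name]} {share:.0%}")
```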


GMDH Shell is an easy-to-use tool for data mining and forecasting of multi-parametric datasets. It also carries out
fully automatic structural and parametric optimization of models.

Useful fields:
Data mining,
Predictive analytics,
Time series analysis, and
Forecasting and knowledge discovery.

The software tool supports machine learning and can make effective use of multiprocessor, multi-core and clustered
computers.

GMDH Shell makes data processing much easier. It can automatically identify usable data inside a file, transform the
data according to the problem type, drop irrelevant inputs, and build a set of analytical and predictive models on the
basis of optimal complexity detection and self-organization principles.

The patient data are plotted by OCCUPATION as three data clusters. Each of the three cluster centroids is marked with
a +, indicating the average point in space for that cluster. Outliers are detected as values that fall outside the
clusters.

The centroid is the most typical case in a cluster. For example, in a data set consisting of patient ages and incomes,
the centroid of each cluster would be a patient of average age and average income in that cluster. If the data set
includes gender, the centroid takes the gender most frequently represented in the cluster.

The centroid is a prototype and does not necessarily correspond to any actual case assigned to the cluster. Its
attribute values are the mean of the numerical attributes and the mode of the categorical attributes. Data mining also
supports the scoring operation for clustering: in addition to constructing the clusters from the build data, clustering
models create a Bayesian probability model that can be used to score new data.
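
The centroid rule (mean for numerical attributes, mode for categorical ones) can be illustrated as follows. The three
patient records are hypothetical and were chosen so that the resulting prototype matches none of the actual cases.

```python
from statistics import mean, mode

# Hypothetical cases assigned to one cluster.
cluster = [
    {"age": 52, "income": 30000, "gender": "F"},
    {"age": 48, "income": 34000, "gender": "F"},
    {"age": 56, "income": 32000, "gender": "M"},
]


def centroid(cases, numeric, categorical):
    """Prototype of a cluster: mean of numeric attributes, mode of categorical ones."""
    proto = {attr: mean(case[attr] for case in cases) for attr in numeric}
    proto.update({attr: mode(case[attr] for case in cases) for attr in categorical})
    return proto


proto = centroid(cluster, numeric=["age", "income"], categorical=["gender"])
print(proto)
print("matches an actual case:", proto in cluster)
```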


Step 1: The number of clusters k is fixed in advance; the data will be partitioned into k groups.
Step 2: k points are selected at random as the initial cluster centers.
Step 3: Each object is assigned to its closest cluster center based on the Euclidean distance function.
Step 4: The centroid (mean) of all objects in each cluster is computed and becomes the new cluster center.
Step 5: Steps 3 and 4 are repeated until the same points are assigned to each cluster in consecutive rounds.
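
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation
used in this study: the six 2-D points are hypothetical, the `seed` and `max_rounds` parameters are added here for
reproducibility and safety, and only the assignment and centroid steps are repeated after the random initialization.

```python
import math
import random


def kmeans(points, k, seed=0, max_rounds=100):
    """Basic K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random points as initial centers
    assignment = None
    for _ in range(max_rounds):
        # Assign each point to its closest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # same assignment in two consecutive rounds: converged
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

Cluster labels depend on which points are drawn as initial centers, so only the grouping, not the label numbers,
should be compared between runs.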

K-means is an efficient method, but the number of clusters has to be specified at the start of the algorithm. The final
result depends on this initial choice, and the algorithm may terminate at a local optimum. Unfortunately, there is no
generally accepted theoretical method for computing the optimal number of clusters. The most practical approach is to
compare the outcomes of multiple runs with different k and choose the best one according to a predefined criterion. In
general, a large value of k will probably decrease the error but increases the risk of overfitting.
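
One way to compare runs with different k is to track the within-cluster sum of squared errors (SSE) and the share of
variance explained, 1 - SSE(k) / SSE(1). The sketch below does this for hypothetical one-dimensional data. The
deterministic quantile-based initialization and the 95% selection threshold are assumptions introduced here to keep the
example reproducible; they are not criteria taken from this study.

```python
def kmeans_1d(values, k, max_rounds=100):
    """K-means on a list of numbers with a deterministic, evenly spread start."""
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    assignment = None
    for _ in range(max_rounds):
        new_assignment = [
            min(range(k), key=lambda j: abs(v - centers[j])) for v in values
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assignment


def sse(values, centers, assignment):
    """Within-cluster sum of squared errors."""
    return sum((v - centers[a]) ** 2 for v, a in zip(values, assignment))


# Hypothetical measurements forming three groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 9.0, 8.8]

errors = {}
for k in range(1, 5):
    centers, assignment = kmeans_1d(values, k)
    errors[k] = sse(values, centers, assignment)

explained = {k: 1 - errors[k] / errors[1] for k in errors}

# Assumed criterion: the smallest k that explains at least 95% of the variance.
chosen_k = min(k for k in explained if explained[k] >= 0.95)

for k in errors:
    print(f"k={k}  SSE={errors[k]:.3f}  explained={explained[k]:.1%}")
print("chosen k:", chosen_k)
```

The error keeps shrinking as k grows, which is why a threshold or an elbow in the curve, rather than the minimum error
itself, is used to pick k.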


The main aim of this proposal is to predict which age groups are affected by diabetes, taking their lifestyle
activities into account, and to find the factors responsible for diabetes. For this purpose, a statistical method from
the medical field is proposed to understand which age group is most affected due to
Fig. 1: Minimum, maximum and mean values of pregnancy, skin, glucose and BP for the clusters

Fig. 2: The X axis denotes the number of clusters and the Y axis the average variance. The second cluster depends on
average variance below 55%; values under 55% variance are eliminated by the tool. Four clusters depend on the diabetes
analysis under 40%, because NPREG has 83% mean values.

Fig. 3: Combination of pregnancy and glucose. The X axis represents the number of clusters and the Y axis the variance
explained. Below 30% variance diabetes is low, and above 70% variance diabetes is high.


Data mining and machine learning help extract concealed patterns from big data in the field of medical care. They can
be used to examine vital clinical parameters, predict diseases, and assist with tasks in pharmaceuticals, treatment
planning support and patient administration. Various algorithms have been proposed for predicting and analyzing
diabetes. The existing methods provide more precise information than the conventional frameworks available.
The models built by data mining algorithms can support decision making in various fields, including health care. In
this research, real health care data were collected and pre-processed, and classification, prediction and clustering
were carried out. Time series data mining was also evaluated, in order to propose data mining models that predict
diabetic patients from health care data sets.

The outcome of the research demonstrates that the constructed data mining model can assist health care providers in
making affordable, scientific decisions to classify diabetic patients. Additionally, the model can be further developed
to consider patient protection as well. In future, the results can be used to form a control plan for diabetes, since
diabetic patients are often not identified until a later stage of the disease is reached.