


GA_RBF NN: a classification system for diabetes

Dilip Kumar Choubey* and Sanchita Paul


Department of Computer Science & Engineering,
Birla Institute of Technology, Mesra,
Ranchi 835215, India
Email: dilipchoubey_1988@yahoo.in
Email: sanchita07@gmail.com
*Corresponding author

Abstract: Modern society is prone to many life-threatening diseases which, if diagnosed early, can be easily controlled. The implementation of disease diagnostic systems has therefore gained popularity over the years. The main aim of this research is to provide a better diagnosis of diabetes. Several methods have already been implemented for the diagnosis of the diabetes dataset. The proposed approach consists of two stages: in the first stage, a Genetic Algorithm (GA) is used for attribute (feature) selection and reduces the 8 attributes to 4; in the second stage, a Radial Basis Function Neural Network (RBF NN) is used for classification, both on the selected attributes and on all the attributes. The experimental results show the performance of the proposed methodology on the Pima Indian Diabetes Dataset (PIDD) and provide better classification for the diagnosis of diabetes patients on PIDD. The GA removes insignificant features, reducing the cost and computation time and improving the accuracy and ROC of classification. The proposed method can also be applied to other kinds of medical diseases.

Keywords: PIDD; genetic algorithm; radial basis function neural network;


diagnosis; diabetes; attribute selection; classification; ROC; accuracy; precision;
recall; F-measure; confusion matrix.

Reference to this paper should be made as follows: Choubey, D.K. and


Paul, S. (2017) ‘GA_RBF NN: a classification system for diabetes’,
Int. J. Biomedical Engineering and Technology, Vol. 23, No. 1, pp.71–93.

Biographical notes: Dilip Kumar Choubey received his MTech degree in Computer Science and Engineering from Oriental College of Technology (O.C.T.), Bhopal, India, and his BE degree in Information Technology from Bansal Institute of Science and Technology (B.I.S.T.), Bhopal, India. Currently, he is pursuing a PhD at Birla Institute of Technology (B.I.T.), Mesra, Ranchi, India. He worked as an Assistant Professor at Lakshmi Narain College of Technology (L.N.C.T.), Bhopal, India, and Oriental College of Technology (O.C.T.), Bhopal, India. He has five years of teaching and research experience. His research interests include machine learning, soft computing, bioinformatics, data mining, pattern recognition and database management systems. He has ten international and one national publications.

Sanchita Paul received her PhD and ME degrees in Computer Science & Engineering from Birla Institute of Technology, Mesra, Ranchi, India, and her BE degree in Computer Science & Engineering from Burdwan University, West Bengal, India. She has approximately 10 years of teaching and 11 years of research experience. She has approximately 43 international publications. Her research areas include cloud computing, big data, NLP, soft computing, bioinformatics, AI & machine learning and data mining.

1 Introduction

Diabetes is a chronic disease and a major public health challenge worldwide. Diabetes occurs when the body is not able to produce or respond properly to insulin, which is needed to regulate glucose levels. Diabetes can be controlled with insulin injections, oral medications, a controlled diet (changed eating habits) and exercise programmes, but no complete cure is available. The three main diabetes symptoms are an increased need to urinate (polyuria), increased hunger (polyphagia) and increased thirst (polydipsia). There are two main types of diabetes: Type 1 (juvenile, insulin-dependent, brittle or sugar) diabetes and Type 2 (adult-onset or non-insulin-dependent) diabetes. Type 1 diabetes mostly develops in children and young adults but can appear at any age; 5–10% of people with diabetes have Type 1. In this type, the insulin-producing beta cells are destroyed and patients require regular insulin injections to survive. Type 2 diabetes is the most common type, accounting for at least 90% of all diabetes cases. It mostly develops in people over 40 years of age but can also be found in younger groups. In this type, the body becomes resistant to insulin and does not effectively use the insulin being produced. It can be controlled with lifestyle modification and oral medications; in some severe cases, insulin injections may also be required, but again no complete cure is available.
Until recently, physicians used to diagnose the disease on the basis of experience and the patient's clinical data, that is, laboratory test reports. This kind of diagnosis is time consuming, mainly because it depends entirely on the availability and experience of the physicians, who have to deal with imprecise and uncertain clinical data. To improve decision making with clinical data and to reduce time consumption, a good (intelligent) diagnosis system is needed. Here we analyse the input data (the patients' data, i.e., PIDD) and develop an accurate description or model for each class using the attributes in the dataset, on the basis of which a diagnosis system can easily be developed. Researchers have noted that even experienced physicians are not always able to detect a disease quickly and accurately, and it remains a challenge for physicians to identify the disease both accurately and speedily. The dataset used here is precise: no missing values were found and it is free of noise. In this paper, for the analysis and diagnosis of diabetes disease, the proposed method is implemented and evaluated using the GA for attribute selection and the RBF NN for classification. By applying the GA, 4 attributes are obtained from the original 8. In particular, the main aim of attribute selection is to reduce the number of features used in classification while sustaining acceptable ROC and classification accuracy. Reduction in the number of attributes is critical in statistical learning. Notably, the attribute selection process helps to preserve storage capacity, computation time (shorter training and test times) and computation cost, and increases the classification rate and comprehensibility. The RBF NN is a supervised learning method for classification, and in this study it has been used for the classification (diagnosis) of diabetes disease.
GA_RBF NN 73

The rest of the paper is organised as follows: a brief description of GA and RBF NN is given in Section 2, related work is presented in Section 3, the proposed methodology is discussed in Section 4, results and discussion are presented in Section 5, and discussion and future directions are given in Section 6.

2 Brief description of GA and RBF NN

2.1 GA
John Holland introduced the GA in the 1970s at the University of Michigan (USA). GAs are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. The GA is an adaptive, population-based optimisation technique inspired by Darwin's theory (Darwin, 1859) of survival of the fittest. The GA mimics the natural evolution process described by Darwin: the next population is evolved by simulating the operators of selection, crossover and mutation. John Holland is known as the father of the original GA, having first introduced these operators in Holland (1975); Goldberg (1989) and Michalewicz (1996) later improved them. The advantages of the GA (Choubey et al., 2015; Choubey and Paul, 2015) are that its concepts are easy to understand, it solves problems with multiple solutions, it is a global and blind search method, it can easily be used on parallel machines, it is modular (separate from the application), it supports multi-objective optimisation, it is good for 'noisy' environments, and it is inherently parallel and easily distributed. Its limitations are that, for certain optimisation problems, there is no absolute assurance of a global optimum, constant optimisation response times cannot be assured, and the exact solution may not be found. The GA can be applied in artificial creativity, bioinformatics, chemical kinetics, gene expression profiling, control engineering, software engineering, the travelling salesman problem, mutation testing, quality control and business. Mainly, the GA applies certain operators, i.e., selection, crossover and mutation, at each step to build the next generation from the current population. The GA and its selection, crossover and mutation operators are illustrated in more detail by Choubey and Paul (2016).

2.1.1 Selection
It is also called reproduction phase whose primary objective is to promote good solutions
and eliminate bad solutions in the current population, while keeping the population size
constant. This is done by identifying good solutions (in terms of fitness) in the current
population and making duplicate copies of these. Now in order to maintain the
population size constant, eliminate some bad solutions from the populations so that
multiple copies of good solutions can be placed in the population. In other words, those
parents from the current population are selected in selection phase who together will
generate the next population. The various methods like Roulette-wheel selection,
Boltzmann selection, Tournament selection, Rank selection, Steady-state selection, etc.,
are available for selection but the most commonly used selection method is Roulette
wheel. Fitness value of individuals plays an important role in these all selection
procedures.
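To make the selection step concrete, the sketch below is an illustrative Python/NumPy example (not the authors' Java/Weka implementation) of roulette-wheel selection, where each individual's chance of entering the mating pool is proportional to its fitness.

```python
import numpy as np

def roulette_wheel_selection(population, fitness, rng=np.random.default_rng(0)):
    """Select len(population) parents with probability proportional to fitness."""
    fitness = np.asarray(fitness, dtype=float)
    # Assumes non-negative fitness values (e.g., classification accuracy).
    probs = fitness / fitness.sum()
    # Spin the wheel once per parent slot; fitter individuals are picked more often.
    idx = rng.choice(len(population), size=len(population), p=probs)
    return [population[i] for i in idx]

# Toy usage: four 8-bit chromosomes (one bit per PIDD attribute) with made-up fitness values.
pop = [[1,0,1,1,0,0,1,0], [1,1,1,1,1,1,1,1], [0,0,0,1,0,1,0,1], [1,0,0,0,1,1,0,1]]
parents = roulette_wheel_selection(pop, fitness=[0.71, 0.69, 0.74, 0.77])
print(parents)
```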
74 D.K. Choubey and S. Paul

2.1.2 Crossover
It is to be noticed that the selection operator makes only multiple copies of better
solutions than the others but it does not generate any new solution. So in crossover phase,
the new solutions are generated. First two solutions from the new population are selected
either randomly or by applying any stochastic rule and brought over to mating pool in
order to create two offsprings. It is not necessary that the newly generated offsprings is
more, because the offsprings have been created from those individuals which have
survived during the selection phase. So the good bit strings combinations in parents
which will be carried over to offsprings. Even if the newly generated offsprings are not
better in terms of fitness then it should not be a botheration about because they will be
eliminated in next selection phase. In the crossover phase, new offsprings are made from
those parents, who were selected in the selection phase. There are various crossover
methods available like single-point crossover, two-point crossover, multi-point crossover
(n-point crossover), uniform crossover, matrix crossover (two-dimensional crossover),
etc.
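A minimal sketch of single-point crossover (illustrative Python, not the paper's implementation): two parent bit strings are cut at a random point and their tails are swapped to produce two offspring.

```python
import random

def single_point_crossover(parent1, parent2, crossover_rate=0.9, rng=random.Random(0)):
    """Return two offspring; with probability (1 - crossover_rate) the parents pass unchanged."""
    assert len(parent1) == len(parent2)
    if rng.random() > crossover_rate:
        return parent1[:], parent2[:]
    point = rng.randint(1, len(parent1) - 1)      # cut somewhere inside the string
    child1 = parent1[:point] + parent2[point:]    # head of parent1 + tail of parent2
    child2 = parent2[:point] + parent1[point:]    # head of parent2 + tail of parent1
    return child1, child2

print(single_point_crossover([1,1,1,1,1,1,1,1], [0,0,0,0,0,0,0,0]))
```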

2.1.3 Mutation
Mutation of an individual takes place with a very low probability. If any bit of an
individual is selected to be mutated then it is flipped with a possible alternative value for
that bit. For example, the possible alternative value for 0 is 1 and 1 for 0 in binary string
representation case i.e. 0 is flipped with 1 and 1 is flipped with 0. The mutation phase is
applied next to crossover to keep diversity in the population. Again, it is not always
possible to get better offsprings after mutation but it is done to search for few solutions in
the neighbourhood of original solutions.
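An illustrative Python sketch of bit-flip mutation (again not the paper's own code): each bit is flipped independently with a small probability.

```python
import random

def bit_flip_mutation(chromosome, mutation_rate=0.01, rng=random.Random(0)):
    """Flip each bit independently with a small probability to keep diversity."""
    return [1 - bit if rng.random() < mutation_rate else bit for bit in chromosome]

# A high rate is used here only so that the effect is visible in a single call.
print(bit_flip_mutation([1,0,1,1,0,0,1,0], mutation_rate=0.2))
```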

2.2 RBF NN
The RBF NN is one of the models of ANNs (NNs). The advantages of NNs (Choubey et al., 2015; Choubey and Paul, 2015) are their mapping capability (pattern association), generalisation, robustness, fault tolerance, parallel and high-speed information processing, good pattern recognition, the fact that no mathematical process model or rule-based knowledge is required, and the availability of different learning algorithms. Their limitations are that they need training to operate, require long processing times for large networks, are not good at explaining how they reach their decisions (they are black boxes from which rules cannot be extracted), cannot use prior knowledge, offer no guarantee that learning converges, and require heuristic parameters to be determined.
NNs can be applied in pattern recognition, image processing, optimisation, constraint satisfaction, forecasting, risk assessment and control systems. The RBF NN is a supervised feed-forward network with one hidden layer of hidden units, called Radial Basis Functions (RBFs). Because RBF networks are supervised neural networks, they require a desired response to be trained. The RBFs learn how to transform the input data into the desired response, a quality that makes them widely used in pattern classification studies. In the present study, a training algorithm that normally uses the gradient-descent rule trains certain parameters of these networks. RBF networks are very popular for time-series prediction, function approximation, curve fitting, control and classification problems, and have adaptive and self-learning ability.
GA_RBF NN 75

The RBF NN differs from other NNs in several distinctive features: because of their universal approximation capability, more compact topology and faster learning speed, RBF NNs have attracted much attention and have been widely applied in many science and engineering fields. The structure of the RBF NN is shown in Figure 1.

Figure 1 Feed forward neural network model for diagnosis of diabetes

The RBF NN consists of three layers: one input layer, one hidden layer and one output layer. A layer is a vector of units; each layer consists of one or more nodes (neurons), represented by small circles, and the lines between nodes indicate the flow of information from one node to another. In the input layer, the number of neurons equals the number of input dimensions. The input layer receives the input and has no function except buffering the input signal (Selvakuberan et al., 2011); the links from the inputs to the hidden layer are not weighted. The input layer is a set of distribution nodes. Any layer formed between the input and output layers is called a hidden layer. The hidden layer is a set of nodes, each characterised by a Gaussian radial basis function; it computes the values of the RBFs on the data received from the input layer and transmits the results to the output layer through weighted links. The output layer calculates the linear sum of the hidden-neuron values and thereby generates the output of the network, i.e., classifies the result. The output layer is a set of nodes, each of which gives one output.
Notably, the Gaussian function is widely utilised as the activation function, and it has therefore been implemented here as the RBF. Let $\phi_j(x)$ be the $j$th radial basis function, represented as:

$$\phi_j(x) = \exp\left( -\frac{\| x - c_j \|^2}{2\sigma_j^2} \right) \qquad (1)$$

Here, $x = (x_1, x_2, \ldots, x_d)^T$ is the input vector, and $c_j = (c_{1j}, c_{2j}, \ldots, c_{dj})^T$ and $\sigma_j^2$ are the $j$th centre vector and the width parameter, respectively. The output $Y$ of the RBF network, which is the linear sum of the radial basis functions, is given as follows:

$$Y = \sum_{j=1}^{p} W_j \, \phi_j(x) \qquad (2)$$

where $Y$ is the output of the RBF network, $p$ is the number of hidden-layer neurons, and $W_j$ is the weight from the $j$th neuron to the output layer. To construct the RBF network, the number of hidden-layer neurons $p$ must be set, and the centres $c_j$, the widths $\sigma_j$ and the weights $W_j$ must be estimated. Learning in RBF NNs proceeds in two phases:
Phase I: Compute the centres of the RBF kernels and fix their widths.
Phase II: Use the delta rule to adjust the weights until convergence.
The centres in static RBF networks may be selected in one of the following ways:
 A set of random input patterns
 A set of grid points in the input space
 A set of optimal locations using clustering algorithms.
In typical RBF learning, the network structure is determined based on prior knowledge or the experience of experts.
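The forward pass of equations (1) and (2) can be written in a few lines. The sketch below is an illustrative NumPy example, with randomly chosen centres, widths and weights standing in for trained parameters; it computes the Gaussian activations $\phi_j(x)$ and their weighted sum $Y$.

```python
import numpy as np

def rbf_output(x, centres, widths, weights):
    """Equations (1)-(2): Gaussian basis activations and their linear combination."""
    x = np.asarray(x, dtype=float)
    # phi_j(x) = exp(-||x - c_j||^2 / (2 * sigma_j^2)) for each centre c_j
    sq_dist = np.sum((centres - x) ** 2, axis=1)
    phi = np.exp(-sq_dist / (2.0 * widths ** 2))
    # Y = sum_j W_j * phi_j(x)
    return float(weights @ phi), phi

# Toy usage: 4 input attributes and 3 hidden units with assumed (untrained) parameters.
rng = np.random.default_rng(0)
centres = rng.normal(size=(3, 4))     # c_j, one row per hidden unit
widths = np.array([1.0, 1.5, 2.0])    # sigma_j
weights = np.array([0.4, -0.2, 0.7])  # W_j
y, phi = rbf_output(rng.normal(size=4), centres, widths, weights)
print("hidden activations:", phi, "network output:", y)
```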

3 Related work

Polat and Gunes (2007) applied Principal Component Analysis (PCA) and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to improve the diagnostic accuracy of diabetes disease: PCA is used to reduce the dimensionality of the diabetes dataset features, and ANFIS performs the diagnosis, i.e., classification on the reduced features. Seera and Lim (2014) introduced a hybrid intelligent system combining a Fuzzy Min-Max neural network, a Classification and Regression Tree (CART) and a Random Forest (RF) for the classification of medical data. The methodology was applied to several datasets, including Breast Cancer Wisconsin, PIDD and Liver Disorders, and performs better than other existing techniques. Temurtas et al. (2009) used the Levenberg-Marquardt (LM) algorithm and a Probabilistic Neural Network (PNN) to train a multilayer neural network, with tenfold cross-validation used for estimating the results; the LM, PNN and 10-fold cross-validation combination provides better correct classification than the conventional validation method. Dogantekin et al. (2010) used Linear Discriminant Analysis (LDA) and ANFIS for the diagnosis of diabetes: LDA separates the feature variables between healthy and diabetic patients, and ANFIS classifies the result produced by LDA. The techniques used provide better accuracy than the previously existing results, so physicians can make very accurate decisions by using such an efficient tool. Barakat et al. (2010) worked on the classification of diabetes using a machine learning approach, the Support Vector Machine (SVM), and implemented a new and efficient technique for the classification of diabetes mellitus using SVM. A sequential covering approach for the generation of extraction rules is implemented

using the concept of SVM, which is an efficient supervised learning algorithm. The paper also discusses the Eclectic rule-extraction technique for extracting rule-set attributes from the dataset so that the selected attributes can be used for the classification of diabetes mellitus. Orkcu and Bal (2011) compared a backpropagation neural network, a binary-coded genetic algorithm and a real-coded genetic algorithm for the classification of medical datasets. Aslam et al. (2013) implemented an expert system for the classification of diabetes data using Genetic Programming (GP). The technique consists of three stages: the first stage performs feature selection using the t-test, the Kolmogorov-Smirnov test and the Kullback-Leibler divergence test; the next stage uses GP for the non-linear combination of the attributes selected in the first stage; and in the final stage the features generated using GP are compared with K-Nearest Neighbour (KNN) and SVM classifiers. The classification is performed on PIDD, which consists of 768 instances, 8 attributes and one output (class) variable that takes the value '1' or '0'. The selected features are then used for the classification of diabetes patients with high classification accuracy.
Luukka (2011) used a fuzzy entropy measure and a similarity classifier for better classification of diabetes: fuzzy entropy is used for feature selection and the similarity classifier performs classification on the selected features. The techniques provide much lower computation time and cost and enhanced classification accuracy by reducing noise, and they are more transparent and comprehensible because insignificant features are removed from the dataset. Polat et al. (2008) proposed a hybrid combination of Generalised Discriminant Analysis (GDA) and the Least Squares Support Vector Machine (LS-SVM) for the classification of diabetes disease. The methodology is implemented in two stages: in the first stage the data are pre-processed using GDA so that healthy and diabetic patients can be discriminated; in the second stage the LS-SVM is applied for classification. LS-SVM alone achieves an accuracy of about 78.21% based on 10-fold cross-validation, while the combined methodology obtains a classification accuracy of about 82.05%.
Selvakuberan et al. (2011) used the Ranker search method together with K-star, REP tree, Naive Bayes, logistic, dagging and multiclass classifiers: the Ranker search approach is used for feature selection and the remaining methods for classification. The techniques provide a reduced feature set with higher classification accuracy.
Qasem and Shamsuddin (2011) introduced Time-Variant Multi-Objective Particle Swarm Optimisation (TVMOPSO) of Radial Basis Function (RBF) networks for diagnosing medical diseases. RBF networks are trained to determine whether they can be developed using TVMOPSO, and the performance is validated in terms of accuracy and complexity.
Kala et al. (2011) proposed a new methodology for the diagnosis of breast cancer using neural networks, in which a mixture of expert models is grouped to solve the problem and the decisions of the individual expert systems are combined to give the final output. The proposed architecture solves breast cancer diagnosis by individually evolving neural networks with a Genetic Algorithm (GA); the experiments show that the methodology is highly scalable and provides efficient results on the attributes and data items. Sarfaraz et al. (2014) analysed and generated reports for the evaluation of a bio-artificial liver reactor, in which a Fuzzy Analytic Hierarchy Process (FAHP) is implemented to handle the uncertainty in evaluating the Bio-Artificial Liver (BAL). The methodology is more scalable than other existing techniques and provides an efficient final score for evaluating the bio-artificial liver. Miller and Leroy (2008) also proposed a new and efficient technique for the dynamic generation of health topics. The paper implements dynamic health-topic web pages that maintain information in four consumer-preferred categories; the methodology provides a precision of 82%, a recall of 75% and an F-score of 78%.

4 Proposed methodology

In the proposed approach, the GA is implemented and evaluated for attribute selection and the RBF NN for classification on PIDD from the UCI repository of machine learning databases. The proposed algorithm and the corresponding block diagram are discussed hereafter.

4.1 Proposed algorithm


Step 1: Start
Step 2: Load PIDD
Step 3: Initialise the parameters for the GA
Step 4: Call the GA
Step 5.1: Construction of the first generation
Step 5.2: Selection
While stopping criteria not met do
    Step 5.3: Crossover
    Step 5.4: Mutation
    Step 5.5: Selection
End
Step 6: Apply RBF NN classification
Step 7: Training dataset
Step 8: Calculation of error and accuracy
Step 9: Testing dataset
Step 10: Calculation of error and accuracy
Step 11: Stop

The proposed approach works in the following phases:

1. The PIDD is obtained from the UCI repository of machine learning databases.
2. The GA is used for attribute selection on PIDD.
3. Classification is performed using the RBF NN both on the attributes selected from PIDD and on all the attributes.

Figure 2 Block diagram of proposed work

4.1.1 Used diabetes disease dataset


The PIDD was obtained from the UCI Repository of Machine Learning Databases
(UCI Repository of Bioinformatics Databases, http://www.ics.uci.edu./~mlearn/
MLRepository.html). The same dataset has been used in the reference (Kayaer and
Yildirim, 2003; Goncalves et al., 2006; Polat and Gunes, 2007; Kahramanli and
Allahverdi, 2008; Temurtas et al., 2009; Dogantekin et al., 2010; Ganji and Abadeh,
2010; Jayalakshmi and Santhakumaran, 2010; Ephzibah, 2011; Ganji and Abadeh, 2011;
Kala et al., 2011; Karegowda et al., 2011; Lee, 2011; Lukka, 2011; Orkcu and Bal, 2011;
Qasem and Shamsuddin, 2011; Selvakuberan et al., 2011; Karatsiolis and Schizas, 2012;
Aslam et al., 2013; Das et al., 2013; Kalaiselvi and Nasira, 2014; Seera and Lim, 2014;
Choubey and Paul, 2015; Choubey and Paul, 2016). The data were originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and were received on 9 May 1990. All patients in this database are Pima Indian women at least 21 years old living near Phoenix, Arizona, USA. The attributes of this database are given below:
 Number of instances: 768
 Number of attributes: 8
 Attributes:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)

4. Triceps skin fold thickness (mm)


5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/ (height in m)2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
All eight input attributes are numeric-valued, and there is one output variable (class variable) which takes either the value '1' or '0'.
Class distribution (class value 1 is interpreted as "tested positive for diabetes"):
Class value 0: 500 instances (65.1%)
Class value 1: 268 instances (34.9%)
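For readers who want to reproduce the experiments, the dataset can be loaded as below. This is an illustrative Python sketch only: the file name pima-indians-diabetes.csv and the column names are assumptions based on the attribute list above, and the paper's own experiments were carried out in Java with Weka.

```python
import pandas as pd

# Column names assumed from the attribute list above; the raw UCI file has no header row.
columns = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
           "insulin", "bmi", "pedigree", "age", "class"]

# Hypothetical local copy of the UCI Pima Indian Diabetes data in CSV form.
df = pd.read_csv("pima-indians-diabetes.csv", header=None, names=columns)
X = df.drop(columns="class").to_numpy()
y = df["class"].to_numpy()

print(X.shape, y.shape)            # expected: (768, 8) (768,)
print(df["class"].value_counts())  # expected: 500 zeros (65.1%) and 268 ones (34.9%)
```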

4.1.2 GA for attribute selection


The GA is a repetitive process of selection, crossover and mutation in which the population of individuals in each iteration is called a generation. In the genetic analogy, each chromosome (individual) is encoded as a linear string, generally of 0s and 1s, of fixed length. First, the individual members of the population are randomly initialised in the search space. After initialisation, each population member is evaluated with respect to the objective function being solved and is assigned a number (the value of the objective function) representing that individual's fitness for survival. The GA maintains a population of a fixed number of individuals with their corresponding fitness values. In each generation, the fitter individuals (selected from the current population) go into the mating pool for crossover to generate new offspring, so individuals with high fitness are given more chances to generate offspring. Each new offspring is then modified with a very low mutation probability to maintain diversity in the population. Following this, the parents and offspring together form the new generation based on fitness, which is then treated as the parent population for the next generation. In this way, successive generations of individual solutions are expected to improve in terms of average fitness. The algorithm stops forming new generations either when a maximum number of generations has been reached or when a satisfactory fitness value has been achieved for the problem.
The standard pseudocode of the GA is given in the algorithm below:

Algorithm GA
Begin
    q = 0
    Randomly initialise individual members of population P(q)
    Evaluate fitness of each individual of population P(q)
    while termination condition is not satisfied do
        q = q + 1
        selection (of better-fit solutions)
        crossover (mating between parents to generate offspring)
        mutation (random change in offspring)
    end while
    Return best individual in population
End
In the above algorithm, q represents the generation counter; initialisation is done randomly in the search space and the corresponding fitness is evaluated based on the objective function. After that, the GA requires a cycle of three phases: selection, crossover and mutation, each of which is explained in Section 2.1; a compact runnable counterpart of this cycle is sketched below.
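The sketch below puts the three operators together in the generational loop described above. It is an illustrative Python example using the OneMax toy objective (maximise the number of 1-bits in an 8-bit string) rather than the paper's classifier-based fitness.

```python
import random

rng = random.Random(0)
N_BITS, POP_SIZE, GENERATIONS, P_MUT = 8, 20, 30, 0.05

def fitness(ind):                       # OneMax: count of 1-bits (toy objective)
    return sum(ind)

def select(pop):                        # roulette-wheel selection
    total = sum(fitness(i) for i in pop) or 1
    return [rng.choices(pop, weights=[fitness(i) / total for i in pop])[0] for _ in pop]

def crossover(p1, p2):                  # single-point crossover
    cut = rng.randint(1, N_BITS - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(ind):                        # bit-flip mutation
    return [1 - b if rng.random() < P_MUT else b for b in ind]

population = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    parents = select(population)
    offspring = []
    for i in range(0, POP_SIZE, 2):
        c1, c2 = crossover(parents[i], parents[i + 1])
        offspring += [mutate(c1), mutate(c2)]
    population = offspring

best = max(population, key=fitness)
print("best individual:", best, "fitness:", fitness(best))
```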
In the medical domain, diagnosing a disease requires a number of tests to be performed, and only once the test results are available can the diagnosis be made properly. Each test can be considered an attribute. Performing a particular test requires a certain set of chemicals, equipment, possibly staff, and additional time, all of which can be expensive. Attribute selection essentially indicates whether a particular test is necessary for the diagnosis; if a test is not required, it can be avoided. When the number of tests is reduced, the associated cost is also reduced, which helps ordinary people. That is why the GA has been applied here for attribute selection, reducing the eight features to four. It is therefore clear that the GA reduces the cost, storage requirement and computation time by selecting a subset of the attributes; a sketch of such a wrapper-style fitness evaluation is given after this paragraph.
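The attribute-selection idea can be sketched as a wrapper around any classifier: each chromosome is an 8-bit mask over the PIDD attributes, and its fitness is the cross-validated accuracy obtained using only the attributes whose bits are 1. The example below is an illustrative scikit-learn sketch; the paper does not spell out its GA fitness evaluation, and its classifier is Weka's RBF network, for which a k-nearest-neighbour model is substituted here purely for brevity.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y):
    """Fitness of an 8-bit attribute mask = mean CV accuracy on the selected columns."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:                      # an empty mask selects nothing
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)   # stand-in for the RBF NN classifier
    return cross_val_score(clf, X[:, selected], y, cv=5).mean()

# Toy usage with synthetic data shaped like PIDD (768 x 8); replace with the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)
# Mask corresponding to the four GA-selected attributes: glucose, insulin, BMI, age.
print(fitness(np.array([0, 1, 0, 0, 1, 1, 0, 1]), X, y))
```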

4.1.3 RBF NN for classification


The RBF NN is a supervised feed-forward NN with one hidden layer, used here as the classification method. Since the RBF NN is a supervised learning method, it requires a desired response to be trained.
The following steps occur in the RBF NN:
1. Initialisation (select two RBF units).
2. Output calculation (present an input-output pair; calculate the network output and error).
3. Weight updating (delta rule).
4. Centre adaptation (based on the parameters ε1 and ε2).
5. Check for convergence (exit if the network error is less than the threshold; otherwise add an RBF unit and iterate through steps 2 to 5).
The algorithm of the RBF NN is given below:

Algorithm RBF NN

1. Choose two initial centres.

2. Compute the network output:

$$Y = \sum_{j=1}^{p} W_j \, \phi_j(x), \quad \text{where } \phi_j(x) = \exp\left( -\frac{\| x - c_j \|^2}{2\sigma_j^2} \right) \qquad (3)$$

$$Y = \sum_{j=1}^{p} W_j \exp\left( -\frac{\| x - c_j \|^2}{2\sigma_j^2} \right) \qquad (4)$$

3. Calculate the error

$$e = D - Y \qquad (5)$$

where $D$ is the desired output and $Y$ is the actual output.

4. Set the learning parameter $\epsilon_1$ using the heuristic table and compute $\epsilon_2$ from $\epsilon_1$ (6).

5. Move the centres.

Find the Best Matching Unit (BMU), the centre closest to the input:

$$j_{bmu} = \arg\min_j \| x - c_j \| \qquad (7)$$

Move the BMU towards the input:

$$c_{j_{bmu}}(\text{new}) = c_{j_{bmu}}(\text{old}) - \epsilon_1 \left( c_{j_{bmu}} - x \right) \qquad (8)$$

Move the other centres:

$$c_{j_{neighbour}}(\text{new}) = c_{j_{neighbour}}(\text{old}) - \epsilon_2 \left( c_{j_{neighbour}} - x \right) \qquad (9)$$

6. Perform the weight update:

$$W_{ij}(\text{new}) = W_{ij}(\text{old}) + \eta \, (D - Y) \, \phi_j \qquad (10)$$

7. Insert a new centre if the error between successive epochs does not fall below the threshold.

8. Repeat steps 2 to 7 until classification is achieved.

The general methodology involves the division of the database into training and testing datasets: the training dataset is used to train the system and the testing dataset is used to measure its performance.
The working of the RBF NN is summarised in the following steps:
Phase I: Training the RBF NN
Step 1: Collect a dataset.
Step 2: Divide the dataset into training and test sets.
Step 3: Set the training parameters (such as learning rate, momentum, etc.).
Step 4: Train the RBF NN structures.
Step 5: Obtain the accuracy and the weights between the layers.
Phase II: Testing process
Step 1: Obtain the test dataset.
Step 2: Apply the test dataset to the trained RBF NN classifier.
Step 3: Obtain the classification results.
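A minimal, self-contained sketch of the training phase is given below. It is an illustrative NumPy example: it fixes the centres as a simple random choice of training points and trains only the output weights with the delta rule, whereas the paper's algorithm also adapts centres and inserts new units, and the actual experiments used Weka's RBF network.

```python
import numpy as np

def gaussian(X, centres, sigma):
    """Hidden-layer activations phi_j(x) for every row of X."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rbf(X, y, n_centres=10, sigma=2.0, lr=0.5, epochs=500, seed=0):
    """Fix centres as random training points, then learn output weights by the delta rule."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=n_centres, replace=False)]
    w = np.zeros(n_centres)
    phi = gaussian(X, centres, sigma)
    for _ in range(epochs):
        y_hat = phi @ w                           # network output Y
        w += lr * phi.T @ (y - y_hat) / len(X)    # delta rule on the error (D - Y)
    return centres, w

def predict(X, centres, w, sigma=2.0):
    return (gaussian(X, centres, sigma) @ w > 0.5).astype(int)

# Toy usage on synthetic standardised data; with the real PIDD, standardise features first.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)); y = (X[:, 0] + X[:, 2] > 0).astype(int)
centres, w = train_rbf(X, y)
print("training accuracy:", (predict(X, centres, w) == y).mean())
```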

Figure 3 The RBF NN methodology

5 Results and discussion of proposed methodology

The work was implemented on an i3 processor with 2.30 GHz speed, 2 GB RAM and 320 GB external storage; the software used was JDK 8 (Java Development Kit) with the NetBeans 8.0 IDE, and the coding was done in Java. The Weka library was used for the computation of the RBF NN and of the various parameters. In the experimental studies, the dataset was partitioned 70–30% (538 training, 230 test instances) for the training and testing of the RBF NN and GA_RBF NN methods for the diagnosis of diabetes. The experiments were performed on the PIDD described in Section 4.1.1.
The results of the proposed system, i.e., GA_RBF NN and RBF NN, are compared with previous results reported by earlier methods (Ganji and Abadeh, 2011; Seera and Lim, 2014). As Table 2 shows, applying the GA yields 4 attributes out of the original 8. This means that the cost has been reduced from 1 to s(x) = 4/8 = 0.5, an improvement in training and classification by a factor of 2.

Diagnostic performance is usually evaluated in terms of classification accuracy, precision, recall, fallout, F-measure, ROC, confusion matrix, Kappa statistics, Mean Absolute Error (MAE), Root Mean-Squared Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE). These terms are briefly explained below.
Classification accuracy: classification accuracy may be defined as the probability of correctly classifying records in the test dataset, i.e., the ratio of the total number of correctly diagnosed cases to the total number of cases:

$$\text{Classification accuracy}\,(\%) = \frac{TP + TN}{TP + FP + TN + FN} \qquad (11)$$

TP (True Positive): diabetic people correctly detected as diabetic.
FP (False Positive): healthy people incorrectly detected as diabetic.
TN (True Negative): healthy people correctly detected as healthy.
FN (False Negative): diabetic people incorrectly detected as healthy.
Precision: precision measures the rate of correctly classified samples among those predicted as diabetic, i.e., the ratio of the number of correctly classified instances to the total number of instances fetched:

$$\text{Precision} = \frac{\text{No. of correctly classified instances}}{\text{Total no. of instances fetched}} \qquad (12)$$

$$\text{or Precision} = \frac{TP}{TP + FP} \qquad (13)$$

Recall: recall measures the rate of correctly classified samples among those that are actually diabetic, i.e., the ratio of the number of correctly classified instances to the total number of instances in the dataset:

$$\text{Recall} = \frac{\text{No. of correctly classified instances}}{\text{Total no. of instances in the dataset}} \qquad (14)$$

$$\text{or Recall} = \frac{TP}{TP + FN} \qquad (15)$$
Fallout: fallout (the false positive rate) measures the proportion of negative instances that are incorrectly classified as positive during classification.
F-Measure: the F-measure (F-score) combines the information-retrieval precision and recall metrics. It is a trade-off between precision and recall, namely their harmonic mean, and is calculated as follows:

$$F\text{-}Measure = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (16)$$

Area Under the Curve (AUC): the AUC is a metric used to measure the performance of a classifier. It is calculated from the area under the ROC curve based on true positives and false positives:

$$AUC = \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right) \qquad (17)$$
ROC (Receiver Operating Characteristic) analysis is an effective method of evaluating the performance of diagnostic tests. The ROC curve is a plot of the relationship between sensitivity, or True Positive Rate (TPR), on the Y axis and 1-specificity, or False Positive Rate (FPR), on the X axis.

Confusion matrix: a confusion matrix (Polat and Gunes, 2007; Polat et al., 2008) contains information regarding the actual and predicted classifications produced by a classification system.
Kappa statistic: the Kappa statistic measures the agreement between the true classification and the output of the algorithm:

$$K = \frac{T_O - T_C}{1 - T_C} \qquad (18)$$

where $T_O$ is the total agreement probability and $T_C$ is the agreement probability due to chance.
Mean Absolute Error (MAE): the MAE is the average of the absolute errors. It is a quantity used to measure how close forecasts or predictions are to the eventual outcomes and is a common measure of forecast error. The MAE can be compared between models whose errors are measured in the same units; it is usually similar in magnitude to the RMSE, but slightly smaller. It is defined as:

$$MAE = \frac{|t_1 - q_1| + \cdots + |t_n - q_n|}{n} \qquad (19)$$

Root Mean-Squared Error (RMSE): the RMSE is the square root of the mean of the squared errors. It is used to assess how well a system learns a given model and can be compared between models whose errors are measured in the same units. It is defined as:

$$RMSE = \sqrt{\frac{(t_1 - q_1)^2 + \cdots + (t_n - q_n)^2}{n}} \qquad (20)$$
Relative Absolute Error (RAE): like the RSE, the RAE can be compared between models whose errors are measured in different units. It is defined as:

$$RAE = \frac{|t_1 - q_1| + \cdots + |t_n - q_n|}{|\bar{q} - q_1| + \cdots + |\bar{q} - q_n|} \qquad (21)$$

Relative Squared Error (RSE): unlike the RMSE, the RSE can be compared between models whose errors are measured in different units. It is defined as:

$$RSE = \frac{(t_1 - q_1)^2 + \cdots + (t_n - q_n)^2}{(\bar{q} - q_1)^2 + \cdots + (\bar{q} - q_n)^2} \qquad (22)$$

where $q_1, q_2, \ldots, q_n$ are the actual target values, $\bar{q}$ is their mean, and $t_1, t_2, \ldots, t_n$ are the predicted target values.
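A short sketch of equations (19)-(22), illustrative NumPy with made-up target and prediction vectors, shows how the four error measures relate to one another (the RRSE reported in the tables is the square root of the RSE).

```python
import numpy as np

def error_measures(q, t):
    """q = actual target values, t = predicted values (the paper's notation)."""
    q, t = np.asarray(q, float), np.asarray(t, float)
    abs_err = np.abs(t - q)
    sq_err = (t - q) ** 2
    mae  = abs_err.mean()                               # equation (19)
    rmse = np.sqrt(sq_err.mean())                       # equation (20)
    rae  = abs_err.sum() / np.abs(q.mean() - q).sum()   # equation (21)
    rse  = sq_err.sum() / ((q.mean() - q) ** 2).sum()   # equation (22)
    return mae, rmse, rae, np.sqrt(rse)                 # last value is the RRSE

# Made-up class labels and predicted scores, purely to exercise the formulas.
q = [0, 1, 1, 0, 1, 0, 0, 1]
t = [0.2, 0.8, 0.4, 0.1, 0.9, 0.3, 0.2, 0.6]
print(error_measures(q, t))
```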
For the RBF NN method, the time taken to build the model for the training set evaluation was 0.52 seconds and for the testing set evaluation 0.34 seconds. Table 1 shows the results of both the training set and testing set evaluations using the RBF NN method on PIDD in terms of the parameters noted below:

Confusion matrix for the training set:
a b <-- classified as
288 61 | a = tested_negative
83 106 | b = tested_positive

Confusion matrix for the testing set:
a b <-- classified as
128 23 | a = tested_negative
32 47 | b = tested_positive
Table 1 Results of RBF NN for PIDD

Measure Training set evaluation Testing set evaluation


Precision 0.727 0.756
Recall 0.732 0.761
F-Measure 0.728 0.757
Accuracy 73.2342% (0.732342) 76.087% (0.76087)
ROC 0.802 0.813
Kappa statistics 0.3966 0.455
Mean Absolute Error 0.3415 0.3337
Root Mean-Squared Error 0.4136 0.4057
Relative Absolute Error 74.9038% 73.5706%
Root Relative Squared Error 86.6476% 85.4247%
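As a cross-check, the headline training-set figures in Table 1 follow directly from the training-set confusion matrix above. The plain-Python sketch below recomputes accuracy, the class-weighted precision, recall and F-measure, and the Kappa statistic from the four cell counts; it reproduces the reported 73.2342% accuracy, the 0.727/0.732/0.728 precision/recall/F-measure and the 0.3966 Kappa (the precision, recall and F-measure in Table 1 are class-frequency weighted averages, as reported by Weka).

```python
def table1_from_confusion(tn, fp, fn, tp):
    """Recompute Table 1's training-set figures from its confusion matrix cells.

    Rows are actual classes (negative, positive) and columns are predicted,
    so for the training set tn=288, fp=61, fn=83, tp=106."""
    n = tn + fp + fn + tp
    accuracy = (tp + tn) / n                              # equation (11)
    # Per-class precision/recall, then the class-frequency weighted averages.
    p_neg, r_neg = tn / (tn + fn), tn / (tn + fp)
    p_pos, r_pos = tp / (tp + fp), tp / (tp + fn)
    f_neg = 2 * p_neg * r_neg / (p_neg + r_neg)
    f_pos = 2 * p_pos * r_pos / (p_pos + r_pos)
    w_neg, w_pos = (tn + fp) / n, (fn + tp) / n           # class supports
    precision = w_neg * p_neg + w_pos * p_pos
    recall = w_neg * r_neg + w_pos * r_pos
    f_measure = w_neg * f_neg + w_pos * f_pos
    # Kappa statistic, equation (18): observed vs. chance agreement.
    chance = w_neg * (tn + fn) / n + w_pos * (fp + tp) / n
    kappa = (accuracy - chance) / (1 - chance)
    return accuracy, precision, recall, f_measure, kappa

# Training-set confusion matrix from above: expect roughly
# (0.7323, 0.727, 0.732, 0.728, 0.3966), matching Table 1.
print(table1_from_confusion(tn=288, fp=61, fn=83, tp=106))
```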

Figure 4 shows the ROC graph for the tested_positive class obtained using the RBF NN method on PIDD. It can be seen that the method generates a comparatively low error rate.

Figure 4 ROC graph for tested_positive class by using RBF NN method on PIDD

Table 2 shows the attribute selection by using GA on PIDD.


Table 2 GA attribute selection

Pima Indian Diabetes Dataset (without GA): 8 attributes, 768 instances, 2 classes.
Attributes:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure
4. Triceps skin fold thickness
5. 2-hour serum insulin
6. Body mass index
7. Diabetes pedigree function
8. Age (years)

Pima Indian Diabetes Dataset (with GA): 4 attributes, 768 instances, 2 classes.
Attributes:
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
2. 2-hour serum insulin
3. Body mass index
4. Age (years)

For the GA_RBF NN methodology, the time taken to build the model for the training set evaluation was 0.21 seconds and for the testing set evaluation 0.11 seconds. Table 3 shows the results of both the training set and testing set evaluations using the RBF NN method on the attributes of PIDD selected by the GA, in terms of the parameters noted below:
Confusion matrix for the training set:
a b <-- classified as
307 42 | a = tested_negative
84 105 | b = tested_positive

Confusion matrix for the testing set:
a b <-- classified as
133 18 | a = tested_negative
34 45 | b = tested_positive
Table 3 Results of GA_RBF NN for PIDD

Measure Training set evaluation Testing set evaluation


Precision 0.76 0.768
Recall 0.766 0.774
F-Measure 0.758 0.767
Accuracy 76.5799% (0.765799) 77.3913 % (0.773913)
ROC 0.824 0.848

Table 3 Results of GA_RBF NN for PIDD (continued)

Measure Training set evaluation Testing set evaluation


Kappa statistics 0.4586 0.4733
Mean Absolute Error 0.3227 0.3108
Root Mean-Squared Error 0.4026 0.3871
Relative Absolute Error 70.7784% 68.522%
Root Relative Squared Error 84.3383% 81.5054%

Figure 5 shows the ROC graph for the tested_positive class obtained using the GA_RBF NN methodology on PIDD. It indicates that GA_RBF NN generates a lower error rate than the RBF NN in Figure 4.

Figure 5 ROC graph for tested_positive class by using GA_RBF NN methodology on PIDD

Table 4 presents a comparison of the results with and without GA for the RBF NN on PIDD across several measures, together with the other methods noted in the table.
Table 4 Evaluation of RBF NN and GA_RBF NN performance for PIDD along with several existing methods

Methods, in the order the values are listed: J48graft DT (Choubey and Paul, 2015), GA_J48graft DT (Choubey and Paul, 2015), MLP NN (Choubey and Paul, 2016), GA_MLP NN (Choubey and Paul, 2016), RBF NN (this study), GA_RBF NN (this study).

Precision: 0.761, 0.789, 0.781, 0.79, 0.756, 0.768
Recall: 0.765, 0.748, 0.783, 0.791, 0.761, 0.774
F-Measure: 0.762, 0.754, 0.77, 0.78, 0.757, 0.767
Accuracy: 76.5217%, 74.7826%, 78.2609%, 79.1304%, 76.087%, 77.3913%
ROC: 0.765, 0.786, 0.853, 0.842, 0.813, 0.848
Kappa statistics: 0.4665, 0.4901, 0.4769, 0.5011, 0.455, 0.4733
MAE: 0.3353, 0.3117, 0.2716, 0.2984, 0.3337, 0.3108
RMSE: 0.4292, 0.4114, 0.387, 0.387, 0.4057, 0.3871
RAE: 73.9186%, 68.7038%, 59.8716%, 65.7734%, 73.5706%, 68.522%
RRSE: 90.3686%, 86.6146%, 81.4912%, 81.4774%, 85.4247%, 81.5054%

Table 4 shows that, with the GA, an improvement has occurred in every measure in the case of the RBF NN. The methods listed there, i.e., J48graft DT, GA_J48graft DT, MLP NN and GA_MLP NN, were implemented by Choubey and Paul, who reported Precision, Recall, F-Measure, Accuracy and ROC values in their publications but not the Kappa statistics, MAE, RMSE, RAE and RRSE; these methods were therefore re-implemented here to obtain the missing values, i.e., Kappa statistics, MAE, RMSE, RAE and RRSE.
Figure 6 presents the comparison of results with and without GA for the several methods, i.e., J48graft DT, MLP NN and RBF NN, on PIDD.

Figure 6 Evaluation of J48graft DT, GA_J48graft DT, MLP NN, GA_MLP NN, RBF NN and
GA_RBF NN Performance for PIDD

Figure 6 represents the measures of Table 4 in graphical (histogram) form, indicating more clearly the differences between the methods already mentioned in the table.

Table 5 Results and comparison of accuracy with other existing methods for the PIDD

Source                      Method              Accuracy
Luukka (2011)               Sim                 75.29%
                            Sim + F1            75.84%
                            Sim + F2            75.97%
Orkcu and Bal (2011)        Binary-coded GA     74.80%
                            BP                  73.80%
                            Real-coded GA       77.60%
Seera and Lim (2014)        FMM                 69.28%
                            FMM-CART            71.35%
                            FMM-CART-RF         78.39%
Choubey and Paul (2015)     J48graft DT         76.5217%
                            GA_J48graft DT      74.7826%
Choubey and Paul (2016)     MLP NN              78.2609%
                            GA_MLP NN           79.1304%
Our study                   RBF NN              76.087%
                            GA_RBF NN           77.3913%

Several methods have already been implemented on PIDD. Table 5 compares the results in terms of accuracy on PIDD for the diagnosis of diabetes.
Table 6 compares the results in terms of ROC on PIDD for the diagnosis of diabetes. It may be seen in Table 6 that the proposed method provides a better ROC than almost all other existing methods.
Table 6 Results and comparison of ROC with other existing methods for the PIDD

Source                      Method              ROC
Luukka (2011)               Sim                 0.762
                            Sim + F1            0.703
                            Sim + F2            0.667
Seera and Lim (2014)        FMM                 0.661
                            FMM-CART            0.683
                            FMM-CART-RF         0.732
Choubey and Paul (2015)     J48graft DT         0.765
                            GA_J48graft DT      0.786
Choubey and Paul (2016)     MLP NN              0.853
                            GA_MLP NN           0.842
Our study                   RBF NN              0.813
                            GA_RBF NN           0.848

6 Discussion and future directions

Diabetes means that blood sugar is above the desired level on a sustained basis. It is one of the world's most widespread diseases and is nowadays very common. According to the 'Diabetes Atlas 2013' released by the International Diabetes Federation, there are 382 million people in the world with diabetes, and this is projected to increase to 592 million by the year 2035. After China (98.4 million), India has the largest number of individuals with diabetes in the world (65.1 million). Diabetes contributes to blindness, high blood pressure, heart disease, kidney disease and nerve damage, among others, and is hazardous to health. In this paper, classification was first performed on PIDD using the RBF NN; the GA was then used for attribute selection and classification was performed on the selected attributes. The proposed method minimises the computation cost and computation time and maximises the ROC and classification accuracy relative to almost all other existing methods, as may be seen in Tables 5 and 6. From Table 4 it is clear that, with the attribute selection method (GA), an improvement has occurred in every measure. The proposed method will help physicians to make accurate decisions and to work speedily.
For future research work, we suggest developing an expert system for diabetes disease that provides good ROC, classification accuracy, precision, recall, F-measure, Kappa statistics, MAE, RMSE, RAE and RRSE; this can be achieved by using different attribute selection and classification methods, and could significantly decrease healthcare costs via early prediction and diagnosis of diabetes. The proposed method can also be used for other kinds of diseases, although it is not certain that for all medical diseases the results will equal or exceed the existing results in this paper. More interesting results may also arise from further exploration of the dataset.

References
Aslam, M. W., Zhu, Z. and Nandi, A.K. (2013) ‘Feature generation using genetic programming
with comparative partner selection for diabetes classification’, Elsevier: Expert Systems with
Applications, Vol. 40, pp.5402–5412.
Barakat, N.H., Bradley, A.P. and Barakat, M.N.H. (2010) ‘Intelligible support vector machines for
diagnosis of diabetes mellitus’, IEEE Transactions on Information Technology in
Biomedicine, Vol. 14, No. 4, pp.1114–1120.
Choubey, D.K. and Paul, S. (2015) ‘Classification techniques for diagnosis of diabetes disease: a
review’, International Journal of Biomedical Engineering and Technology, ISSN: 1752-6418
(Print), ISSN: 1752-6426.
Choubey, D.K. and Paul, S. (2015) ‘GA_J48graft DT: a hybrid intelligent system for diabetes
disease diagnosis’, SERSC: International Journal of Bio-Science and Bio-Technology, ISSN:
2233-7849, Vol. 7, No. 5, pp.135–150.
Choubey, D.K. and Paul, S. (2016) ‘GA_MLP NN: a hybrid intelligent system for diabetes disease
diagnosis’, MECS: International Journal of Intelligent Systems and Applications, Vol. 8,
No. 1, pp.49–59.
Choubey, D.K., Paul, S. and Bhattacharjee, J. (2014) ‘Soft computing approaches for diabetes
disease diagnosis: a survey’, International Journal of Applied Engineering Research, Vol. 9,
pp.11715–11726.
Das, S., Ghosh, P.K. and Kar, S. (2013) ‘Hypertension diagnosis: a comparative study using fuzzy
expert system and neuro fuzzy system’, IEEE.

Darwin, C. (1859) On the Origins of Species by Means of Natural Selection, Murray, London, UK.
Dogantekin, E., Dogantekin, A., Avci, D. and Avci, L. (2010) ‘An intelligent diagnosis system for
diabetes on linear discriminant analysis and adaptive network based fuzzy inference system:
LDA – ANFIS’, Digital Signal Processing, Vol. 20, No. 4, pp.1248–1255.
Ephzibah, E.P. (2011) ‘Cost effective approach on feature selection using genetic algorithms and
fuzzy logic for diabetes diagnosis,’ International Journal on Soft Computing, Vol. 2, No. 1.
Ganji, M.F. and Abadeh, M.S. (2011) ‘A fuzzy classification system based on ant colony
optimization for diabetes disease diagnosis,’ Elsevier: Expert Systems with Applications,
Vol. 38, pp.14650–14659.
Ganji, M.F. and Abadeh, M.S. (2010) ‘Using fuzzy ant colony optimization for diagnosis of
diabetes disease’, IEEE: Proceedings of ICEE, 11–13 May.
Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA.
Goncalves, L.B., Bernardes, M.M. and Vellasco, R. (2006) ‘Inverted hierarchical neuro-fuzzy BSP
system: a novel neuro-fuzzy model for pattern classification and rule extraction in databases’,
IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews,
Vol. 36, No. 2.
Holland, J.H. (1975) Adaptation in Natural and Artificial Systems, The University of Michigan
Press, Ann Arbor, MI.
Jayalakshmi, T. and Santhakumaran, A. (2010) ‘A novel classification method for diagnosis of
diabetes mellitus using artificial neural networks’, International Conference on Data Storage
and Data Engineering (DSDE), Bangalore, India, pp.159–163.
Kahramanli, H. and Allahverdi, N. (2008) ‘Design of a hybrid system for the diabetes and heart
diseases’, Elsevier: Expert Systems with Applications, Vol. 35, pp.82–89.
Kala, R., Janghel, R.R., Tiwari, R. and Shukla, A. (2011) ‘Diagnosis of breast cancer by modular
evolutionary neural networks’, Inderscience: International Journal of Biomedical Engineering
and Technology, Vol. 7, No. 2, pp.194–211.
Kalaiselvi, C. and Nasira, G.M. (2014) ‘A new approach for diagnosis of diabetes and prediction
of cancer using ANFIS’, IEEE: World Congress on Computing and Communication
Technologies.
Karatsiolis, S. and Schizas, C.N. (2012) ‘Region based support vector machine algorithm for
medical diagnosis on pima indian diabetes dataset’, Proceedings of the IEEE 12th
International Conference on Bioinformatics & Bioengineering (BIBE), Larnaca, Cyprus,
pp.11–13.
Karegowda, A.G., Manjunath, A.S. and Jayaram, M.A. (2011) ‘Application of genetic algorithm
optimized neural network connection weights for medical diagnosis of pima indians diabetes’,
International Journal on Soft Computing, Vol. 2, No. 2.
Kayaer, K. and Yildirim, T. (2003) ‘Medical diagnosis on pima indian diabetes using general
regression neural networks’, IEEE.
Lee, C.-S. (2011) ‘A fuzzy expert system for diabetes decision support application’, IEEE:
Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, Vol. 41, No. 1,
pp.139–153.
Luukka, P. (2011) 'Feature selection using fuzzy entropy measures with similarity classifier',
Elsevier: Expert Systems with Applications, Vol. 38, pp.4600–4607.
Michalewicz, Z. (1996) Genetic Algorithms + Data Structures = Evolution Programs, Springer.
Miller, T. and Leroy, G. (2008) ‘Dynamic generation of a health topics overview from consumer
health information documents’, International Journal of Biomedical Engineering and
Technology, Vol. 1, No. 4, pp.395–414.
Orkcu, H.H. and Bal, H. (2011) ‘Comparing performances of backpropagation and genetic
algorithms in the data classification’, Elsevier: Expert Systems with Applications, Vol. 38,
pp.3703–3709.

Polat, K. and Gunes, S. (2007) ‘An expert system approach based on principal component analysis
and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease’, Elsevier: Digital
Signal Processing, Vol. 17, pp.702–710.
Polat, K., Gunes, S. and Arslan, A. (2008) ‘A cascade learning system for classification of diabetes
disease: generalized discriminant analysis and least square support vector machine’, Elsevier:
Expert Systems with Applications, Vol. 34, pp.482–487.
Qasem, S.N. and Shamsuddin, S.M. (2011) ‘Radial basis function network based on time variant
multi objective particle swarm optimization for medical diseases diagnosis’, Elsevier: Applied
Soft Computing, Vol. 11, pp.1427–1438.
Sarfaraz, A., Bonk, R. and Jenab, K. (2014) ‘A bio-artificial liver reactor evaluation method’,
Inderscience: International Journal of Biomedical Engineering and Technology, Vol. 14,
No. 1, pp.1–12.
Seera, M. and Lim, C.P. (2014) 'A hybrid intelligent system for medical data classification', Elsevier: Expert Systems with Applications, Vol. 41, pp.2239–2249.
Selvakuberan, K., Kayathiri, D., Harini, B. and Devi, M.I. (2011) ‘An efficient feature selection
method for classification in health care systems using machine learning techniques’, IEEE.
Temurtas, H., Yumusak, N. and Temurtas, F. (2009) ‘A comparative study on diabetes disease
diagnosis using neural networks’, Elsevier: Expert Systems With Applications, Vol. 36,
pp.8610–8615.
UCI Repository of Bioinformatics Databases [online] Available online at: http://www.ics.uci.edu./
~mlearn/MLRepository.html

