JOURNAL OF COMPUTING, VOLUME 3, ISSUE 2, FEBRUARY 2011, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG

Predicting Patients with Heart Disease by Using an Improved Back-propagation Algorithm
Nazri Mohd Nawi, Rozaida Ghazali and Mohd Najib Mohd Salleh

Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, MALAYSIA

Abstract— Improving the training efficiency of the Artificial Neural Network algorithm has been studied in many previous works. This paper presents a new approach to improving the training efficiency of back propagation neural network algorithms. The proposed algorithm (GDM/AG) adaptively modifies the gradient based search direction by introducing a gain parameter in the activation function. It is shown that this modification significantly enhances the computational efficiency of the training process. The proposed algorithm is generic and can be implemented in almost all gradient based optimization processes. The robustness of the proposed algorithm is shown by comparing convergence rates and the effectiveness of gradient descent methods using the proposed method on heart disease data.

Index Terms— Back propagation, search direction, adaptive gain, effectiveness, computational efficiency.

——————————  ——————————
1 INTRODUCTION

The back-propagation algorithm has been the most popular and most widely implemented algorithm for training these types of neural network. When using the back-propagation algorithm to train a multilayer neural network, the designer is required to arbitrarily select parameters such as the network topology, initial weights and biases, a learning rate value, the activation function, and a value for the gain in the activation function. Improper selection of any of these parameters can result in slow convergence or even network paralysis, where the training process comes to a virtual standstill. Another problem is the tendency of the steepest descent technique, which is used in the training process, to easily get stuck at local minima.

Recently, improving the training efficiency of back-propagation neural network based algorithms has become an active area of research, and numerous improvements have been proposed in the literature. Early research on back propagation algorithms saw improvements in: (i) selection of better error functions [1-8]; (ii) different choices for activation functions [3, 9]; and (iii) selection of dynamic learning rate and momentum [10-12]. Later, as summarized by Bishop [13], various optimization techniques were suggested for improving the efficiency of the error minimization process, in other words the training efficiency. Among these are the methods of Fletcher and Powell [14] and of Fletcher-Reeves [15], which improve on the conjugate gradient method of Hestenes and Stiefel [16], and the family of Quasi-Newton algorithms proposed by Huang [17]. This research suggests that a simple modification to the gradient based search direction used by almost all the optimization methods summarized by Bishop [13] can substantially improve the training efficiency. The gradient based search direction is locally modified by a gain value used in the activation function of the corresponding node to improve the convergence rate, irrespective of the optimization algorithm used.

The remainder of the paper is organized as follows. Section 2 states the research objectives. Section 3 illustrates the proposed method and its implementation in the gradient descent optimization process. In Section 4, the robustness of the proposed algorithm is shown by comparing convergence rates for gradient descent methods on the Cleveland Heart Disease data. The paper is concluded in the final section, along with a short discussion of further research.
2. RESEARCH OBJECTIVES

This research will demonstrate the robustness of the proposed algorithm by comparing its convergence rates in predicting patients diagnosed with heart disease. The data used in this research is based on the Cleveland Heart Disease data. The proposed algorithm significantly enhances the computational efficiency of the training process. The proposed algorithm is generic and can be implemented in almost all gradient based optimization processes.

This research will improve data mining techniques, particularly Artificial Neural Networks (ANN), in efficiently extracting hidden knowledge (patterns and relationships) associated with heart disease from a historical heart disease database.

3. THE PROPOSED METHOD

In this section, a novel approach for improving the training efficiency of back propagation neural network algorithms is proposed. The proposed algorithm modifies the initial search direction by changing the gain value adaptively for each node. The following subsections describe the algorithm and explore the advantages of using an adaptive gain value. Gain update expressions, as well as weight and bias update expressions for output and hidden nodes, are also proposed. These expressions have been derived using the same principles as used in deriving the weight updating expressions.

The following iterative algorithm is proposed for changing the gradient based search direction using a gain value.
Initialize the initial weight vector with random values and the vector of gain values with unit values. Repeat the following Steps 1 and 2 on an epoch-by-epoch basis until the given error minimization criteria are satisfied.

Step 1. By introducing the gain value into the activation function, calculate the gradient of error with respect to the weights by using Equation (5), and the gradient of error with respect to the gain parameter by using Equation (7).

Step 2. Use the gradient weight vector and the gradient of gain vector calculated in Step 1 to calculate the new weight vector and the vector of new gain values for use in the next epoch.
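As a concrete illustration of the initialization step, a minimal NumPy sketch might look as follows. This is our sketch, not the authors' implementation; the layer sizes mirror the network used later in Section 4 (13 categorical inputs, five hidden nodes, one output), and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Initial weight vectors: random values (Section 4.1 draws them from [0, 1]).
n_in, n_hid, n_out = 13, 5, 1
W1 = rng.uniform(0.0, 1.0, size=(n_in, n_hid))   # input -> hidden
W2 = rng.uniform(0.0, 1.0, size=(n_hid, n_out))  # hidden -> output

# Vector of gain values: initialized to unit values, one gain per node.
c1 = np.ones(n_hid)
c2 = np.ones(n_out)
```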
3.1 Derivation of the expression to calculate gain value

Consider a multilayer feed-forward network, as used in the standard back propagation algorithm [18]. Suppose that for a particular input pattern $o^0$ the desired output is the teacher pattern $t = [t_1 \ldots t_n]^T$, and the actual output is $o_k^L$, where $L$ denotes the output layer. The error function on that pattern is defined as

$$E = \frac{1}{2} \sum_k (t_k - o_k^L)^2 \qquad (1)$$

Let $o_k^s$ be the activation value for the $k$th node of layer $s$, let $o^s = [o_1^s \ldots o_n^s]^T$ be the column vector of activation values in layer $s$, and treat the input layer as layer 0. Let $w_{ij}^s$ be the weight value for the connecting link between the $i$th node in layer $s-1$ and the $j$th node in layer $s$, and let $w_j^s = [w_{1j}^s \ldots w_{nj}^s]^T$ be the column vector of weights from layer $s-1$ to the $j$th node of layer $s$. The net input to the $j$th node of layer $s$ is defined as $net_j^s = \langle w_j^s, o^{s-1} \rangle = \sum_k w_{j,k}^s o_k^{s-1}$, and let $net^s = [net_1^s \ldots net_n^s]^T$ be the column vector of net input values in layer $s$. The activation value for a node is given by a function of its net input and the gain parameter $c_j^s$:

$$o_j^s = f(c_j^s \, net_j^s) \qquad (2)$$

where $f$ is any function with a bounded derivative.
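For instance, with the logistic sigmoid (the activation used in the experiments of Section 4), Equation (2) and its derivative can be sketched as follows; the function names are ours:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: a common choice of f with a bounded derivative."""
    return 1.0 / (1.0 + np.exp(-z))

def activation(net, c):
    """Equation (2): o_j = f(c_j * net_j), with a per-node gain c_j."""
    return sigmoid(c * net)

def activation_deriv(net, c):
    """f'(c * net); for the sigmoid, f'(z) = f(z) * (1 - f(z))."""
    o = sigmoid(c * net)
    return o * (1.0 - o)
```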
This information is now used to derive an expression for modifying gain values for the next epoch. Most gradient based optimization methods use the following gradient descent rule:

$$\Delta w_{ij}^{(n)} = -\eta^{(n)} \frac{\partial E}{\partial w_{ij}^{(n)}} \qquad (3)$$

where $\eta^{(n)}$ is the learning rate value at step $n$, and the gradient based search direction at step $n$ is $d^{(n)} = -\frac{\partial E}{\partial w_{ij}^{(n)}} = g^{(n)}$.

In the proposed method, the gradient based search direction is modified by including the variation of the gain value, to yield

$$d^{(n)} = -\frac{\partial E}{\partial w_{ij}^{(n)}}\left(c_j^{(n)}\right) = g^{(n)}\left(c_j^{(n)}\right) \qquad (4)$$

The derivation of the procedure for calculating the gain value is based on the gradient descent algorithm. The error function as defined in Equation (1) is differentiated with respect to the weight value $w_{ij}^s$. The chain rule yields

$$\frac{\partial E}{\partial w_{ij}^s} = \frac{\partial E}{\partial net^{s+1}} \cdot \frac{\partial net^{s+1}}{\partial o_j^s} \cdot \frac{\partial o_j^s}{\partial net_j^s} \cdot \frac{\partial net_j^s}{\partial w_{ij}^s} = [-\delta_1^{s+1} \ldots -\delta_n^{s+1}] \begin{bmatrix} w_{1j}^{s+1} \\ \vdots \\ w_{nj}^{s+1} \end{bmatrix} f'(c_j^s \, net_j^s) \, c_j^s \, o_i^{s-1} \qquad (5)$$

where $\delta_j^s = -\frac{\partial E}{\partial net_j^s}$. In particular, the first three factors of Equation (5) indicate that the following equation holds:

$$\delta_j^s = \Big( \sum_k \delta_k^{s+1} w_{k,j}^{s+1} \Big) f'(c_j^s \, net_j^s) \, c_j^s \qquad (6)$$

It should be noted that the iterative formula described in Equation (6) to calculate $\delta_j^s$ is the same as that used in the standard back propagation algorithm [18], except for the appearance of the gain value in the expression. The learning rule for calculating weight values as given in Equation (3) is derived by combining (5) and (6).
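A small sketch (ours) of the backward recursion in Equation (6): the only change from standard back propagation is the extra factors involving the gain. The sigmoid derivative is assumed, as in the experiments of Section 4:

```python
import numpy as np

def backprop_delta(delta_next, W_next, net, c):
    """Equation (6): delta_j^s = (sum_k delta_k^{s+1} w_{kj}^{s+1}) * f'(c_j net_j) * c_j.

    delta_next : deltas of layer s+1, shape (n_{s+1},)
    W_next     : weights from layer s to layer s+1, shape (n_s, n_{s+1})
    net, c     : net inputs and gains of layer s, shape (n_s,)
    """
    o = 1.0 / (1.0 + np.exp(-c * net))   # sigmoid activation value
    fprime = o * (1.0 - o)               # bounded derivative f'(c * net)
    return (W_next @ delta_next) * fprime * c
```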

In this approach, the gradient of error with respect to the gain parameter can also be calculated by using the chain rule, as previously described; it is easy to compute as

$$\frac{\partial E}{\partial c_j^s} = \Big( \sum_k \delta_k^{s+1} w_{k,j}^{s+1} \Big) f'(c_j^s \, net_j^s) \, net_j^s \qquad (7)$$

Then the gradient descent rule for the gain value becomes

$$\Delta c_j^s = \eta \, \delta_j^s \, \frac{net_j^s}{c_j^s} \qquad (8)$$

At the end of every epoch the new gain value is updated using a simple gradient based method, as given by the following formula:

$$c_j^{new} = c_j^{old} + \Delta c_j^s \qquad (9)$$
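Equations (8) and (9) translate directly into a one-line update; the following sketch (ours) applies them to a whole layer at once, with illustrative values for the deltas and net inputs:

```python
import numpy as np

def update_gain(c, delta, net, lr):
    """Equations (8)-(9): c_new = c_old + eta * delta * net / c, elementwise."""
    return c + lr * delta * net / c

# Example for a five-node layer with unit initial gains (values illustrative).
c = np.ones(5)
delta = np.array([0.10, -0.05, 0.02, 0.08, -0.01])
net = np.array([0.4, 1.2, -0.3, 0.9, 0.5])
print(update_gain(c, delta, net, lr=0.3))
```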
3.2 Implementation of the proposed method with the gradient descent method

In the gradient descent method, the search direction at each step is given by the local negative gradient of the error function, and the step size is determined by a learning rate parameter. Suppose that at step $n$ of the gradient descent algorithm the current weight vector is $w^n$, and a particular gradient based search direction is $d^n$. The weight vector at step $n+1$ is computed by the following expression:

$$w^{(n+1)} = w^n + \eta^n d^n \qquad (10)$$

where $\eta^n$ is the learning rate value at step $n$.

In the proposed method, namely the Back-propagation Gradient Descent Method with Adaptive Gain Variation (BPGD/AG) [19], the gradient based search direction is calculated at each step by using Equation (4).
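Putting Equations (2)-(10) together, one epoch of a BPGD/AG-style loop can be sketched as below. This is our minimal NumPy illustration, not the authors' MATLAB implementation; random toy data stands in for the encoded heart disease records, while the learning rate, momentum, architecture and target error follow the experimental setup reported in Section 4:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(z):
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid activation

# Toy stand-in for the encoded records: 13 categorical inputs, binary Diagnosis.
X = rng.integers(0, 2, size=(20, 13)).astype(float)
T = rng.integers(0, 2, size=(20, 1)).astype(float)

lr, momentum = 0.3, 0.7                       # values used in Section 4.1
W1 = rng.uniform(0, 1, (13, 5)); W2 = rng.uniform(0, 1, (5, 1))
c1 = np.ones(5); c2 = np.ones(1)              # unit initial gains
V1 = np.zeros_like(W1); V2 = np.zeros_like(W2)  # momentum terms

for epoch in range(1000):
    E = 0.0
    for x, t in zip(X, T):
        # Forward pass with gain-scaled activations, Equation (2).
        net1 = W1.T @ x;  o1 = f(c1 * net1)
        net2 = W2.T @ o1; o2 = f(c2 * net2)
        E += 0.5 * np.sum((t - o2) ** 2)      # Equation (1)

        # Deltas: output layer from Equation (1), hidden layer via Equation (6).
        d2 = (t - o2) * o2 * (1 - o2) * c2
        d1 = (W2 @ d2) * o1 * (1 - o1) * c1

        # Weight step along the gain-modified search direction,
        # Equations (4), (5) and (10), with a momentum term (GDM/AG).
        V1 = momentum * V1 + lr * np.outer(x, d1)
        V2 = momentum * V2 + lr * np.outer(o1, d2)
        W1 += V1; W2 += V2

        # Adaptive gain update, Equations (8) and (9).
        c1 += lr * d1 * net1 / c1
        c2 += lr * d2 * net2 / c2
    if E / len(X) < 0.05:                     # target error used in Table 3
        break
```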
4. RESULTS AND DISCUSSIONS

4.1 Preliminaries

The performance criteria used to assess the results of the proposed method focus on: (i) the speed of convergence, measured in the number of iterations as well as the corresponding CPU time, and (ii) the effectiveness of the models, measured by the highest percentage of correct predictions for diagnosing patients with heart disease.

A total of 909 records with 15 attributes (factors) were obtained from the Cleveland Heart Disease database [20]. The records were split into two datasets: a training dataset (545 records) and a testing dataset (364 records). The records for each set were selected randomly in order to avoid bias. For consistency, only categorical attributes were used for the neural network model. All the medical attributes in Table 1 were transformed from numerical to categorical data. The attribute "Diagnosis" was identified as the predictable attribute, with value '1' for patients with heart disease and value '0' for patients with no heart disease. The attribute 'PatientID' was used as the key; the rest were used as input attributes. It is assumed that missing, inconsistent and duplicate data have been resolved.

TABLE 1. Description of attributes

Predictable Attribute
1. Diagnosis (value 0: <50% diameter narrowing (no heart disease); value 1: >50% diameter narrowing (has heart disease))

Key Attribute
1. PatientID – patient's identification number

Input Attributes
1. Sex (value 1: male; value 0: female)
2. Chest Pain Type (value 1: typical type 1 angina; value 2: typical type angina; value 3: non-angina pain; value 4: asymptomatic)
3. Fasting Blood Sugar (value 1: >120 mg/dl; value 0: <120 mg/dl)
4. Restecg – resting electrocardiographic results (value 0: normal; value 1: having ST-T wave abnormality; value 2: showing probable or definite left ventricular hypertrophy)
5. Exang – exercise induced angina (value 1: yes; value 0: no)
6. Slope – the slope of the peak exercise ST segment (value 1: upsloping; value 2: flat; value 3: downsloping)
7. CA – number of major vessels colored by fluoroscopy (values 0-3)
8. Thal (value 3: normal; value 6: fixed defect; value 7: reversible defect)
9. Trest Blood Pressure (mm Hg on admission to the hospital)
10. Serum Cholesterol (mg/dl)
11. Thalach – maximum heart rate achieved
12. Oldpeak – ST depression induced by exercise relative to rest
13. Age in years
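As an illustration of this numerical-to-categorical mapping, a toy encoder for a few of the Table 1 attributes might look as follows; the raw field names are hypothetical (ours), but the thresholds come from Table 1:

```python
def encode_record(rec):
    """Map raw Cleveland values to the categorical codes of Table 1 (sketch)."""
    return {
        "Sex": 1 if rec["is_male"] else 0,
        "FastingBloodSugar": 1 if rec["fbs_mg_dl"] > 120 else 0,  # >120 mg/dl -> 1
        "Exang": 1 if rec["exercise_angina"] else 0,
        "Diagnosis": 1 if rec["narrowing_pct"] > 50 else 0,       # >50% -> disease
    }

print(encode_record({"is_male": True, "fbs_mg_dl": 130.0,
                     "exercise_angina": False, "narrowing_pct": 40.0}))
# {'Sex': 1, 'FastingBloodSugar': 1, 'Exang': 0, 'Diagnosis': 0}
```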
The simulations were carried out on a Pentium IV 3 GHz PC with 1 GB RAM, using MATLAB version 6.5.0 (R13). The following three algorithms were analyzed and simulated on the datasets:
1) The standard gradient descent with momentum (traingdm) from the MATLAB Neural Network Toolbox version 4.0.1.

2) The standard gradient descent with momentum (GDM).
3) The gradient descent with momentum and adaptive gain (GDM/AG).

For comparison with other standard optimization algorithms from the MATLAB Neural Network Toolbox, network parameters such as network size and architecture (number of nodes, hidden layers, etc.) and the values for the initial weights and gain parameters were kept the same. The research focused only on a neural network with one hidden layer of five hidden nodes, and a sigmoid activation function was used for all nodes. All algorithms were tested using the same initial weights, initialized randomly from the range [0, 1], and received the input patterns for training in the same sequence.

For the gradient descent algorithms, the learning rate value was 0.3 and the momentum term value was 0.7. The initial value used for the gain parameter was one. For each run, the numerical data is stored in two files: the results file and the summary file. The results file lists data about each network. The number of iterations until convergence is accumulated for each algorithm, from which the mean, the standard deviation and the number of failures are calculated. The networks that fail to converge are excluded from the calculations of the mean and standard deviation but are reported as failures.
4.2 Validating Algorithm Effectiveness disease
The effectiveness of each algorithm was tested using a classification matrix, which displays the frequency of correct and incorrect predictions by comparing the actual values in the test dataset with the values predicted by the trained algorithm. In this example, the test dataset contained 208 patients with heart disease and 246 patients without heart disease. Figure 1 shows the results of the classification matrix for all three algorithms. The rows represent predicted values, while the columns represent actual values ('1' for patients with heart disease, '0' for patients with no heart disease). The leftmost column shows the values predicted by the algorithms. The diagonal values show correct predictions.

Counts for traingdm
Predicted    0 (Actual)    1 (Actual)
0            220           62
1            26            146

Counts for GDM
Predicted    0 (Actual)    1 (Actual)
0            211           24
1            35            184

Counts for GDM/AG
Predicted    0 (Actual)    1 (Actual)
0            211           20
1            35            188

Fig. 1. Results of the classification matrix for the three algorithms.
TABLE 2. Algorithms Results

Model      Description                                                            No. of cases   Prediction
traingdm   Patients with heart disease, predicted as having heart disease         146            Correct
           Patients with no heart disease, predicted as having heart disease      26             Incorrect
           Patients with no heart disease, predicted as having no heart disease   220            Correct
           Patients with heart disease, predicted as having no heart disease      62             Incorrect
GDM        Patients with heart disease, predicted as having heart disease         184            Correct
           Patients with no heart disease, predicted as having heart disease      35             Incorrect
           Patients with no heart disease, predicted as having no heart disease   211            Correct
           Patients with heart disease, predicted as having no heart disease      24             Incorrect
GDM/AG     Patients with heart disease, predicted as having heart disease         188            Correct
           Patients with no heart disease, predicted as having heart disease      35             Incorrect
           Patients with no heart disease, predicted as having no heart disease   211            Correct
           Patients with heart disease, predicted as having no heart disease      20             Incorrect
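The per-class accuracies discussed below can be recomputed directly from the Figure 1 / Table 2 counts; a short sketch (ours):

```python
# Confusion-matrix counts from Figure 1 (rows: predicted 0/1, columns: actual 0/1).
counts = {
    "traingdm": [[220, 62], [26, 146]],
    "GDM":      [[211, 24], [35, 184]],
    "GDM/AG":   [[211, 20], [35, 188]],
}

for name, ((tn, fn), (fp, tp)) in counts.items():
    disease = tp / (tp + fn)   # share of the 208 heart-disease patients caught
    healthy = tn / (tn + fp)   # share of the 246 healthy patients identified
    print(f"{name:8s}  heart disease: {disease:.1%}   no heart disease: {healthy:.1%}")

# GDM/AG finds 188/208 = 90.4% of heart-disease patients (reported as 90.3%);
# traingdm is best on healthy patients with 220/246 = 89.4%.
```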

Table 2 summarizes the results of all three algorithms. The proposed algorithm (GDM/AG) appears to be the most effective, as it has the highest percentage of correct predictions (90.3%) for patients with heart disease, followed by GDM (with a difference of less than 2%) and traingdm. However, traingdm appears to be the most effective for predicting patients with no heart disease (89.4%) compared with the other algorithms.
4.3 Verification of Algorithm Convergence Speed

For each training dataset, 100 different trials were run, each with a different initial random set of weights. For each run, the number of iterations required for convergence is reported. For an experiment of 100 runs, the mean of the number of iterations, the standard deviation, and the number of failures are collected. A failure occurs when the network exceeds the maximum iteration limit: each experiment is run to one thousand iterations, except for back propagation, which is run to ten thousand iterations; otherwise it is halted and the run is reported as a failure. Convergence is achieved when the outputs of the network conform to the error criterion as compared with the desired outputs.
[Fig. 2. 3D plot of the results (mean number of epochs, standard deviation, and number of failures) for traingdm, GDM and GDM/AG on the heart disease classification problem.]

Figure 2 shows the 3D plot for the results of the Heart Disease classification problem. The proposed algorithm (GDM/AG) shows better results because it converges in a smaller number of epochs, as suggested by the low value of the mean. Furthermore, the number of failures for GDM/AG is lower compared with the other two algorithms. This makes the GDM/AG algorithm a better choice for this problem, since it had only 3 failures in the 100 different runs.
TABLE 3. Algorithm results for the heart disease classification problem (target error = 0.05)

Algorithm   CPU time (s)/epoch   Total CPU time (s)   Generalization   Mean epochs   Number of
                                 to converge          accuracy (%)     to converge   failures
traingdm    2.69 x 10^-2         43.30                94.01            3467          14
GDM         4.59 x 10^-2         34.60                94.32            989           4
GDM/AG      3.69 x 10^-2         21.42                94.45            487           3

Table 3 shows that the proposed algorithm (GDM/AG) outperforms the other algorithms in terms of CPU time and number of epochs. The proposed algorithm (GDM/AG) required only 487 epochs and 21.4203 seconds of CPU time to reach the target error, whereas GDM required 989 epochs and 34.5935 seconds of CPU time. The success rate of the proposed algorithm (GDM/AG) in learning the patterns was 97%, against 96% for GDM. Furthermore, the average number of learning iterations for the proposed algorithm was reduced such that it converged up to 2.03 times faster than GDM (989 versus 487 mean epochs).
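The speed-up and success-rate figures quoted above follow directly from Table 3; a quick check (ours):

```python
epochs = {"traingdm": 3467, "GDM": 989, "GDM/AG": 487}   # mean epochs, Table 3
failures = {"traingdm": 14, "GDM": 4, "GDM/AG": 3}       # failures per 100 runs

print(f"GDM / GDM-AG epoch ratio: {epochs['GDM'] / epochs['GDM/AG']:.2f}")  # ~2.03
for name in epochs:
    print(f"{name:8s} success rate: {100 - failures[name]}%")  # GDM/AG: 97%
```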

5 CONCLUSION

A novel approach has been presented in this paper for improving the training efficiency of back propagation neural network algorithms by adaptively modifying the initial search direction. The proposed algorithm uses the gain value to modify the initial search direction. The proposed algorithm is generic and can be implemented in all commonly used gradient based optimization processes. Classification matrix methods were used to evaluate the effectiveness and the convergence speed of the proposed algorithm. All three algorithms were able to extract patterns in response to the predictable state. The most effective algorithm for predicting patients who are likely to have heart disease appears to be the proposed method (GDM/AG), followed by the other two algorithms. The results showed that the proposed algorithm is stable and has the potential to significantly enhance the computational efficiency of the training process.
ACKNOWLEDGEMENT

This work was supported by Universiti Tun Hussein Onn Malaysia (UTHM).
REFERENCES

[1] A. van Ooyen and B. Nienhuis, Improving the convergence of the back-propagation algorithm. Neural Networks, 1992. 5: p. 465-471.
[2] M. Ahmad and F. M. A. Salam, Supervised learning using the Cauchy energy function. International Conference on Fuzzy Logic and Neural Networks, 1992.
[3] P. Chandra and Y. Singh, An activation function adapting training algorithm for sigmoidal feedforward networks. Neurocomputing, 2004. 61: p. 429-437.
[4] A. Krzyzak, W. Dai, and C. Y. Suen, Classification of large set of handwritten characters using modified back propagation model. Proceedings of the International Joint Conference on Neural Networks, 1990. 3: p. 225-232.
[5] S. H. Oh, Improving the error backpropagation algorithm with a modified error function. IEEE Transactions on Neural Networks, 1997. 8(3): p. 799-803.
[6] H.-M. Lee, T.-C. Huang, and C.-M. Chen, Learning efficiency improvement of back propagation algorithm by error saturation prevention method. IJCNN '99, 1999. 3: p. 1737-1742.
[7] S.-H. Oh and Y. Lee, A modified error function to improve the error back-propagation algorithm for multi-layer perceptrons. ETRI Journal, 1995. 17(1): p. 11-22.
[8] S. M. Shamsuddin, M. Darus, and M. N. Sulaiman, Classification of reduction invariants with improved back propagation. IJMMS, 2002. 30(4): p. 239-247.
[9] S. C. Ng, et al., Fast convergence for back propagation network with magnified gradient function. Proceedings of the International Joint Conference on Neural Networks 2003, 2003. 3: p. 1903-1908.
[10] R. A. Jacobs, Increased rates of convergence through learning rate adaptation. Neural Networks, 1988. 1: p. 295-307.
[11] M. K. Weir, A method for self-determination of adaptive learning rates in back propagation. Neural Networks, 1991. 4: p. 371-379.
[12] X. H. Yu, G. A. Chen, and S. X. Cheng, Acceleration of backpropagation learning using optimized learning rate and momentum. Electronics Letters, 1993. 29(14): p. 1288-1289.
[13] C. M. Bishop, Neural Networks for Pattern Recognition. 1995: Oxford University Press.
[14] R. Fletcher and M. J. D. Powell, A rapidly convergent descent method for minimization. British Computer J., 1963: p. 163-168.
[15] R. Fletcher and R. M. Reeves, Function minimization by conjugate gradients. Comput. J., 1964. 7(2): p. 149-160.
[16] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems. J. Research NBS, 1952. 49: p. 409.
[17] H. Y. Huang, A unified approach to quadratically convergent algorithms for function minimization. J. Optim. Theory Appl., 1970. 5: p. 405-423.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland (eds), Parallel Distributed Processing, 1986. 1: p. 318-362.
[19] N. M. Nawi, M. R. Ransing, and R. S. Ransing, An improved conjugate gradient based learning algorithm for back propagation neural networks. International Journal of Computational Intelligence, March 2007. 4(1): p. 46-55.
[20] C. L. Blake, UCI Machine Learning Databases, http://mlearn.ics.uci.edu/database/heart-disease/, 2004.
Nazri Mohd Nawi received his B.S. degree in Computer Science from University of Science Malaysia (USM), Penang, Malaysia. His M.Sc. degree in computer science was received from University of Technology Malaysia (UTM), Skudai, Johor, Malaysia. He received his Ph.D. degree in the Mechanical Engineering department, Swansea University, Wales Swansea. He is currently a senior lecturer in the Software Engineering Department at Universiti Tun Hussein Onn Malaysia (UTHM). His research interests are in optimization, data mining techniques and neural networks.

Rozaida Ghazali received her B.Sc. (Hons) degree in Computer Science from Universiti Sains Malaysia, and M.Sc. degree in Computer Science from Universiti Teknologi Malaysia. She obtained her Ph.D. degree in Higher Order Neural Networks for Financial Time Series Prediction at Liverpool John Moores University, UK. She is currently a senior lecturer at the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM). Her research areas include neural networks, financial time series prediction and physical time series forecasting.

Mohd Najib Mohd Salleh has been a senior lecturer at the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), since 2001. He holds a bachelor's degree in Computer Science from Universiti Putra Malaysia (UPM). He received a Master's degree in Computer Science in Information Systems from Universiti Teknologi Malaysia (UTM). He obtained his Ph.D. in decision tree modeling with incomplete information in classification tasks in Data Mining from Université de La Rochelle, France. His research interests include rough set theory, artificial intelligence in data mining and knowledge discovery.