JOURNAL OF COMPUTING, VOLUME 3, ISSUE 2, FEBRUARY 2011, ISSN 2151-9617, WWW.JOURNALOFCOMPUTING.ORG

Predicting Patients with Heart Disease by Using an Improved Back-propagation Algorithm
Nazri Mohd Nawi, Rozaida Ghazali and Mohd Najib Mohd Salleh
Abstract— Many previous works have studied ways of improving the training efficiency of Artificial Neural Network algorithms. This paper presents a new approach to improving the training efficiency of back propagation neural network algorithms. The proposed algorithm (GDM/AG) adaptively modifies the gradient-based search direction by introducing a gain parameter in the activation function. It is shown that this modification significantly enhances the computational efficiency of the training process. The proposed algorithm is generic and can be implemented in almost all gradient-based optimization processes. The robustness of the proposed algorithm is shown by comparing convergence rates and the effectiveness of gradient descent methods using the proposed method on heart disease data.

Index Terms— Back propagation, search direction, adaptive gain, effectiveness, computational efficiency.

——————————  ——————————

1. INTRODUCTION

The back-propagation algorithm has been the most popular and most widely implemented algorithm for training multilayer feed-forward neural networks. When using back-propagation to train such a network, the designer is required to arbitrarily select parameters such as the network topology, initial weights and biases, a learning rate value, the activation function, and a value for the gain in the activation function. Improper selection of any of these parameters can result in slow convergence or even network paralysis, where the training process comes to a virtual standstill. Another problem is that the steepest descent technique used in the training process can easily become stuck in local minima.

Improving the training efficiency of back-propagation-based algorithms is an active area of research, and numerous papers have been proposed in the literature. Early research on back-propagation saw improvements in: (i) the selection of better error functions [1-8]; (ii) different choices of activation function [3, 9]; and (iii) the selection of dynamic learning rate and momentum [10-12]. Later, as summarized by Bishop [13], various optimization techniques were suggested for improving the efficiency of the error minimization process, in other words the training efficiency. Among these are the methods of Fletcher and Powell [14] and of Fletcher and Reeves [15], which improve the conjugate gradient method of Hestenes and Stiefel [16], and the family of quasi-Newton algorithms proposed by Huang [17].

This research suggests that a simple modification to the gradient-based search direction used by almost all of the optimization methods summarized by Bishop [13] can substantially improve training efficiency. The gradient-based search direction is locally modified by a gain value used in the activation function of the corresponding node, improving the convergence rate irrespective of the optimization algorithm used.

The remainder of the paper is organized as follows: Section 2 states the research objectives. Section 3 illustrates the proposed method and its implementation in the gradient descent optimization process. In Section 4, the robustness of the proposed algorithm is shown by comparing convergence rates of gradient descent methods on the Cleveland Heart Disease data. The paper is concluded in the final section, along with a short discussion of further research.

2. RESEARCH OBJECTIVES
This research demonstrates the robustness of the proposed algorithm by comparing its convergence rates when predicting patients diagnosed with heart disease. The data used in this research are based on the Cleveland Heart Disease data. The proposed algorithm significantly enhances the computational efficiency of the training process. It is generic and can be implemented in almost all gradient-based optimization processes. This research also improves data mining techniques, particularly Artificial Neural Networks (ANN), for efficiently extracting hidden knowledge (patterns and relationships) associated with heart disease from a historical heart disease database.

• N. Mohd Nawi, R. Ghazali, and M. N. Mohd Salleh are with Universiti Tun Hussein Onn Malaysia, 86400, Parit Raja, Batu Pahat, MALAYSIA.

3. THE PROPOSED METHOD

In this section, a novel approach for improving the training efficiency of back propagation neural network algorithms is proposed. The proposed algorithm modifies the initial search direction by adaptively changing the gain value for each node. The following subsection describes the algorithm: the advantages of using an adaptive gain value are explored, and gain update expressions, as well as weight and bias update expressions for output and hidden nodes, are proposed. These expressions are derived using the same principles as those used in deriving the weight update expressions.

The following iterative algorithm changes the gradient-based search direction using a gain value. Initialize the weight vector with random values and the vector of gain values with unit values, then repeat the following two steps on an epoch-by-epoch basis until the given error minimization criteria are satisfied.

Step 1. With the gain value introduced into the activation function, calculate the gradient of the error with respect to the weights using Equation (5) and the gradient of the error with respect to the gain parameter using Equation (7).

Step 2. Use the gradient vectors calculated in Step 1 to compute the new weight vector and the vector of new gain values for use in the next epoch.

3.1 Derivation of the expression to calculate the gain value
Consider a multilayer feed-forward network, as used in the standard back propagation algorithm [18]. Suppose that for a particular input pattern $o^0$ the desired output is the teacher pattern $t = [t_1 \ldots t_n]^T$ and the actual output is $o_k^L$, where $L$ denotes the output layer. The error function on that pattern is defined as

$$E = \frac{1}{2} \sum_k \big(t_k - o_k^L\big)^2 \qquad (1)$$

Let $o_k^s$ be the activation value of the $k$th node of layer $s$, let $o^s = [o_1^s \ldots o_n^s]^T$ be the column vector of activation values in layer $s$, and identify the input layer as layer 0. Let $w_{ij}^s$ be the weight on the connecting link between the $i$th node in layer $s-1$ and the $j$th node in layer $s$, and let $w_j^s = [w_{1j}^s \ldots w_{nj}^s]^T$ be the column vector of weights from layer $s-1$ to the $j$th node of layer $s$. The net input to the $j$th node of layer $s$ is defined as $net_j^s = \langle w_j^s, o^{s-1} \rangle = \sum_k w_{kj}^s o_k^{s-1}$, and let $net^s = [net_1^s \ldots net_n^s]^T$ be the column vector of net input values in layer $s$. The activation value of a node is given by a function of its net input and the gain parameter $c_j^s$:

$$o_j^s = f\big(c_j^s\, net_j^s\big) \qquad (2)$$

where $f$ is any function with a bounded derivative. This information is now used to derive an expression for modifying the gain values for the next epoch. Most gradient-based optimization methods use the following gradient descent rule:

$$\Delta w_{ij}^{(n)} = -\eta^{(n)} \frac{\partial E}{\partial w_{ij}^{(n)}} \qquad (3)$$

where $\eta^{(n)}$ is the learning rate value at step $n$, and the gradient-based search direction at step $n$ is $d^{(n)} = -\frac{\partial E}{\partial w_{ij}^{(n)}} = g^{(n)}$. In the proposed method, the gradient-based search direction is modified by including the variation of the gain value to yield

$$d^{(n)} = -\frac{\partial E}{\partial w_{ij}^{(n)}}\big(c_j^{(n)}\big) = g^{(n)}\big(c_j^{(n)}\big) \qquad (4)$$

The derivation of the procedure for calculating the gain value is based on the gradient descent algorithm. The error function as defined in Equation (1) is differentiated with respect to the weight value $w_{ij}^s$. The chain rule yields

$$\frac{\partial E}{\partial w_{ij}^s} = \frac{\partial E}{\partial net^{s+1}} \cdot \frac{\partial net^{s+1}}{\partial o_j^s} \cdot \frac{\partial o_j^s}{\partial net_j^s} \cdot \frac{\partial net_j^s}{\partial w_{ij}^s} = \big[-\delta_1^{s+1} \ldots -\delta_n^{s+1}\big] \begin{bmatrix} w_{1j}^{s+1} \\ \vdots \\ w_{nj}^{s+1} \end{bmatrix} f'\big(c_j^s\, net_j^s\big)\, c_j^s\, o_i^{s-1} \qquad (5)$$

where $\delta_j^s = -\frac{\partial E}{\partial net_j^s}$. In particular, the first three factors of Equation (5) indicate that the following equation holds:

$$\delta_j^s = \Big(\sum_k \delta_k^{s+1} w_{kj}^{s+1}\Big) f'\big(c_j^s\, net_j^s\big)\, c_j^s \qquad (6)$$

It should be noted that the iterative formula in Equation (6) for calculating $\delta_j^s$ is the same as in the standard back propagation algorithm [18], except for the appearance of the gain value in the expression. The learning rule for calculating the weight values, as given in Equation (3), is derived by combining Equations (5) and (6).


In this approach, the gradient of the error with respect to the gain parameter can also be calculated using the chain rule, as previously described; it is easily computed as

$$\frac{\partial E}{\partial c_j^s} = -\Big(\sum_k \delta_k^{s+1} w_{kj}^{s+1}\Big) f'\big(c_j^s\, net_j^s\big)\, net_j^s \qquad (7)$$

The gradient descent rule for the gain value then becomes

$$\Delta c_j^s = \eta\, \delta_j^s\, \frac{net_j^s}{c_j^s} \qquad (8)$$

At the end of every epoch, the new gain value is updated using a simple gradient-based method, as given by the following formula:

$$c_j^{new} = c_j^{old} + \Delta c_j^s \qquad (9)$$
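To make these update rules concrete, the following is a minimal NumPy sketch of a single training step under Equations (1)-(9), assuming a sigmoid activation (so that f'(c·net) = o(1 - o)), one hidden layer of five nodes, and per-pattern updates; the paper itself applies the updates on an epoch-by-epoch basis, and all names and sizes here are illustrative rather than the authors' actual MATLAB implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    """Sigmoid activation; for this choice f'(z) = f(z) * (1 - f(z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes matching the experiments below: 13 inputs, 5 hidden
# nodes, 1 output. Weights start in [0, 1], gains start at unity.
params = {
    "W1": rng.uniform(0, 1, (5, 13)), "b1": np.zeros(5), "c1": np.ones(5),
    "W2": rng.uniform(0, 1, (1, 5)),  "b2": np.zeros(1), "c2": np.ones(1),
}

def train_pattern(p, x, t, eta=0.3):
    """One gradient step with adaptive gain for a single pattern (x, t).
    Bias terms are included for completeness; the derivation above treats
    them as ordinary weights. Returns the pattern error E of Equation (1)."""
    # Forward pass, Equation (2): o_j = f(c_j * net_j)
    net1 = p["W1"] @ x + p["b1"]
    o1 = f(p["c1"] * net1)
    net2 = p["W2"] @ o1 + p["b2"]
    o2 = f(p["c2"] * net2)
    E = 0.5 * np.sum((t - o2) ** 2)                  # Equation (1)
    # Deltas (delta = -dE/dnet); for the sigmoid, f'(c*net) = o * (1 - o)
    d2 = (t - o2) * o2 * (1 - o2) * p["c2"]          # output layer
    d1 = (p["W2"].T @ d2) * o1 * (1 - o1) * p["c1"]  # Equation (6)
    # Weight and bias updates from Equations (3) and (5)
    p["W2"] += eta * np.outer(d2, o1); p["b2"] += eta * d2
    p["W1"] += eta * np.outer(d1, x);  p["b1"] += eta * d1
    # Gain updates from Equations (8) and (9): delta_c = eta * delta * net / c
    p["c2"] += eta * d2 * net2 / p["c2"]
    p["c1"] += eta * d1 * net1 / p["c1"]
    return E
```

A quick finite-difference check (perturbing one gain value by a small epsilon and re-evaluating E) is a useful way to confirm that the gain gradient of Equation (7) is implemented correctly.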

3.2 Implementation of the proposed method with the gradient descent method
In the gradient descent method, the search direction at each step is given by the local negative gradient of the error function, and the step size is determined by a learning rate parameter. Suppose that at step $n$ of the gradient descent algorithm the current weight vector is $w^{(n)}$ and a particular gradient-based search direction is $d^{(n)}$. The weight vector at step $n+1$ is computed by the following expression:

$$w^{(n+1)} = w^{(n)} + \eta^{(n)} d^{(n)} \qquad (10)$$

where $\eta^{(n)}$ is the learning rate value at step $n$. In the proposed method, named the Back-propagation Gradient Descent Method with Adaptive Gain Variation (BPGD/AG) [19], the gradient-based search direction is calculated at each step using Equation (4). A sketch of this outer loop is given below.
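The following is a minimal sketch of the epoch loop of Equation (10), reusing train_pattern and params from the earlier sketch; the stopping criterion of 0.05 matches the target error used in the experiments of Section 4, and max_epochs mirrors the iteration limit described in Section 4.3.

```python
def train(p, X, T, eta=0.3, target_error=0.05, max_epochs=1000):
    """Epoch-by-epoch training: w(n+1) = w(n) + eta * d(n), Equation (10).
    Returns the number of epochs needed, or None if the run fails to converge."""
    for epoch in range(1, max_epochs + 1):
        # Mean pattern error over the epoch, using Equation (1) per pattern.
        mse = sum(train_pattern(p, x, t, eta) for x, t in zip(X, T)) / len(X)
        if mse <= target_error:
            return epoch
    return None  # counted as a failure in the experiments of Section 4.3
```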

4. RESULTS AND DISCUSSION

4.1 Preliminaries
The performance criteria used to assess the results of the proposed method focus on (i) the speed of convergence, measured in the number of iterations as well as the corresponding CPU time, and (ii) the effectiveness of the models, measured as the highest percentage of correct predictions when diagnosing patients with heart disease.

A total of 909 records with 15 attributes (factors) were obtained from the Cleveland Heart Disease database [20]. The records were split into two datasets: a training dataset (545 records) and a testing dataset (364 records). The records for each set were selected randomly in order to avoid bias. For consistency, only categorical attributes were used for the neural network models; all the medical attributes in Table 1 were transformed from numerical to categorical data. The attribute "Diagnosis" was identified as the predictable attribute, with value '1' for patients with heart disease and value '0' for patients with no heart disease. The attribute "PatientID" was used as the key; the rest were used as input attributes. It is assumed that missing, inconsistent, and duplicate data have been resolved.

TABLE 1. Description of attributes

Predictable attribute:
1. Diagnosis (value 0: < 50% diameter narrowing, no heart disease; value 1: > 50% diameter narrowing, has heart disease)

Key attribute:
1. PatientID - patient's identification number

Input attributes:
1. Sex (value 1: male; value 0: female)
2. Chest pain type (value 1: typical type 1 angina; value 2: typical type angina; value 3: non-angina pain; value 4: asymptomatic)
3. Fasting blood sugar (value 1: > 120 mg/dl; value 0: < 120 mg/dl)
4. Restecg - resting electrocardiographic results (value 0: normal; value 1: having ST-T wave abnormality; value 2: showing probable or definite left ventricular hypertrophy)
5. Exang - exercise-induced angina (value 1: yes; value 0: no)
6. Slope - the slope of the peak exercise ST segment (value 1: upsloping; value 2: flat; value 3: downsloping)
7. CA - number of major vessels colored by fluoroscopy (value 0-3)
8. Thal (value 3: normal; value 6: fixed defect; value 7: reversible defect)
9. Trest blood pressure (mm Hg on admission to the hospital)
10. Serum cholesterol (mg/dl)
11. Thalach - maximum heart rate achieved
12. Oldpeak - ST depression induced by exercise relative to rest
13. Age in years
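As an illustration of this preprocessing, the sketch below loads the records with pandas, binarizes the diagnosis, one-hot encodes the categorical inputs, and performs the random 545/364 split; the file name and column names are hypothetical, not the paper's actual preprocessing script.

```python
import numpy as np
import pandas as pd

# Hypothetical loading of the Cleveland records; "cleveland.csv" and the
# column names are illustrative.
df = pd.read_csv("cleveland.csv")

# "Diagnosis" is the predictable attribute: 1 = heart disease, 0 = none.
# (Assumes the raw 0-4 severity coding is collapsed to a binary label.)
df["Diagnosis"] = (df["Diagnosis"] > 0).astype(int)

# One-hot encode the categorical inputs; "PatientID" is only a key, so drop it.
X = pd.get_dummies(df.drop(columns=["PatientID", "Diagnosis"]).astype("category"))
T = df["Diagnosis"].to_numpy()

# Random split (to avoid ordering bias) into 545 training and 364 testing records.
idx = np.random.default_rng(0).permutation(len(df))
train_idx, test_idx = idx[:545], idx[545:909]
X_train, T_train = X.to_numpy(float)[train_idx], T[train_idx]
X_test, T_test = X.to_numpy(float)[test_idx], T[test_idx]
```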


The simulations were carried out on a Pentium IV 3 GHz PC with 1 GB RAM, using MATLAB version 6.5.0 (R13). The following three algorithms were analyzed and simulated on the datasets:

1) The standard gradient descent with momentum (traingdm) from the MATLAB Neural Network Toolbox version 4.0.1.
2) The standard gradient descent with momentum (GDM).
3) The gradient descent with momentum and adaptive gain (GDM/AG).

For comparison with the standard optimization algorithms from the MATLAB Neural Network Toolbox, network parameters such as network size and architecture (number of nodes, hidden layers, etc.), initial weight values, and gain parameters were kept the same. The research focused on a neural network with one hidden layer of five hidden nodes, and a sigmoid activation function was used for all nodes. All algorithms were tested using the same initial weights, initialized randomly from the range [0, 1], and received the input patterns for training in the same sequence. For the gradient descent algorithms, the learning rate was 0.3 and the momentum term was 0.7. The initial value of the gain parameter was one.

For each run, the numerical data are stored in two files: the results file and the summary file. The results file lists data about each network. The number of iterations until convergence is accumulated for each algorithm, from which the mean, the standard deviation, and the number of failures are calculated. Networks that fail to converge are excluded from the calculations of the mean and standard deviation but are reported as failures.
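For reference, the GDM baselines differ from plain gradient descent only in the weight step, which blends the previous step with the new gradient-based direction. A one-line sketch with the stated values (learning rate 0.3, momentum 0.7) is given below, where dW is the current gradient-based direction and prev_dW is the step taken at the previous iteration.

```python
def momentum_step(dW, prev_dW, eta=0.3, mu=0.7):
    """One weight increment for gradient descent with momentum (GDM):
    delta_w(n) = mu * delta_w(n-1) + eta * d(n)."""
    return mu * prev_dW + eta * dW
```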

4.2 Validating Algorithm Effectiveness
The effectiveness of each algorithm was tested using a classification matrix, which displays the frequency of correct and incorrect predictions by comparing the actual values in the test dataset with the values predicted by the trained algorithm. In this experiment, the test dataset contained 208 patients with heart disease and 246 patients without heart disease. Figure 1 shows the classification matrices for all three algorithms. The rows represent predicted values and the columns represent actual values ('1' for patients with heart disease, '0' for patients with no heart disease); the diagonal values are the correct predictions.

Counts for traingdm:
Predicted | 0 (Actual) | 1 (Actual)
0         | 220        | 62
1         | 26         | 146

Counts for GDM:
Predicted | 0 (Actual) | 1 (Actual)
0         | 211        | 24
1         | 35         | 184

Counts for GDM/AG:
Predicted | 0 (Actual) | 1 (Actual)
0         | 211        | 20
1         | 35         | 188

Fig. 1. Classification matrices for the three algorithms
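The classification matrix itself is straightforward to compute. A small sketch (assuming 0/1 integer labels, with rows indexed by the predicted value and columns by the actual value, as in Fig. 1) is:

```python
import numpy as np

def classification_matrix(actual, predicted):
    """2x2 counts: rows = predicted value, columns = actual value."""
    m = np.zeros((2, 2), dtype=int)
    for a, p in zip(actual, predicted):
        m[p, a] += 1
    return m

def per_class_accuracy(m):
    """Correct predictions divided by actual class size, per class."""
    return m[0, 0] / m[:, 0].sum(), m[1, 1] / m[:, 1].sum()
```

Applied to the GDM/AG counts in Fig. 1, per_class_accuracy would return roughly 0.858 for the no-disease class (211/246) and 0.904 for the disease class (188/208), matching the percentages quoted below.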

TABLE 2. Algorithm results

Model    | Description                                                           | No. of cases | Prediction
traingdm | Patients with heart disease, predicted as having heart disease       | 146          | Correct
traingdm | Patients with no heart disease, predicted as having heart disease    | 26           | Incorrect
traingdm | Patients with no heart disease, predicted as having no heart disease | 220          | Correct
traingdm | Patients with heart disease, predicted as having no heart disease    | 62           | Incorrect
GDM      | Patients with heart disease, predicted as having heart disease       | 184          | Correct
GDM      | Patients with no heart disease, predicted as having heart disease    | 35           | Incorrect
GDM      | Patients with no heart disease, predicted as having no heart disease | 211          | Correct
GDM      | Patients with heart disease, predicted as having no heart disease    | 24           | Incorrect
GDM/AG   | Patients with heart disease, predicted as having heart disease       | 188          | Correct
GDM/AG   | Patients with no heart disease, predicted as having heart disease    | 35           | Incorrect
GDM/AG   | Patients with no heart disease, predicted as having no heart disease | 211          | Correct
GDM/AG   | Patients with heart disease, predicted as having no heart disease    | 20           | Incorrect

Table 2 summarizes the results of all three algorithms. The proposed algorithm (GDM/AG) appears to be the most effective, as it has the highest percentage of correct predictions (90.3%) for patients with heart disease, followed by GDM (with a difference of less than 2%) and traingdm. However, traingdm appears to be the most effective at predicting patients with no heart disease (89.4%), compared with the other algorithms.


4.3 Verifying Algorithm Convergence Speed
For each training dataset, 100 different trials were run, each with a different initial random set of weights. For each run, the number of iterations required for convergence is reported. For an experiment of 100 runs, the mean number of iterations, the standard deviation, and the number of failures are collected. A failure occurs when a network exceeds the maximum iteration limit: each experiment is run to one thousand iterations, except for back propagation, which is run to ten thousand iterations; otherwise it is halted and the run is reported as a failure. Convergence is achieved when the outputs of the network conform to the error criterion when compared with the desired outputs.
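This protocol can be sketched as follows, reusing the train loop from the sketch in Section 3.2; make_params stands for a hypothetical factory that draws a fresh random weight set per trial, and failed runs are excluded from the mean and standard deviation, as described above.

```python
import numpy as np

def run_trials(make_params, X, T, n_trials=100, max_epochs=1000):
    """Collect convergence statistics over repeated randomly initialized runs."""
    epochs, failures = [], 0
    for trial in range(n_trials):
        result = train(make_params(seed=trial), X, T, max_epochs=max_epochs)
        if result is None:
            failures += 1          # exceeded the iteration limit
        else:
            epochs.append(result)
    return np.mean(epochs), np.std(epochs), failures
```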

TABLE 3. Algorithm results for the heart disease classification problem (target error = 0.05)

Algorithm | Mean epochs | CPU time (s)/epoch | Total CPU time (s) to converge | Generalization accuracy (%) | Number of failures
traingdm  | 3467        | 2.69 x 10^-2       | 43.30                          | 94.01                       | 14
GDM       | 989         | 4.59 x 10^-2       | 34.60                          | 94.32                       | 4
GDM/AG    | 487         | 3.69 x 10^-2       | 21.42                          | 94.45                       | 3

Table 3 shows that the proposed algorithm (GDM/AG) outperforms the other algorithms in terms of CPU time and number of epochs. The proposed algorithm required only 487 epochs and 21.4203 seconds of CPU time to achieve the target error, whereas GDM required 989 epochs and 34.5935 seconds. The success rate for the proposed algorithm (GDM/AG) in learning the patterns was 97%, and the average number of learning iterations was reduced by a factor of up to 2.03 compared with GDM.

Fig. 2. 3D plot of the mean, standard deviation, and number of failures for the heart disease classification problem

Figure 2 shows the 3D plot of the results for the heart disease classification problem. The proposed algorithm (GDM/AG) shows better results because it converges in a smaller number of epochs, as indicated by the low value of the mean. Furthermore, the number of failures for GDM/AG is lower than for the other two algorithms. This makes GDM/AG a better choice for this problem, since it had only 3 failures in the 100 different runs.

5. CONCLUSION
A novel approach for improving the training efficiency of back propagation neural network algorithms has been presented in this paper; it adaptively modifies the initial search direction using a gain value in the activation function. The proposed algorithm is generic and can be implemented in almost all commonly used gradient-based optimization processes. Classification matrices were used to evaluate the effectiveness of the proposed algorithm, together with its convergence speed. All three algorithms were able to extract patterns in response to the predictable state. The most effective algorithm for predicting patients likely to have heart disease appears to be the proposed method (GDM/AG), followed by the other two algorithms. The results showed that the proposed algorithm is stable and has the potential to significantly enhance the computational efficiency of the training process.

ACKNOWLEDGEMENT
This work was supported by Universiti Tun Hussein Onn Malaysia (UTHM).

REFERENCES
[1] A. van Ooyen and B. Nienhuis, "Improving the convergence of the back-propagation algorithm," Neural Networks, vol. 5, pp. 465-471, 1992.
[2] M. Ahmad and F. M. A. Salam, "Supervised learning using the Cauchy energy function," International Conference on Fuzzy Logic and Neural Networks, 1992.
[3] P. Chandra and Y. Singh, "An activation function adapting training algorithm for sigmoidal feedforward networks," Neurocomputing, vol. 61, pp. 429-437, 2004.
[4] A. Krzyzak, W. Dai, and C. Y. Suen, "Classification of large set of handwritten characters using modified back propagation model," Proceedings of the International Joint Conference on Neural Networks, vol. 3, pp. 225-232, 1990.
[5] S. H. Oh, "Improving the error backpropagation algorithm with a modified error function," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 799-803, 1997.
[6] H.-M. Lee, T.-C. Huang, and C.-M. Chen, "Learning efficiency improvement of back propagation algorithm by error saturation prevention method," IJCNN '99, vol. 3, pp. 1737-1742, 1999.
[7] S.-H. Oh and Y. Lee, "A modified error function to improve the error back-propagation algorithm for multi-layer perceptrons," ETRI Journal, vol. 17, no. 1, pp. 11-22, 1995.
[8] S. M. Shamsuddin, M. Darus, and M. N. Sulaiman, "Classification of reduction invariants with improved back propagation," IJMMS, vol. 30, no. 4, pp. 239-247, 2002.
[9] S. C. Ng et al., "Fast convergence for back propagation network with magnified gradient function," Proceedings of the International Joint Conference on Neural Networks 2003, vol. 3, pp. 1903-1908, 2003.
[10] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, pp. 295-307, 1988.
[11] M. K. Weir, "A method for self-determination of adaptive learning rates in back propagation," Neural Networks, vol. 4, pp. 371-379, 1991.
[12] X. H. Yu, G. A. Chen, and S. X. Cheng, "Acceleration of backpropagation learning using optimized learning rate and momentum," Electronics Letters, vol. 29, no. 14, pp. 1288-1289, 1993.
[13] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[14] R. Fletcher and M. J. D. Powell, "A rapidly convergent descent method for minimization," British Computer J., pp. 163-168, 1963.
[15] R. Fletcher and C. M. Reeves, "Function minimization by conjugate gradients," Comput. J., vol. 7, no. 2, pp. 149-160, 1964.
[16] M. R. Hestenes and E. Stiefel, "Methods of conjugate gradients for solving linear systems," J. Research NBS, vol. 49, p. 409, 1952.
[17] H. Y. Huang, "A unified approach to quadratically convergent algorithms for function minimization," J. Optim. Theory Appl., vol. 5, pp. 405-423, 1970.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing, vol. 1, pp. 318-362, 1986.
[19] N. M. Nawi, M. R. Ransing, and R. S. Ransing, "An improved conjugate gradient based learning algorithm for back propagation neural networks," International Journal of Computational Intelligence, vol. 4, no. 1, pp. 46-55, March 2007.
[20] C. L. Blake, UCI Machine Learning Databases, http://mlearn.ics.uci.edu/database/heart-disease/, 2004.

Nazri Mohd Nawi received his B.S. degree in Computer Science from University of Science Malaysia (USM), Penang, Malaysia, and his M.Sc. degree in Computer Science from University of Technology Malaysia (UTM), Skudai, Johor, Malaysia. He received his Ph.D. degree from the Mechanical Engineering Department, Swansea University, Wales. He is currently a senior lecturer in the Software Engineering Department at Universiti Tun Hussein Onn Malaysia (UTHM). His research interests are in optimization, data mining techniques, and neural networks.

Rozaida Ghazali received her B.Sc. (Hons) degree in Computer Science from Universiti Sains Malaysia and her M.Sc. degree in Computer Science from Universiti Teknologi Malaysia. She obtained her Ph.D. degree in higher order neural networks for financial time series prediction from Liverpool John Moores University, UK. She is currently a senior lecturer at the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM). Her research areas include neural networks, financial time series prediction, and physical time series forecasting.

Mohd Najib Mohd Salleh has been a senior lecturer at the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), since 2001. He holds a bachelor's degree in Computer Science from Universiti Putra Malaysia (UPM) and received a Master's degree in Computer Science in Information Systems from Universiti Teknologi Malaysia (UTM). He received his Ph.D. in decision tree modeling with incomplete information in classification tasks in data mining from Universite De La Rochelle, France. His research interests include rough set theory, artificial intelligence in data mining, and knowledge discovery.