
JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 20, ISSUE 1, AUGUST 2013 1

Integrating Clustering with Different Data Mining Techniques in the Diagnosis of Heart Disease
Mai Shouman, Tim Turner and Rob Stocker
Abstract: Heart disease has been the leading cause of death in the world over the past 10 years. The research presented here is part of work to develop tools that assist healthcare practitioners to diagnose heart disease earlier, in the hope of earlier intervention in this preventable killer. The relative accuracy of common data mining techniques in heart disease diagnosis is difficult to assess from the literature. This research investigates the performance of decision tree, naïve bayes, and k-nearest neighbour in the diagnosis of heart disease patients. It then assesses the performance enhancement obtained by integrating clustering techniques. Testing was conducted over a standardized dataset used widely in the literature. The results show that integrating clustering with decision tree, naïve bayes, and k-nearest neighbour can enhance their accuracy in diagnosing heart disease patients. Importantly, the results establish that the ensemble of two-cluster inlier k-means clustering with the k-nearest neighbour technique was the most effective at heart disease diagnosis.

Index Terms - Data Mining, Decision Tree, Naïve Bayes, K-nearest Neighbour, K-Means Clustering, Heart Disease Diagnosis.

1 INTRODUCTION

Heart disease has been the leading cause of death in the world over the past 10 years. Moreover, the World Health Organization has reported that heart disease is the leading cause of death in both high and low income countries [1]. The European Public Health Alliance reports that heart attacks and other circulatory diseases account for 41% of all deaths [2]. The Economic and Social Commission for Asia and the Pacific found that in one fifth of Asian countries, most lives are lost to non-communicable diseases such as cardiovascular disease, cancer, and diabetes [3]. Statistics South Africa reports that heart and circulatory system diseases are the third leading cause of death in Africa [4]. The Australian Bureau of Statistics reported that heart and circulatory system diseases are the leading cause of death in Australia, causing 33.7% of all deaths [5]. This high level of mortality is a tragedy, most particularly because heart diseases are typically eminently treatable if detected early [6]. Motivated by the increasing mortality of heart disease patients world-wide each year and the availability of a huge amount of patient data that could be used to extract useful knowledge, researchers have been using data mining techniques to help health care professionals in the diagnosis of heart disease [7-8]. Data mining is an essential step in knowledge discovery. It involves the exploration of large datasets to extract hidden and previously unknown patterns, relationships and knowledge that are difficult to detect with traditional statistical methods [9-13]. The application of data mining is rapidly spreading in a wide range of sectors such as the analysis of organic compounds, financial forecasting, weather forecasting, and

healthcare [14]. Data mining in healthcare is an emerging field of high importance for providing prognosis and a deeper understanding of medical data. Researchers are using data mining techniques in the medical diagnosis of several diseases such as diabetes [15], stroke [16], cancer [17], and heart disease [18]. Several data mining techniques have been used in the diagnosis of heart disease, such as naïve bayes, decision tree, k-nearest neighbour, neural network, kernel density, the bagging algorithm, and support vector machine, showing different levels of accuracy [19-24]. Recently, researchers have suggested that integrating more than one data mining technique can enhance performance in the diagnosis of heart disease patients [18, 25-27]. Although heart diseases are among the most common chronic diseases, causing a high incidence of death all over the world, they have also been identified as among the most preventable [28]. Early detection and healthy behaviours play an important role in preventing and controlling the effects of these diseases [6, 28-29]. There is a need for accurate systematic tools that identify patients at high risk and provide information for early intervention [30]. This research investigates applying different single and hybrid data mining techniques in the diagnosis of heart disease patients, and whether integrating clustering with different data mining techniques can provide better performance than the single techniques. The rest of the paper is divided as follows: the background section briefly reviews applying data mining techniques in the diagnosis of heart disease; the methodology section explains the different single


and hybrid data mining techniques used in the diagnosis of heart disease patients; the validation and evaluation section presents the procedure used to measure the stability of the proposed model; the heart disease data section introduces the de facto standard dataset used; and the results section presents the results of a systematic investigation of single and hybrid data mining techniques used in the diagnosis of heart disease patients which is followed by the discussion of the presented results. Finally, we draw conclusions from this preliminary investigation to emphasize the importance of this systematic study in establishing a baseline for research and practice, and point to future work.

2 BACKGROUND
Researchers have been investigating the use of statistical analysis and data mining techniques to help healthcare professionals in the diagnosis of heart disease for many years. Statistical analysis has identified the risk factors associated with heart disease as age, blood pressure, smoking [31], cholesterol [32], diabetes [33], family history of heart disease [34], obesity, and lack of physical activity [35]. Knowledge of these risk factors helps health care professionals to identify patients at high risk of having heart disease. Although statistical methods help in identifying the risk factors associated with heart disease, they are not sufficient for diagnosing patients. Data mining can play an important role in the diagnosis of heart disease patients. Researchers have applied different data mining techniques over different heart disease datasets to help health care professionals in the diagnosis of heart disease [18-19, 22-24, 36]. Unfortunately, in general, the results of the different data mining research efforts cannot be compared because they use different datasets [37]. However, over time, a benchmark data set has arisen in the literature: the Cleveland Heart Disease Dataset (CHDD) [38]. Researchers have used the CHDD to apply different data mining techniques such as decision tree, naïve bayes, the bagging algorithm, support vector machine, and neural network, showing different levels of accuracy [19, 23, 36, 39-40]. Recently, researchers have investigated whether integrating more than one data mining technique can enhance performance in the diagnosis of heart disease patients on the CHDD. Polat et al. [41] used an artificial immune recognition system (AIRS), and other research used fuzzy AIRS and k-nearest neighbour [26]. Kavitha et al. presented a new system for the detection of heart disease that uses a feed-forward neural network and a genetic algorithm; the results showed that the proposed hybridization is more stable [42]. This research investigates integrating clustering with different data mining techniques to identify whether this integration can enhance their performance in the diagnosis of heart disease patients.

K-means clustering is one of the most popular and well known clustering techniques. Its simplicity and reliable behavior make it popular in many applications [43]. Initial centroid selection is a critical issue in k-means clustering and strongly affects its results [44]. This research applies three common data mining techniques, decision tree, naïve bayes, and k-nearest neighbour, as single data mining techniques. It then investigates integrating k-means clustering with different initial centroid selection methods as well as different numbers of clusters in the diagnosis of heart disease patients. Importantly, our research thoroughly investigates each hybridization, testing which technique and which initial centroid selection method provide better performance in diagnosing heart disease patients, and whether applying different numbers of clusters provides different performance.

3 METHODOLOGY
This section discusses the single and hybrid data mining techniques used in the diagnosis of heart disease patients, as shown in Figure 1. For the single data mining techniques, decision tree, naïve bayes, and k-nearest neighbour are discussed. For the hybrid techniques, the integration of each with k-means clustering under five different initial centroid selection methods is discussed.

FIGURE 1: PROPOSED APPROACH

3.1 Single Data Mining Techniques

3.1.1 Decision Tree
The decision tree technique cannot deal with continuous attributes, so they need to be converted into discrete attributes, a process called discretization. Dougherty et al. carried out a comparative study between two unsupervised and two supervised discretization methods using 16 data sets, showing that the differences between the


classification accuracies achieved by different discretization methods are not statistically significant [43]. Equal frequency discretization is a popular and successful unsupervised discretization method [42]. Previous related research has shown that this discretization method provides marginally better accuracy when applied on the CHDD (author reference). Equal frequency discretization is therefore used as a preprocessing step to convert the continuous heart disease attributes to discrete attributes.

The entropy (Information Gain) approach selects the splitting attribute that minimizes the value of entropy, thus maximizing the Information Gain. To identify the splitting attribute of the decision tree, the Information Gain is calculated for each attribute and the attribute that maximizes it is selected. The entropy is calculated using the following formula [9, 45]:

E = - Σ (i = 1 to k) Pi log2 Pi    (1)

where k is the number of classes of the target attribute and Pi is the number of occurrences of class i divided by the total number of instances (i.e. the probability of i occurring). The Information Gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values [9]. To reduce the effect of this bias, a variant known as Gain Ratio was introduced by Quinlan [45]. Gain Ratio adjusts the Information Gain for each attribute to allow for the breadth and uniformity of the attribute values:

Gain Ratio = Information Gain / Split Information    (2)

where the Split Information is a value based on the column sums of the frequency table [45]. After extracting the decision tree rules, reduced error pruning is used to prune them. Reduced error pruning is one of the fastest pruning methods and is known to produce both accurate and small decision rules [46]. Applying reduced error pruning yields more compact decision rules and reduces the number of extracted rules.
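The entropy and gain ratio calculations above can be sketched as follows. This is an illustrative example, not the authors' code; the toy chest-pain/diagnosis data is invented.

```python
# Illustrative sketch (not the authors' code) of the entropy, information
# gain, and gain ratio calculations; the chest-pain/diagnosis data is invented.
from collections import Counter
from math import log2

def entropy(labels):
    """E = -sum over the k classes of Pi * log2(Pi), Equation (1)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """Information Gain / Split Information, Equation (2)."""
    total = len(labels)
    subsets = {}
    for v, c in zip(attribute_values, labels):
        subsets.setdefault(v, []).append(c)
    # Expected entropy after splitting on the attribute.
    split_entropy = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - split_entropy
    # Split Information penalises attributes with many distinct values.
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info else 0.0

# Hypothetical discretized chest pain type (cp) against the diagnosis class.
cp = [1, 1, 2, 2, 3, 3, 4, 4]
diagnosis = [0, 0, 0, 1, 0, 1, 1, 1]
print(gain_ratio(cp, diagnosis))  # → 0.25
```

The splitting attribute would be the one with the largest gain ratio among all candidate attributes.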

3.1.2 Naïve Bayes
Naïve bayes is one of the data mining techniques that shows considerable success in classification problems, and especially in diagnosing heart disease patients [19, 22]. Naïve bayes is based on probability theory and seeks to find the most likely classification [43]. Classification is based on the prior probability of the target attribute and the conditional probabilities of the remaining attributes. From the training data, the prior and conditional probabilities are calculated for each class. For each testing instance, the probability is calculated for each of the target attribute values, and the target attribute value with the largest probability is selected. The probability of the testing instance for a target attribute value is calculated using the following formula:

P(v = ci) = P(ci) Π (j = 1 to n) P(aj = vj | class = ci)    (3)

where v is the testing instance, ci is the target attribute value, aj is a data attribute and vj is its value [45].

3.1.3 K-Nearest Neighbour
K-nearest neighbour is one of the simplest and most straightforward data mining techniques. It is called memory-based classification, as the training examples need to be in memory at run-time [47]. If a is the first instance, denoted by (a1, a2, a3, ..., an), and b is the second instance, denoted by (b1, b2, b3, ..., bn), the distance between them is calculated by the following formula:

d(a, b) = sqrt((a1 - b1)² + (a2 - b2)² + ... + (an - bn)²)    (4)

K-nearest neighbour usually deals with continuous attributes; however, it can also deal with discrete attributes. When dealing with discrete attributes, if the attribute values for the two instances are different, the difference between them is equal to one, otherwise it is equal to zero. When dealing with continuous attributes, the difference between the attributes is calculated using the Euclidean distance (Equation 4). A major problem when using the Euclidean distance formula is that attributes with large value ranges swamp those with smaller ones. For example, in heart disease records, the cholesterol measure (mg/dl) ranges between 100 and 190 while the age measure (years) ranges between 40 and 80, so the influence of the cholesterol measure will be greater than that of age. To overcome this problem, the continuous attributes are normalized so that they have the same influence on the distance measure between instances. The value a1 of an attribute A is normalized as:

a1' = (a1 - min) / (max - min)    (5)

where min is the minimum value of the attribute A, and max is the maximum value of the attribute A [45].
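The k-nearest neighbour distance and min-max normalization described above can be sketched as follows. This is an illustrative example, not the authors' implementation, and the patient values are invented.

```python
# Illustrative sketch (not the authors' implementation) of k-nearest neighbour
# with min-max normalization (Equation 5) and a mixed distance (Equation 4):
# 0/1 differences for discrete attributes, squared differences for continuous
# ones. The patient values are invented.
from collections import Counter
from math import sqrt

def normalize(column):
    """a' = (a - min) / (max - min), Equation (5)."""
    lo, hi = min(column), max(column)
    return [(a - lo) / (hi - lo) for a in column]

def distance(a, b, is_discrete):
    total = 0.0
    for x, y, d in zip(a, b, is_discrete):
        # Discrete attributes contribute 1 if different, 0 if equal.
        total += (0.0 if x == y else 1.0) if d else (x - y) ** 2
    return sqrt(total)

def knn_predict(train_X, train_y, query, k, is_discrete):
    """Majority class among the k training instances nearest to the query."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda t: distance(t[0], query, is_discrete))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# (normalized age, normalized cholesterol, sex) with the diagnosis class.
X = [(0.1, 0.2, 1), (0.2, 0.1, 0), (0.8, 0.9, 1), (0.9, 0.8, 0)]
y = [0, 0, 1, 1]
print(knn_predict(X, y, (0.85, 0.85, 1), k=3, is_discrete=(False, False, True)))  # → 1
```

With normalization applied beforehand, age and cholesterol contribute comparably to the distance, as the text requires.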

3.2 Hybrid K-Means Clustering Data Mining Techniques
K-means clustering is one of the most popular and well known clustering techniques because of its simplicity and good behavior in many applications [43, 45]. Several researchers have identified age, blood pressure, and cholesterol as critical risk factors associated with heart disease [31, 34-35]. These attributes are obvious clustering attributes for heart disease patients. The number of clusters used in the k-means in this investigation ranged between two and six. Initial centroid selection is a critical factor that strongly affects k-means clustering performance. Selecting the initial centroids in an intelligent way helps optimize the performance of the k-means


clustering algorithm [44, 48-50]. Researchers have investigated enhancing k-means clustering performance through intelligent initial centroid selection with various techniques. Poomagal and Hamsapriya proposed optimizing k-means clustering by selecting the initial centroids through an intelligent mathematical model, showing that the proposed method produces high quality clusters [49]. Immaculate Mary and Kasmir Raja proposed improving k-means clustering through the integration of ant colony optimization: the proposed model creates the initial centroids based on the mode value of the data and then applies the k-means algorithm, after which the ant colony optimization algorithm is applied to refine cluster quality [51]. Pavan, Rao et al. proposed a single-pass seed selection method that produces a single, optimal solution that is outlier insensitive; the algorithm extends k-means by choosing initial seeds with specific probabilities, showing effectiveness in the clustering results [48]. Although researchers have investigated enhancing initial centroid selection for k-means clustering through various techniques, no previous research has examined initial centroid selection methods for k-means clustering in the diagnosis of heart disease patients. Consequently, this research investigates integrating k-means clustering, under different conventional initial centroid selection methods, with decision tree, naïve bayes, and k-nearest neighbour in the diagnosis of heart disease patients. The generation of initial centroids used in this research is based on actual sample data points using the Inlier Method, Outlier Method, Range Method, Random Attribute Method, and Random Row Method [52]. The differences between these initial centroid selection methods are discussed in the following sections.

3.2.1 Inlier Method Initial Centroid Selection
In generating the initial K centroids using the inlier method, the following equations are used:

ci = Min(X)i where 0 ≤ i ≤ k    (6)
cj = Min(Y)j where 0 ≤ j ≤ k    (7)

where the initial centroid is C(ci, cj), and Min(X) and Min(Y) are the minimum values of attribute X and attribute Y, respectively. K represents the number of clusters.

3.2.2 Outlier Method Initial Centroid Selection
In generating the initial K centroids using the outlier method, the following equations are used:

ci = Max(X)i where 0 ≤ i ≤ k    (8)
cj = Max(Y)j where 0 ≤ j ≤ k    (9)

where the initial centroid is C(ci, cj), and Max(X) and Max(Y) are the maximum values of attribute X and attribute Y, respectively. K represents the number of clusters.

3.2.3 Range Method Initial Centroid Selection
In generating the initial K centroids using the range method, the following equations are used:

ci = ((Max(X) - Min(X)) / K) * n where 0 ≤ i ≤ k    (10)
cj = ((Max(Y) - Min(Y)) / K) * n where 0 ≤ j ≤ k    (11)

The initial centroid is C(ci, cj). Max(X) and Min(X) are the maximum and minimum values of attribute X, and Max(Y) and Min(Y) are the maximum and minimum values of attribute Y. K represents the number of clusters, and n varies from 1 to k.

3.2.4 Random Attribute Method Initial Centroid Selection
In generating the initial K centroids using the random attribute method, the following equations are used:

ci = random(X) where 1 ≤ i ≤ k    (12)
cj = random(Y) where 1 ≤ j ≤ k    (13)

The initial centroid is C(ci, cj). The values of i and j vary from 1 to k.

3.2.5 Random Row Method Initial Centroid Selection
In generating the initial K centroids using the random row method, the following equations are used:

I = random(V) where 1 ≤ V ≤ N    (14)
ci = X(I)    (15)
cj = Y(I)    (16)

The initial centroid is C(ci, cj). N is the number of instances in the training dataset. X(I) and Y(I) are the values of the attributes X and Y, respectively, for the instance I. For the random attribute and random row methods, ten runs are executed, and the average and best result for each method are calculated and used as the results.

4 VALIDATION AND EVALUATION
To measure the stability of the proposed model, the data is divided into training and testing data with 10-fold cross-validation. To evaluate the performance of the proposed model, the sensitivity, specificity, and accuracy are calculated. The sensitivity is the proportion of positive instances that are correctly classified as positive (e.g. the proportion of sick people classified as sick). The specificity is the proportion of negative instances that are correctly classified as negative (e.g. the proportion of healthy people classified as healthy). The accuracy is the proportion of instances that are correctly classified [45]:

Sensitivity = True Positive / Positive    (17)
Specificity = True Negative / Negative    (18)
Accuracy = (True Positive + True Negative) / (Positive + Negative)    (19)

The average and standard deviation of the sensitivity, specificity, and accuracy of the different single and hybrid data mining techniques are calculated.
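The five initial centroid selection methods can be sketched as below, assuming two clustering attributes X and Y (e.g. age and cholesterol). Reading Min(X)i and Max(X)i as the i-th smallest and i-th largest sample values is an interpretation of the equations, and all data values are invented.

```python
# Illustrative sketch of the five initial centroid selection methods for two
# clustering attributes X and Y (e.g. age and cholesterol). Reading
# Min(X)i / Max(X)i as the i-th smallest / largest sample values is an
# interpretation of Equations (6)-(9); the data values are invented.
import random

def inlier(X, Y, k):
    # Eqs. (6)-(7): the k smallest values of each attribute.
    xs, ys = sorted(X), sorted(Y)
    return [(xs[i], ys[i]) for i in range(k)]

def outlier(X, Y, k):
    # Eqs. (8)-(9): the k largest values of each attribute.
    xs, ys = sorted(X, reverse=True), sorted(Y, reverse=True)
    return [(xs[i], ys[i]) for i in range(k)]

def range_method(X, Y, k):
    # Eqs. (10)-(11): multiples n = 1..k of the attribute range divided by K.
    return [(((max(X) - min(X)) / k) * n, ((max(Y) - min(Y)) / k) * n)
            for n in range(1, k + 1)]

def random_attribute(X, Y, k):
    # Eqs. (12)-(13): attribute values drawn independently at random.
    return [(random.choice(X), random.choice(Y)) for _ in range(k)]

def random_row(X, Y, k):
    # Eqs. (14)-(16): both coordinates taken from the same random instance I.
    rows = [random.randrange(len(X)) for _ in range(k)]
    return [(X[i], Y[i]) for i in rows]

ages = [40, 45, 52, 58, 63, 70]
chol = [180, 210, 195, 240, 260, 230]
print(inlier(ages, chol, 2))        # → [(40, 180), (45, 195)]
print(range_method(ages, chol, 2))  # → [(15.0, 40.0), (30.0, 80.0)]
```

For the random attribute and random row methods, the ten runs described above would correspond to calling the functions repeatedly and averaging the resulting accuracies.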

5 HEART DISEASE DATA
The data used in this study is the Cleveland Clinic Foundation heart disease data set, available at http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The data set has 76 raw attributes; however, all of the published experiments refer to only 13 of them. The data set contains 303 rows, of which 297 are complete. Six rows contain missing values and are removed from the experiment. The attributes used in this study are shown in Table 1. In the Type column, C represents continuous and D represents discrete.

TABLE 1: SELECTED CLEVELAND HEART DISEASE DATA SET ATTRIBUTES

Name         Type  Description
Age          C     Age in years
Sex          D     1 = male; 0 = female
Cp           D     Chest pain type: 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic
Trestbps     C     Resting blood pressure (in mm Hg)
Chol         C     Serum cholesterol in mg/dl
Fbs          D     Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false
Restecg      D     Resting electrocardiographic results: 0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy
Thalach      C     Maximum heart rate achieved
Exang        D     Exercise induced angina: 1 = yes; 0 = no
Old peak ST  C     Depression induced by exercise relative to rest
Slope        D     The slope of the peak exercise segment: 1 = up sloping; 2 = flat; 3 = down sloping
Ca           D     Number of major vessels colored by fluoroscopy, ranging between 0 and 3
Thal         D     3 = normal; 6 = fixed defect; 7 = reversible defect
Diagnosis    D     Diagnosis classes: 0 = healthy; 1 = patient who is subject to possible heart disease

6 RESULTS
This section discusses the results achieved when integrating clustering with the decision tree, naïve bayes, and k-nearest neighbour techniques in the diagnosis of heart disease patients.

6.1 Single Data Mining Techniques
The sensitivity, specificity, and accuracy achieved in the diagnosis of heart disease using the gain ratio decision tree, naïve bayes, and k-nearest neighbour are shown in Table 2. Naïve bayes achieves the best results, followed by k-nearest neighbour and decision tree, as shown in Figure 2. Naïve bayes as a single technique shows a mean accuracy of 83.5% (standard deviation of 5.2%), as shown in Table 2.

TABLE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES RESULTS

                          Sensitivity      Specificity      Accuracy
Technique Name            Mean   St Dev    Mean   St Dev    Mean   St Dev
Gain Ratio Decision Tree  75.6   6.1       81.6   12.1      79.1   5.8
Naïve Bayes               78     13.8      80.8   12.6      83.5   5.2
KNN K=19                  76.7   10.7      85.1   7.5       83.2   4.1

FIGURE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES (accuracy of the gain ratio decision tree, naïve bayes, and KNN K=19)
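The evaluation protocol from the validation and evaluation section (Equations 17-19 with 10-fold cross-validation) can be sketched as follows; the label vectors below are invented for illustration.

```python
# Sketch of the evaluation protocol: 10-fold splitting plus the sensitivity,
# specificity, and accuracy measures of Equations (17)-(19). The label
# vectors are invented for illustration.
def confusion_metrics(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    pos = sum(1 for a in actual if a == 1)
    neg = len(actual) - pos
    return {
        "sensitivity": tp / pos,              # True Positive / Positive   (17)
        "specificity": tn / neg,              # True Negative / Negative   (18)
        "accuracy": (tp + tn) / (pos + neg),  # (TP + TN) / (P + N)        (19)
    }

def ten_fold_indices(n, folds=10):
    """Split instance indices 0..n-1 into near-equal contiguous folds."""
    base, rem = divmod(n, folds)
    out, start = [], 0
    for f in range(folds):
        size = base + (1 if f < rem else 0)
        out.append(list(range(start, start + size)))
        start += size
    return out

actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
print(confusion_metrics(actual, predicted))
# The 297 complete CHDD rows split into fold sizes:
print([len(f) for f in ten_fold_indices(297)])  # → [30, 30, 30, 30, 30, 30, 30, 29, 29, 29]
```

The reported means and standard deviations would then be computed over the ten per-fold metric values.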

6.2 Clustering With Data Mining Techniques
Different clustering columns were tried, involving age, blood pressure, and cholesterol; using age as the clustering column showed the best results. The sensitivity, specificity, and accuracy achieved in the diagnosis of heart disease using k-means clustering with the different data mining techniques, initial centroid selection methods, and numbers of clusters are presented next.
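A minimal sketch of the hybrid scheme is given below, under the assumption that training instances are clustered on the clustering attribute by 1-D k-means, one classifier is then trained per cluster, and a test instance is routed to its nearest centroid. The routing rule and all data values are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the hybrid scheme: 1-D k-means on the clustering
# attribute (age), after which one classifier per cluster would be trained.
# The routing of a test patient to the nearest centroid and all data values
# are illustrative assumptions, not details given in the paper.
def kmeans_1d(values, centroids, iters=20):
    centroids = list(centroids)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in values:
            groups[min(range(len(centroids)),
                       key=lambda i: abs(v - centroids[i]))].append(v)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def assign(v, centroids):
    """Index of the nearest centroid for value v."""
    return min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))

ages = [40, 42, 44, 61, 65, 69]
labels = [0, 0, 1, 0, 1, 1]
# Inlier/outlier style seeding: start from the minimum and maximum ages.
cents = kmeans_1d(ages, centroids=[min(ages), max(ages)])
print(cents)  # → [42.0, 65.0]
# Diagnosis labels grouped per cluster, ready for per-cluster training of a
# decision tree, naive bayes, or k-nearest neighbour model.
clusters = [[lab for v, lab in zip(ages, labels) if assign(v, cents) == i]
            for i in range(len(cents))]
print(clusters)           # → [[0, 0, 1], [0, 1, 1]]
print(assign(63, cents))  # → 1 (test patient routed to the older cluster)
```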


6.2.1 Integrating Clustering With Gain Ratio Decision Tree
Table 3 presents the results of integrating the gain ratio decision tree with k-means clustering under different initial centroid selection methods and different numbers of clusters. Integrating the gain ratio decision tree with k-means clustering could enhance its accuracy in the diagnosis of heart disease patients. The best result for the gain ratio decision tree is achieved by the two-cluster inlier initial centroid selection method, showing a mean accuracy of 81.2% (standard deviation of 6.2%), as shown in Table 3. This is a 2.1% increase in mean accuracy compared with the single gain ratio decision tree, which shows 79.1%.

TABLE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE

                        Sensitivity      Specificity      Accuracy
Technique/Clusters No   Mean   St Dev    Mean   St Dev    Mean   St Dev
Gain Ratio (single)     75.6   6.1       81.6   12.1      79.1   5.8
Inlier 2                75.9   7.2       85.1   11.4      81.2   6.2
Inlier 3                75.8   8.5       83.4   14.6      80.8   8.6
Inlier 4                71.5   11.8      78.1   12.3      77.5   8.6
Inlier 5                71.5   11.8      78.1   12.3      77.5   8.6
Outlier 2               76     9.6       80.3   13.4      78.7   6.7
Outlier 3               75.8   8.5       83.4   14.6      80.8   8.6
Outlier 4               73.2   11.2      84.1   11.8      79.7   8.2
Outlier 5               73.2   11.2      84.1   11.8      79.7   8.2
Range 2                 76     9.6       80.3   13.4      78.7   6.7
Range 3                 76.2   10.8      76     13.2      78.9   8.5
Range 4                 71.9   12        78.1   12.3      77.8   8.6
Range 5                 71.9   12        78.1   12.3      77.8   8.6
Random Row 2            76.1   8.9       82.8   9.6       80.1   4.7
Random Row 3            69.8   16.5      77     13.1      76.3   9.3
Random Row 4            73.2   9         79.1   10.5      78.4   8.3
Random Row 5            69.3   13.9      79.5   9.6       77.1   7.8
Random Attribute 2      74.3   8         84     12.2      80.1   7
Random Attribute 3      70.5   10.8      81.7   12.4      79.4   8.8
Random Attribute 4      68.1   15.9      80.8   11.3      77     9.1
Random Attribute 5      69.3   13.9      79.5   9.6       77.1   7.8

Figure 3 shows the mean accuracy of integrating the gain ratio decision tree with k-means clustering under the different initial centroid selection methods and numbers of clusters. There is no specific trend between increasing the number of clusters and the increase or decrease of the accuracy, as shown in Figure 3.

FIGURE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE (mean accuracy per initial centroid selection method and number of clusters)

6.2.2 Clustering and Naïve Bayes
Table 4 presents the results of integrating naïve bayes with k-means clustering under different initial centroid selection methods and different numbers of clusters. Integrating naïve bayes with k-means clustering could enhance its accuracy in the diagnosis of heart disease patients. The best result for naïve bayes is achieved by the three-cluster random row initial centroid selection method, showing a mean accuracy of 84.8% (standard deviation of 4.7%), as shown in Table 4. This is a 1.3% increase in mean accuracy compared with the single naïve bayes, which shows 83.5%.

TABLE 4: INTEGRATING CLUSTERING WITH NAÏVE BAYES

                        Sensitivity      Specificity      Accuracy
Technique/Clusters No   Mean   St Dev    Mean   St Dev    Mean   St Dev
Naïve Bayes (single)    78     13.8      80.8   12.6      83.5   5.2
Inlier 2                73.2   22        83.3   12.2      83.1   4.7
Inlier 3                75.9   14.7      83.7   12.6      83.6   4.4
Inlier 4                76.9   13.5      81.7   13.1      83.5   3.1
Inlier 5                76.9   13.5      81.7   13.1      83.5   3.1
Outlier 2               76.4   15.2      82.3   13.1      83.4   4.5
Outlier 3               75.9   14.7      83.7   12.6      83.6   4.4
Outlier 4               70.1   17.4      83.1   12.9      80.7   6.8
Outlier 5               70.1   17.4      83.1   12.9      80.7   6.8
Range 2                 76.4   15.2      82.3   13.1      83.4   4.5
Range 3                 78     13.1      81.3   14.6      84.1   4.3
Range 4                 76.9   13.5      81.7   13.1      83.5   3.1
Range 5                 76.9   13.5      81.7   13.1      83.5   3.1
Random Row 2            76.5   14.6      82     12.7      83.1   3.4
Random Row 3            76.6   14.4      85     13.2      84.8   4.7
Random Row 4            71.4   19.6      83     13.1      81.6   6.3
Random Row 5            71.1   14.8      81.1   13.1      80.4   4.7
Random Attribute 2      76.4   15.2      83.8   12.5      84     4.4
Random Attribute 3      74.3   17.6      82.2   14.1      82.3   6
Random Attribute 4      70.9   20.5      83.5   12.3      82     5.7
Random Attribute 5      71.1   14.8      81.1   13.1      80.4   4.7

Figure 4 shows the mean accuracy of integrating naïve bayes with k-means clustering under the different initial centroid selection methods and numbers of clusters. There is no specific trend between increasing the number of clusters and the increase or decrease of the accuracy, as shown in Figure 4.

FIGURE 4: INTEGRATING CLUSTERING WITH NAÏVE BAYES (mean accuracy per initial centroid selection method and number of clusters)

6.2.3 Clustering and K-nearest Neighbour
Table 5 presents the results of integrating k-nearest neighbour with k-means clustering under different initial centroid selection methods and different numbers of clusters. Integrating k-nearest neighbour with k-means clustering could enhance its accuracy in the diagnosis of heart disease patients. The best result for k-nearest neighbour is achieved by the two-cluster inlier initial centroid selection method, showing a mean accuracy of 85.7% (standard deviation of 5.4%), as shown in Table 5. This is a 2.5% increase in mean accuracy compared with the single k-nearest neighbour, which shows 83.2%. Figure 5 shows the mean accuracy of integrating k-nearest neighbour with k-means clustering under the different initial centroid selection methods and numbers of clusters. There is no specific trend between increasing the number of clusters and the increase or decrease of the accuracy, as shown in Figure 5.

TABLE 5: INTEGRATING CLUSTERING WITH K NEAREST NEIGHBOUR

                        Sensitivity      Specificity      Accuracy
Technique/Clusters No   Mean   St Dev    Mean   St Dev    Mean   St Dev
KNN (single)            76.7   10.7      85.1   7.5       83.2   4.1
Inlier 2                77.8   13        88.8   5.9       85.7   5.4
Inlier 3                75.9   17.7      84.2   9.8       83.6   5.6
Inlier 4                77.8   13        88.8   5.9       85.7   5.4
Inlier 5                77.9   12.9      86     10.3      84.7   4.9
Outlier 2               79.8   11.7      86.6   6.4       85.7   5.4
Outlier 3               63.8   15.7      78.1   12.9      75.3   6.4
Outlier 4               61.3   13.6      76.1   11.9      72.8   5.8
Outlier 5               67.2   11.9      79     12.9      76.7   5.7
Range 2                 61.9   14        77.4   12.4      74.2   5.2
Range 3                 63.4   14.2      79.2   12.5      75.2   6.3
Range 4                 50.5   15        81.1   8.5       70     7.8
Range 5                 67.4   13.2      71.6   11.2      72.1   6.2
Random Row 2            67.2   11.9      79     12.9      76.7   5.7
Random Row 3            57.3   16.2      82.9   11.1      72.8   8.2
Random Row 4            57.5   17        78.9   14.2      74.1   5.9
Random Row 5            50.5   15.9      81.8   8.7       70.4   7.5
Random Attribute 2      66.5   14.9      71.4   11.2      70.7   5.2
Random Attribute 3      50.5   15        80.7   8.7       69.7   7.9
Random Attribute 4      56     16.1      80     16.2      70.1   7.4
Random Attribute 5      52.7   21.4      75.6   13.5      67.8   6.4

FIGURE 5: INTEGRATING CLUSTERING WITH K NEAREST NEIGHBOUR (mean accuracy per initial centroid selection method and number of clusters)

7 DISCUSSION
Our systematic investigation seeks to establish a reference point for decision tree, naïve bayes, and k-nearest neighbour performance in the diagnosis of heart disease patients. It investigates different initial centroid selection methods as well as different numbers of clusters for k-means clustering combined with decision tree, naïve bayes, and k-nearest neighbour. The research objective is to determine which hybrid model provides better accuracy in the diagnosis of heart disease patients. When comparing the single data mining techniques, naïve bayes showed the best accuracy, followed by k-nearest neighbour and decision tree, as shown in Figure 6. When integrating k-means clustering with the different initial centroid selection methods, k-nearest neighbour showed the best results, followed by naïve bayes and decision tree, as shown in Figure 6. Integrating k-means clustering with the different data mining techniques could enhance their performance in the diagnosis of heart disease patients, as shown in Figure 6, where each data mining technique is shown in a different color for its single and hybrid variants. The increase in accuracy between each single and hybrid data mining technique varied from one technique to


another. The best increase in accuracy among the hybrid data mining techniques is achieved by k-nearest neighbour, showing a 2.5% enhancement of the mean accuracy, from 83.2% to 85.7%, as shown in Figure 6.

[FIGURE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES - bar chart of the mean accuracies listed in Table 6]

TABLE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES

Technique | Mean Accuracy | St Dev | Increase in Accuracy | T-test Significance
Single Decision Tree | 79.1 | 5.8 | - | -
2 Clusters Inlier Decision Tree | 81.2 | 6.2 | 2.1 | Yes (t=4.28, p<=0.05)
Single Naïve Bayes | 83.5 | 5.2 | - | -
3 Clusters Random Row Naïve Bayes | 84.8 | 4.7 | 1.3 | Yes (t=3.21, p<=0.05)
Single K=19 Nearest Neighbour | 83.2 | 4.1 | - | -
2 Clusters Inlier K=19 Nearest Neighbour | 85.7 | 5.4 | 2.5 | Yes (t=6.38, p<=0.05)
K=19 (2 Clusters Inlier) vs Naïve Bayes (3 Clusters Random Row) | 85.7 vs 84.8 | 5.4 vs 4.7 | 0.9 | Yes (t=2.17, p<=0.05)
K=19 (2 Clusters Inlier) vs Decision Tree (2 Clusters Inlier) | 85.7 vs 81.2 | 5.4 vs 6.2 | 4.5 | Yes (t=9.48, p<=0.05)

Table 6 shows the increase in accuracy of the different data mining techniques when integrated with k-means clustering. The best increase is achieved by the k-nearest neighbour, followed by the gain ratio decision tree and naïve bayes, with increases of 2.5%, 2.1%, and 1.3% respectively. Table 6 also shows the t-test significance obtained when comparing each single data mining technique with its integration with k-means clustering. The t-test is calculated at the 95% confidence level, and shows a significant increase in accuracy between each single data mining technique and its k-means hybrid. It also shows a significant difference between the integrated k-means clustering k-nearest neighbour and both the integrated k-means clustering naïve bayes and decision tree. Although the decision tree showed the lowest accuracy in the diagnosis of heart disease patients, it can still be very useful because it produces rules that explain how patients are diagnosed as healthy or sick; the resulting rule sets offer an explanation for each diagnosis. Naïve bayes and the k-nearest neighbour are based on probability and on distance measures between instances respectively, so they provide no diagnostic explanation of why a testing instance is classified as healthy or sick.
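The significance figures above come from two-sample t-tests over mean accuracies. As a hedged illustration of how such a statistic can be computed, the sketch below applies Welch's two-sample t-statistic to made-up per-fold accuracies; the actual per-fold values behind Table 6 are not published here, so the numbers are purely illustrative.

```python
import math

def t_statistic(a, b):
    """Welch's two-sample t-statistic for two lists of per-fold accuracies."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (mb - ma) / math.sqrt(va / na + vb / nb)

# Illustrative per-fold accuracies for a single classifier vs. its
# k-means hybrid (NOT the paper's raw folds, which are not given here).
single = [78.0, 80.5, 82.1, 84.0, 83.4]
hybrid = [84.9, 86.2, 85.1, 86.8, 85.5]

t = t_statistic(single, hybrid)
# |t| is then compared against the critical value for the relevant
# degrees of freedom at the 95% level to decide significance.
print(round(t, 2))
```

A paired t-test over matched folds would also be a reasonable choice when both techniques are evaluated on identical cross-validation splits.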

The inlier k-means clustering method shows the best results among the initial centroid selection methods, performing best when integrated with the k-nearest neighbour. The random row method shows the best results among the initial centroid selection methods when integrated with naïve bayes. When the different initial centroid selection methods are integrated with the different data mining techniques, the best result for each technique is achieved with a different number of clusters; no single number of clusters gives the best results for every hybrid technique. Comparing the integration of two clusters inlier k-means clustering with the k-nearest neighbour against previously published results, this integration performs better, achieving 85.7% as shown in Table 7. In particular, it outperforms the bagging algorithm, which achieved 81.41% (Table 7).
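As we describe it, the hybrid technique partitions the training data with k-means and then applies the base classifier only within the cluster nearest to each test instance. The sketch below is a minimal pure-Python reading of that pipeline for the two-cluster k-means with k-nearest neighbour ensemble; the toy data, the small k value, and the naive first-k-points seeding are all illustrative assumptions (the paper's inlier centroid seeding is not reproduced here).

```python
import math
from collections import Counter

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Plain k-means, seeded with the first k points (illustrative only;
    the inlier initial centroid selection method is not reproduced)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        for i, c in enumerate(clusters):
            if c:  # recompute each centroid as the mean of its members
                centroids[i] = [sum(col) / len(c) for col in zip(*c)]
    return centroids

def knn_predict(train, query, k):
    """Majority vote over the k nearest labelled training rows."""
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def hybrid_predict(train, query, n_clusters=2, k=3):
    """Cluster the training data, then run k-NN only inside the cluster
    nearest to the query - the integration scheme discussed above."""
    centroids = kmeans([f for f, _ in train], n_clusters)
    cid = min(range(n_clusters), key=lambda i: dist(query, centroids[i]))
    members = [row for row in train
               if min(range(n_clusters),
                      key=lambda i: dist(row[0], centroids[i])) == cid]
    return knn_predict(members or train, query, k)

# Toy two-feature rows standing in for patient attributes (illustrative only).
train = [([1.0, 1.2], "healthy"), ([0.9, 1.0], "healthy"), ([1.1, 0.8], "healthy"),
         ([5.0, 5.2], "sick"), ([5.1, 4.9], "sick"), ([4.8, 5.0], "sick")]
print(hybrid_predict(train, [5.0, 5.1]))  # prints "sick"
```

Restricting the neighbour search to one cluster is also why the effective training set can shrink to "only scores of records", the limitation noted in the conclusion.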

TABLE 7: COMPARING INTEGRATING K-MEANS CLUSTERING AND DIFFERENT DATA MINING TECHNIQUES WITH PREVIOUS PUBLISHED RESULTS

Author/Year | Technique | Accuracy
Cheung, 2001 | Decision Tree | 81.11%
Cheung, 2001 | Naïve Bayes | 81.48%
Tu, et al., 2009 | J4.8 Decision Tree | 78.91%
Tu, et al., 2009 | Bagging Algorithm | 81.41%
Our Work Best Results | 2 Clusters Inlier K=19 Nearest Neighbour | 85.7%

Although integrating k-means clustering with decision tree, naïve bayes, and k-nearest neighbour achieves an enhancement in accuracy, the number of instances in the CHDD is relatively small. A larger dataset is needed to determine whether the two clusters inlier initial centroid selection method with k-nearest neighbour would still provide the best results. In addition, the target attribute of the CHDD has only two values: healthy and sick. Further investigation is needed to establish whether there is a relationship between the number of clusters yielding the best results and the number of values of the target attribute.

8 CONCLUSION
This research is part of an effort to provide assistance to healthcare professionals in the early diagnosis of heart disease. Early diagnosis offers the opportunity for early intervention in this preventable killer disease. Our research indicates that the two clusters inlier initial centroid selection method with k=19 nearest neighbour offers a significantly better level of prediction accuracy in heart disease diagnosis than the other single or hybrid techniques. Our work has limitations, and further research is underway to address them. In particular, the CHDD is a relatively small dataset for data mining (297 rows). Furthermore, our hybridization techniques partitioned that dataset further, sometimes down to only scores of records. Our further work will repeat the investigation on another, larger dataset to validate these results.

REFERENCES
1. World Health Organization. 2007 [accessed 7 February 2011]; Available from: http://www.who.int/mediacentre/factsheets/fs310.pdf
2. European Public Health Alliance. 2010 [accessed 7 February 2011]; Available from: http://www.epha.org/a/2352
3. ESCAP. [accessed 7 February 2011]; Available from: http://www.unescap.org/stat/data/syb2009/9.Health-risks-causes-of-death.asp
4. Statistics South Africa. 2008 [accessed 7 February 2011]; Available from: http://www.statssa.gov.za/publications/P03093/P030932006.pdf
5. Australian Bureau of Statistics. 2010 [accessed 7 February 2011]; Available from: http://www.ausstats.abs.gov.au/Ausstats/subscriber.nsf/0/E8510D1C8DC1AE1CCA2576F600139288/$File/33030_2008.pdf
6. Department of Health & Aging, A.G. August 2012; Available from: http://www.agedcareaustralia.gov.au/internet/agedcare/publishing.nsf/Content/Prevention+and+awareness-2
7. Helma, C., E. Gottmann, and S. Kramer, Knowledge discovery and data mining in toxicology. Statistical Methods in Medical Research, 2000.
8. Podgorelec, V., et al., Decision Trees: An Overview and Their Use in Medicine. Journal of Medical Systems, 2002. Vol. 26.
9. Han, J. and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006.
10. Lee, I.-N., S.-C. Liao, and M. Embrechts, Data mining techniques applied to medical information. Medical Informatics, 2000.
11. Obenshain, M.K., Application of Data Mining Techniques to Healthcare Data. Infection Control and Hospital Epidemiology, 2004.
12. Sandhya, J., et al., Classification of Neurodegenerative Disorders Based on Major Risk Factors Employing Machine Learning Techniques. International Journal of Engineering and Technology, 2010. Vol. 2, No. 4.
13. Thuraisingham, B., A Primer for Understanding and Applying Data Mining. IT Professional, IEEE, 2000.
14. Ashby, D. and A. Smith, The Best Medicine? Plus Magazine: Living Mathematics, 2005.
15. Porter, T. and B. Green, Identifying Diabetic Patients: A Data Mining Approach. Americas Conference on Information Systems, 2009.
16. Panzarasa, S., et al., Data mining techniques for analyzing stroke care processes. Proceedings of the 13th World Congress on Medical Informatics, 2010.
17. Li, L., H. Tang, Z. Wu, J. Gong, M. Gruidl, J. Zou, M. Tockman, and R.A. Clark, Data mining techniques for cancer detection using serum proteomic profiling. Artificial Intelligence in Medicine, Elsevier, 2004.
18. Das, R., I. Turkoglu, and A. Sengur, Effective diagnosis of heart disease through neural networks ensembles. Expert Systems with Applications, Elsevier, 2009. 36: p. 7675-7680.
19. Andreeva, P., Data Modelling and Specific Rule Generation via Data Mining Techniques. International Conference on Computer Systems and Technologies (CompSysTech), 2006.
20. Hara, A. and T. Ichimura, Data Mining by Soft Computing Methods for the Coronary Heart Disease Database. Fourth International Workshop on Computational Intelligence & Applications, IEEE, 2008.
21. Rajkumar, A. and G.S. Reena, Diagnosis of Heart Disease Using Datamining Algorithm. Global Journal of Computer Science and Technology, 2010. Vol. 10, Issue 10.
22. Sitar-Taut, V.A., et al., Using machine learning algorithms in cardiovascular disease risk evaluation. Journal of Applied Computer Science & Mathematics, 2009.
23. Srinivas, K., B.K. Rani, and A. Govrdhan, Applications of Data Mining Techniques in Healthcare and Prediction of Heart Attacks. International Journal on Computer Science and Engineering (IJCSE), 2010. Vol. 02, No. 02: p. 250-255.
24. Yan, H., et al., Development of a decision support system for heart disease diagnosis using multilayer perceptron. Proceedings of the 2003 International Symposium, 2003. Vol. 5: p. V-709-V-712.
25. Anbarasi, M., E. Anupriya, and N.C.S.N. Iyengar, Enhanced Prediction of Heart Disease with Feature Subset Selection using Genetic Algorithm. International Journal of Engineering Science and Technology, 2010. Vol. 2(10).
26. Polat, K., S. Sahan, and S. Gunes, Automatic detection of heart disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applications, 2007. 32: p. 625-631.
27. Xiong, W. and C. Wang, Feature Selection: A Hybrid Approach Based on Self-adaptive Ant Colony and Support Vector Machine. International Conference on Computer Science and Software Engineering, IEEE, 2008.
28. Centers for Disease Control and Prevention (CDC). August 2012; Available from: http://www.cdc.gov/nccdphp/
29. American Cancer Society, B.C.F.F. August 2012; Available from: http://www.cancer.org/downloads/STT/F861009_final 9-08-09.pdf
30. Paladugu, S. and C. Shyu, Temporal mining framework for risk reduction and early detection of chronic diseases. 2010.
31. Heller, R.F., et al., How well can we predict coronary heart disease? Findings in the United Kingdom Heart Disease Prevention Project. British Medical Journal, 1984.
32. Wilson, P.W.F., et al., Prediction of Coronary Heart Disease Using Risk Factor Categories. American Heart Association Journal, 1998.
33. Simons, L.A., et al., Risk functions for prediction of cardiovascular disease in elderly Australians: the Dubbo Study. Medical Journal of Australia, 2003. 178.
34. Salahuddin and F. Rabbi, Statistical Analysis of Risk Factors for Cardiovascular Disease in Malakand Division. Pak. j. stat. oper. res., 2006. Vol. II: p. 49-56.
35. Shahwan-Akl, L., Cardiovascular Disease Risk Factors among Adult Australian-Lebanese in Melbourne. International Journal of Research in Nursing, 2010. 6(1).
36. Tu, M.C., D. Shin, and D. Shin, Effective Diagnosis of Heart Disease through Bagging Approach. Biomedical Engineering and Informatics, IEEE, 2009.
37. Shouman, M., T. Turner, and R. Stocker, Using data mining techniques in heart disease diagnosis and treatment. Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC), 2012: p. 173-177.
38. Blake, C.L. and C.J. Merz, UCI Repository of Machine Learning Databases. University of California, Irvine, Department of Information and Computer Sciences, 1998.
39. Parthiban, L. and R. Subramanian, Intelligent Heart Disease Prediction System using CANFIS and Genetic Algorithm. International Journal of Biological and Life Sciences, 2007.
40. Patil, S.B. and Y.S. Kumaraswamy, Extraction of Significant Patterns from Heart Disease Warehouses for Heart Attack Prediction. International Journal of Computer Science and Network Security (IJCSNS), 2009. Vol. 9, No. 2.
41. Polat, K., et al., A new classification method to diagnosis heart disease: Supervised artificial immune system (AIRS). Proceedings of the Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN), 2005.
42. Kavitha, K.S., K.V. Ramakrishnan, and M.K. Singh, Modeling and design of evolutionary neural network for heart disease detection. International Journal of Computer Science Issues (IJCSI), 2010. Vol. 7, Issue 5.
43. Wu, X., et al., Top 10 algorithms in data mining. Knowledge and Information Systems, 2007.
44. Tajunisha, N. and V. Saravanan, A new approach to improve the clustering accuracy using informative genes for unsupervised microarray data sets. International Journal of Advanced Science and Technology, 2011.
45. Bramer, M., Principles of Data Mining. Springer, 2007.
46. Esposito, F., D. Malerba, and G. Semeraro, A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997. Vol. 19, No. 5.
47. Alpaydin, E., Voting over Multiple Condensed Nearest Neighbors. Artificial Intelligence Review, 1997. 11: p. 115-132.
48. Pavan, K.K., et al., Robust seed selection algorithm for k-means type algorithms. International Journal of Computer Science & Information Technology (IJCSIT), 2011. Vol. 3, No. 5.
49. Poomagal, S. and T. Hamsapriya, Optimized k-means clustering with intelligent initial centroid selection for web search using URL and tag contents. Proceedings of the International Conference on Web Intelligence, Mining and Semantics, ACM, Sogndal, Norway, 2011: p. 1-8.
50. Santhi, M.V.B.T., et al., Enhancing K-Means Clustering Algorithm. International Journal of Computer Science & Technology, 2011.
51. Immaculate Mary, C. and S.V. Kasmir Raja, Refinement of clusters from k-means with ant colony optimization. Journal of Theoretical and Applied Information Technology, 2009.
52. Khan, D.M. and N. Mohamudally, A Multiagent System (MAS) for the Generation of Initial Centroids for k-means Clustering Data Mining Algorithm based on Actual Sample Datapoints. Journal of Next Generation Information Technology, August 2010. Vol. 1, No. 2.