The Cleveland Clinic Foundation heart disease dataset is available for
determining the accuracy rate in India. For the present study, however, records of
about 500 diabetic patients were initially collected from the Seshiah Diabetic
Research Institute in Chennai, India, to perform the experiments. The clinical data
set specification provides concise, unambiguous definitions for items related to
diabetes.
The diabetes attributes used in our Naive Bayes experiment and their
descriptions are shown in Table 3.1.
Excel or other database types of files. The raw data is then converted into data
sets with a few appropriate characteristics.
Most datasets contain missing values, and many causes contribute to this. Raw
data also usually contains a great deal of noise, that is, random error or variance in
a measured variable, so it cannot be passed directly to machine-learning
algorithms. Data cleaning routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
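The cleaning steps above can be sketched as follows. This is only an illustration, not the procedure used in the study: filling gaps with the mean and flagging outliers with the interquartile-range rule are assumed choices, and the function names and the sample readings are hypothetical.

```python
from statistics import mean, quantiles

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical fasting glucose readings with one gap and one extreme value
fasting = [95, 110, None, 102, 480, 99, 105]
cleaned = fill_missing(fasting)
outliers = flag_outliers(cleaned)
```

Flagged values would still need a domain decision (correct, keep or discard); the sketch only identifies them.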
The data must be cleaned and filtered before a data mining algorithm is
applied, to avoid the creation of deceptive or inappropriate rules or patterns. The
data must also be transformed to make it appropriate for the mining process.
Data integration merges data from multiple sources into a coherent data
store, such as a data warehouse or a data cube. Careful integration helps to reduce
and avoid redundancies and inconsistencies in the resulting data set, which
improves the accuracy and speed of the subsequent mining process. Data reduction
shrinks the data by aggregating it and eliminating redundant features. With data
mining techniques, the focus is on specific fields that allow exploration of the
data: some fields are selected and filtered as input fields, output fields and
predictive fields.
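A minimal sketch of integration and field selection is given below, assuming each source is a dictionary keyed by a patient identifier. The source names, field names and values are hypothetical and do not come from the study data.

```python
def integrate(source_a, source_b):
    """Merge two {patient_id: record} sources into one record per patient.

    When both sources carry the same field, the later source wins.
    """
    merged = {}
    for source in (source_a, source_b):
        for pid, rec in source.items():
            merged.setdefault(pid, {}).update(rec)
    return merged

def select_fields(records, fields):
    """Keep only the chosen fields; fields absent from a record become None."""
    return {pid: {f: rec.get(f) for f in fields} for pid, rec in records.items()}

# Hypothetical clinic and laboratory sources sharing patient identifiers
clinic = {1: {"age": 52, "bp": 130}}
lab = {1: {"ldl": 140, "age": 52}, 2: {"ldl": 100}}

merged = integrate(clinic, lab)
reduced = select_fields(merged, ["age", "ldl"])
```

Merging on a shared key stores the duplicated age field once, and the selection step leaves only the fields chosen for mining.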
All attributes used in the Naive Bayes experiments are listed in Table 3.1 and,
with the exception of sex and family heredity, have numeric values. The attribute
sex takes the value M or F to denote male or female respectively. The attribute
family heredity takes the value Father, Mother or Both. When the patient has no
previous family history of diabetes, this attribute is left empty in the raw data.
Since no attribute value should be left empty for the mining algorithm to work
properly, the value No can be used for such patients. Likewise, a categorical
attribute is needed as the basis for classifying the data sets.
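The replacement of empty heredity values can be sketched as below. The record layout and the key name heredity are assumptions made for illustration; only the values Father, Mother, Both and No come from the text.

```python
def fill_heredity(records):
    """Return copies of the records with an empty heredity set to 'No'."""
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the raw records stay unchanged
        if not rec.get("heredity"):
            rec["heredity"] = "No"
        filled.append(rec)
    return filled

# Hypothetical patient records; the last two have no recorded family history
patients = [
    {"sex": "M", "heredity": "Father"},
    {"sex": "F", "heredity": ""},
    {"sex": "F", "heredity": None},
]
filled = fill_heredity(patients)
```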
The aim of the present work is to predict the chances of a diabetic patient
getting heart disease. Hence, the LP Tot Y/N attribute has been taken as the class
attribute. Since the LP Tot attribute is numeric, its values have been categorized
as high cholesterol (Yes) or low cholesterol (No). In the data exploration mode,
almost all attribute selection modules applicable to the data have been explored
with a view to finding an optimal subset of attributes for predicting the risk of
diabetic patients getting various types of heart disease.
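The categorization of the numeric LP Tot value into a Yes/No class can be sketched as below. The cut-off used in the study is not stated in the text; the threshold of 200 here is purely an assumed placeholder, and the function name is hypothetical.

```python
def label_cholesterol(lp_tot, threshold=200.0):
    """Map a numeric total-cholesterol value to 'Yes' (high) or 'No' (low).

    The default threshold is an assumed placeholder, not the study's cut-off.
    """
    return "Yes" if lp_tot >= threshold else "No"

# Hypothetical LP Tot readings
labels = [label_cholesterol(v) for v in [150, 200, 245]]
```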
3.2 SUPPORT VECTOR MACHINE APPROACH
In the second experiment of the present work, a support vector machine
(SVM) classifier with a radial basis function kernel has been used to diagnose the
vulnerability of diabetic patients to heart disease. Most of these systems have
successfully employed SVM for classification. On this evidence, the SVM
classifier has been used in the experiments of the present work. The proposed
system exhibits good accuracy in predicting the vulnerability of diabetic patients
to heart disease.
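For reference, the radial basis function kernel measures the similarity of two attribute vectors x and z as exp(-gamma * ||x - z||^2). The sketch below computes it directly; the value of gamma and the sample vectors are assumptions, since the text does not report the kernel parameters used.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical normalized attribute vectors for two patients
same = rbf_kernel([0.2, 0.4], [0.2, 0.4])   # identical vectors give 1.0
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0])  # squared distance 2, so exp(-1)
```

The kernel equals 1 for identical records and decays towards 0 as records move apart, with gamma controlling how quickly.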
3.2.1 Data set and Parameters used in SVM
The methodology described here diagnoses the vulnerability of diabetic
patients to heart disease, using the records of about 500 diabetic patients from the
experiment in Section 3.1.1. The diabetes attributes used in the proposed SVM
system and their descriptions are shown in Table 3.2. All attribute roles are regular
except for the vulnerability attribute, which is the label.
Table 3.2 Diabetes attributes used in our Support vector machine experiments
Attribute role
Regular
Regular
Regular
Regular
Regular
Regular
Regular
Regular
Regular
Regular
Label
Of the 500 records, 142 relate to patients who are highly vulnerable to heart
disease, and the remaining 358 to patients found to be less vulnerable.
Since SVM processes only numeric attributes, the nominal attributes were
converted to numeric attributes by replacing each value with a unique integer. For
example, the values of the attribute sex are converted as follows: male to 1 and
female to 0. The attribute values are then normalized to the range 0 to 1. These
records were given as input to the SVM classifier, and the performance of the
support vector algorithm was analyzed. On the basis of this analysis, this SVM
model can be recommended for the classification of the diabetic dataset.
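The two preprocessing steps described above, integer encoding of nominal values and normalization to the range 0 to 1, can be sketched as follows. Min-max scaling is assumed as the normalization method, since the text only states the target range; the function names and sample values are hypothetical.

```python
def encode_nominal(values, mapping):
    """Replace each nominal value by its integer code."""
    return [mapping[v] for v in values]

def min_max(values):
    """Rescale numeric values linearly to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

# Sex coded as in the text (male 1, female 0); the ages are hypothetical
sex = encode_nominal(["Male", "Female", "Male"], {"Male": 1, "Female": 0})
age = min_max([30, 45, 60])
```

Scaling every attribute to a common range keeps attributes with large numeric values from dominating the distance computation inside the kernel.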
Table 3.4 Data structure and summary used in our NB, SVM, DT experiments
Attribute
Sex
Age
Heredity
Smoking
Alcohol
BP
Fasting
PP
A1C
LP Tot
LDL
VLDL
TGL
Cholesterol HDL
attribute_Label