
CHAPTER 1

INTRODUCTION

In day-to-day life, many factors affect the human heart. Problems arise at a rapid pace, and new heart diseases are rapidly being identified. In today's stressful world, the heart, the organ that pumps blood through the body for circulation, is essential, and its health must be preserved for healthy living. The condition of a person's heart reflects the experiences of that person's life and depends heavily on his or her professional and personal habits.
There may also be several genetic factors through which a type of heart disease is passed down through generations. According to the World Health Organization, more than 12 million deaths occur worldwide every year due to the various types of heart disease, collectively known as cardiovascular disease. The term heart disease covers many diverse conditions that specifically affect the heart and the arteries of a human being. Even young people in their twenties and thirties are being affected by heart disease.
The increase in the possibility of heart disease among the young may be due to bad eating habits, lack of sleep, restlessness, depression, and numerous other factors such as obesity, poor diet, family history, high blood pressure, high blood cholesterol, sedentary behavior, smoking and hypertension. Diagnosing heart disease is both very important and among the most complicated tasks in the medical field. All of the factors mentioned above are taken into consideration when a doctor analyzes a patient through manual check-ups at regular intervals. The symptoms of heart disease depend greatly on which discomfort an individual feels, and some symptoms are not easily recognized by the layperson.
However, common symptoms include chest pain, breathlessness, and heart palpitations.
The chest pain common to many types of heart disease is known as angina, or angina pectoris,
and occurs when a part of the heart does not receive enough oxygen. Angina may be triggered by
stressful events or physical exertion and normally lasts under 10 minutes. Heart attacks can also
occur as a result of different types of heart disease. The signs of a heart attack are similar to
angina except that they can occur during rest and tend to be more severe. The symptoms of a
heart attack can sometimes resemble indigestion.
Heartburn and a stomach ache can occur, as well as a heavy feeling in the chest. Other
symptoms of a heart attack include pain that travels through the body, for example from the chest
to the arms, neck, back, abdomen, or jaw, lightheadedness and dizzy sensations, profuse
sweating, nausea and vomiting.
Heart failure is also an outcome of heart disease, and breathlessness can occur when the
heart becomes too weak to circulate blood. Some heart conditions occur with no symptoms at all,
especially in older adults and individuals with diabetes. The term 'congenital heart disease'
covers a range of conditions, but the general symptoms include sweating, high levels of fatigue,
fast heartbeat and breathing, breathlessness, and chest pain. However, these symptoms might not
develop until a person is older than 13 years. In these types of cases, the diagnosis becomes an
intricate task requiring great experience and high skill. If the risk of a heart attack or the possibility of heart disease is identified early, patients can take precautions and regulatory measures. Recently, the healthcare industry has been generating huge amounts of data about patients, and their disease-diagnosis reports are increasingly being used for the prediction of heart attacks worldwide. When the volume of heart disease data is huge, machine learning techniques can be applied for the analysis.
Data mining is the task of extracting vital decision-making information from a collection of past records for future analysis or prediction. The information may be hidden and not identifiable without the use of data mining. Classification is one data mining technique through which future outcomes or predictions can be made based on the historical data available. Medical data mining makes it possible to integrate classification techniques and provide computerized training on a dataset, which in turn leads to the discovery of hidden patterns in medical data sets that are used to predict a patient's future state. Thus, medical data mining can provide insights into a patient's history and deliver clinical support through the analysis. These patterns are essential for the clinical analysis of patients. In simple terms, medical data mining uses classification algorithms that are vital for identifying the possibility of a heart attack before it occurs.
The classification algorithms can be trained and tested to make predictions that determine whether a person is likely to be affected by heart disease. In this research work, supervised machine learning is utilized for making the predictions. A comparative analysis of three data mining classification algorithms, namely Random Forest, Decision Tree and Naïve Bayes, is used to make the predictions. The analysis is performed at several levels of cross-validation and at several percentages of the percentage-split evaluation method.
The StatLog dataset from the UCI Machine Learning Repository is used for making heart disease predictions in this research work. The predictions are made using the classification model built from the classification algorithms after training on the heart disease dataset. This final model can be used to predict various types of heart disease.
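The two evaluation protocols mentioned above, k-fold cross-validation and percentage split, can be sketched in a few lines of Python. The classifier below is a deliberately trivial majority-class predictor standing in for Random Forest, Decision Tree or Naïve Bayes, and the labels are synthetic stand-ins for the StatLog records, so the numbers printed are illustrative only.

```python
import random

def train_majority(labels):
    """Stand-in 'model': remembers the most frequent training class."""
    return max(set(labels), key=labels.count)

def accuracy(predicted_class, labels):
    """Fraction of labels matched by a constant prediction."""
    return sum(y == predicted_class for y in labels) / len(labels)

def percentage_split(labels, train_frac, seed=0):
    """Percentage-split evaluation: shuffle, train on the first
    train_frac of the records, test on the remainder."""
    rnd = random.Random(seed)
    idx = list(range(len(labels)))
    rnd.shuffle(idx)
    cut = int(len(idx) * train_frac)
    train = [labels[i] for i in idx[:cut]]
    test = [labels[i] for i in idx[cut:]]
    return accuracy(train_majority(train), test)

def k_fold_cv(labels, k, seed=0):
    """k-fold cross-validation: each fold is held out once for testing."""
    rnd = random.Random(seed)
    idx = list(range(len(labels)))
    rnd.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = [labels[j] for j in folds[i]]
        train = [labels[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(accuracy(train_majority(train), test))
    return sum(scores) / k

# Synthetic class labels: 1 = heart disease present, 0 = absent.
labels = [1] * 150 + [0] * 120
print(round(percentage_split(labels, 0.66), 3))
print(round(k_fold_cv(labels, 10), 3))
```

In practice the same two loops wrap a real learner; only `train_majority` would change.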
CHAPTER 2
LITERATURE SURVEY
2.1 PREDICTING SURVIVAL CAUSES AFTER OUT OF HOSPITAL CARDIAC
ARREST USING DATA MINING METHOD
AUTHORS
FRANCK LE DUFF
CRISTIAN MUNTEAN
MARC CUGGIA
PHILIPPE MABO
In this paper [1] the authors stated that the prognosis of life for patients with heart failure
remains poor. By using data mining methods, the purpose of this study was to evaluate the most
important criteria for predicting patient survival and to profile patients to estimate their survival
chances together with the most appropriate technique for health care. METHODS: Five hundred and thirty-three patients who had suffered from cardiac arrest were included in the analysis.
They performed classical statistical analysis and data mining analysis using mainly
Bayesian networks. RESULTS: The mean age of the 533 patients was 63 (± 17) and the sample
was composed of 390 (73 %) men and 143 (27 %) women. Cardiac arrest was observed at home
for 411 (77 %) patients, in a public place for 62 (12 %) patients and on a public highway for 60
(11 %) patients. The belief network of the variables showed that the probability of remaining
alive after heart failure is directly associated to five variables: age, sex, the initial cardiac rhythm,
the origin of the heart failure and specialized resuscitation techniques employed.
CONCLUSIONS: Data mining methods could help clinicians to predict the survival of
patients and then adapt their practices accordingly. This work could be carried out for each
medical procedure or medical problem and it would become possible to build a decision tree
rapidly with the data of a service or a physician. The comparison between classic analysis and
data mining analysis showed us the contribution of the data mining method for sorting variables
and quickly conclude on the importance or the impact of the data and variables on the criterion
of the study. The main limit of the method is knowledge acquisition and the necessity to gather
sufficient data to produce a relevant model.
Cardiac arrest is defined as a spontaneous, irreversible arrest of the general circulation caused by cardiac inefficacy. It is recognized by the absence of the femoral pulse for more than 5
seconds. Without resuscitation, cardiac arrest leads to sudden cardiac death. The public health
impact of sudden cardiac death is heavy since the survival rate is estimated at between 1 and 20
% for cardiac arrest patients. This represents 300,000 to 400,000 deaths a year in the United
States and 20,000 to 30,000 deaths in France. The profile of the patient is now well known since
it generally concerns men from about 45 to 75 years. Hospitalization must be optimal and fast. According to the type of cardiac attack, the care procedure can vary, and some studies [11] show the advantage of certain techniques over others, according to the cause
of the cardiac arrest. Heart disease is the number one cause of death in the U.S. According to the
American Heart Association, an estimated 1,100,000 people in the U.S. will have a coronary
attack each year.
Ninety five percent of sudden cardiac arrest victims die before reaching the hospital and
heart disease claims more lives each year than the following six leading causes of death
combined (cancer, chronic lower respiratory diseases, accidents, diabetes mellitus, influenza and
pneumonia). Almost 150,000 people in the U.S. who die from heart disease each year are under
the age of 65. These data show the interest for predicting the risk of death after heart failure and
the need to analyze the events that occurred during care to provide prognostic information.
Classic statistical analyses have already been done and provide some information about
epidemiology of heart failure and its causes. This paper presents the use of a probabilistic, statistical approach to profile heart failure in a sample of patients and to predict the impact of certain events in the care process.
They concluded that the use of the Bayesian network in medical analysis could be useful for exploring data and finding hidden relationships between events or characteristics of the sample. It is a first approach for discussing hypotheses between clinicians and statistical experts. The main limit of these tools is the necessity of having enough data to find regularity in the relationships.
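The probabilistic idea reported above can be illustrated with a minimal naive-Bayes-style estimate of P(alive) from counts over the five variables the belief network singled out. The handful of records below are invented stand-ins, not the study's 533 patients, and a real Bayesian network would model dependencies between the variables rather than assuming independence as this sketch does.

```python
from collections import Counter, defaultdict

# Invented records: (age_group, sex, initial_rhythm, origin, specialized_resus, alive)
records = [
    ("<65", "M", "VF", "cardiac", True, True),
    ("<65", "F", "VF", "cardiac", True, True),
    (">=65", "M", "asystole", "cardiac", False, False),
    (">=65", "M", "asystole", "other", False, False),
    ("<65", "M", "VF", "other", True, True),
    (">=65", "F", "asystole", "cardiac", True, False),
]

def fit(records):
    """Count class priors and per-feature conditional frequencies."""
    prior = Counter(r[-1] for r in records)
    cond = defaultdict(Counter)  # key (feature_index, class) -> value counts
    for r in records:
        cls = r[-1]
        for i, v in enumerate(r[:-1]):
            cond[(i, cls)][v] += 1
    return prior, cond

def posterior_alive(prior, cond, case):
    """P(alive | case) under the independence assumption, with Laplace smoothing."""
    scores = {}
    for cls in prior:
        p = prior[cls] / sum(prior.values())
        for i, v in enumerate(case):
            counts = cond[(i, cls)]
            p *= (counts[v] + 1) / (sum(counts.values()) + 2)
        scores[cls] = p
    return scores[True] / sum(scores.values())

prior, cond = fit(records)
print(round(posterior_alive(prior, cond, ("<65", "M", "VF", "cardiac", True)), 2))
```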
2.2 KNOWLEDGE DISCOVERY IN DATABASES: AN OVERVIEW
AUTHORS
WILLIAM J. FRAWLEY
GREGORY PIATETSKY-SHAPIRO
CHRISTOPHER J. MATHEUS
In this paper [2] the authors stated that after a decade of fundamental interdisciplinary
research in machine learning, the spadework in this field has been done; the 1990s should see the
widespread exploitation of knowledge discovery as an aid to assembling knowledge bases. The
contributors to the AAAI Press book Knowledge Discovery in Databases were excited at the
potential benefits of this research. The editors hope that some of this excitement will
communicate itself to AI Magazine readers of this article.
It has been estimated that the amount of information in the world doubles every 20
months. The size and number of databases probably increases even faster. In 1989, the total
number of databases in the world was estimated at five million, although most of them are small
DBASE III databases. The automation of business activities produces an ever-increasing stream
of data because even simple transactions, such as a telephone call, the use of a credit card, or a
medical test, are typically recorded in a computer.
Scientific and government databases are also rapidly growing. The National Aeronautics
and Space Administration already has much more data than it can analyze. Earth observation
satellites, planned for the 1990s, are expected to generate one terabyte (10^15 bytes) of data every
day— more than all previous missions combined. At a rate of one picture each second, it would
take a person several years (working nights and weekends) just to look at the pictures generated
in one day. In biology, the federally funded Human Genome project will store thousands of bytes
for each of the several billion genetic bases.
Closer to everyday lives, the 1990 U.S. census data of a million million bytes encode
patterns that in hidden ways describe the lifestyles and subcultures of today’s United States.
What are we supposed to do with this flood of raw data? Clearly, little of it will ever be seen by
human eyes. If it will be understood at all, it will have to be analyzed by computers. Although
simple statistical techniques for data analysis were developed long ago, advanced techniques for
intelligent data analysis are not yet mature.
As a result, there is a growing gap between data generation and data understanding. At
the same time, there is a growing realization and expectation that data, intelligently analyzed and
presented, will be a valuable resource to be used for a competitive advantage. The computer
science community is responding to both the scientific and practical challenges presented by the
need to find the knowledge adrift in the flood of data.
In assessing the potential of AI technologies, Michie (1990), a leading European expert
on machine learning, predicted that “the next area that is going to explode is the use of machine
learning tools as a component of large-scale data analysis.” A recent National Science
Foundation workshop on the future of database research ranked data mining among the most
promising research topics for the 1990s.
Some research methods are already well enough developed to have been made part of
commercially available software. Several expert system shells use variations of ID3 for inducing
rules from examples. Other systems use inductive, neural net, or genetic learning approaches to
discover patterns in personal computer databases. Many forward-looking companies are using
these and other tools to analyze their databases for interesting and useful patterns.
American Airlines searches its frequent flyer database to find its better customers,
targeting them for specific marketing promotions. Farm Journal analyzes its subscriber database
and uses advanced printing technology to custom-build hundreds of editions tailored to particular
groups. Several banks, using patterns discovered in loan and credit histories, have derived better
loan approval and bankruptcy prediction methods. General Motors is using a database of
automobile trouble reports to derive diagnostic expert systems for various models. Packaged-goods manufacturers are searching the supermarket scanner data to measure the effects of their promotions and to look for shopping patterns.
A combination of business and research interests has produced increasing demands for,
as well as increased activity to provide, tools and techniques for discovery in databases. This
book is the first to bring together leading-edge research from around the world on this topic. It
spans many different approaches to discovery, including inductive learning, Bayesian statistics,
semantic query optimization, knowledge acquisition for expert systems, information theory, and
fuzzy sets.
The book is aimed at those interested or involved in computer science and the
management of data, to both inform and inspire further research and applications. It will be of
particular interest to professionals working with databases and management information systems
and to those applying machine learning to real-world problems.
What Is Knowledge Discovery? There is an immense diversity of current research on
knowledge discovery in databases. To provide a point of reference for this research, we begin
here by defining and explaining relevant terms.
Definition of Knowledge Discovery:
Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and
potentially useful information from data. Given a set of facts (data) F, a language L, and some
measure of certainty C, we define a pattern as a statement S in L that describes relationships
among a subset FS of F with a certainty c, such that S is simpler (in some sense) than the
enumeration of all facts in FS.
A pattern that is interesting (according to a user-imposed interest measure) and certain
enough (again according to the user’s criteria) is called knowledge. The output of a program that
monitors the set of facts in a database and produces patterns in this sense is discovered
knowledge. These definitions about the language, the certainty, and the simplicity and
interestingness measures are intentionally vague to cover a wide variety of approaches.
Collectively, these terms capture our view of the fundamental characteristics of discovery in
databases. In the following paragraphs, we summarize the connotations of these terms and
suggest their relevance to the problem of knowledge discovery in databases.
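The definition above can be made concrete with a toy example: the facts are attribute sets, the pattern language is single-antecedent rules "X implies Y", and the certainty measure C is rule confidence. All attribute names and the threshold are illustrative, not from the paper.

```python
# Each fact is a set of attributes observed together in one record.
facts = [
    {"smoker", "high_bp", "disease"},
    {"smoker", "disease"},
    {"smoker", "high_bp"},
    {"high_bp", "disease"},
    {"smoker", "high_bp", "disease"},
]

def confidence(antecedent, consequent, facts):
    """Certainty of the pattern 'antecedent implies consequent':
    the fraction of covering facts that also contain the consequent."""
    covering = [f for f in facts if antecedent in f]
    if not covering:
        return 0.0
    return sum(consequent in f for f in covering) / len(covering)

c = confidence("smoker", "disease", facts)
# The rule counts as discovered "knowledge" only if its certainty
# clears a user-imposed interestingness threshold.
print(c, c >= 0.7)
```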
They noted that one of the earliest discovery processes was encountered by Jonathan
Swift’s Gulliver in his visit to the Academy of Lagado. The “Project for improving speculative
Knowledge by practical and mechanical operations” was generating sequences of words by
random permutations and “where they found three or four Words that might make Part of a
Sentence, they dictated them to . . . Scribes.” This process, although promising to produce many
interesting sentences in the (very) long run, is rather inefficient and was recently proved to be
NP-hard.

2.3 ASSOCIATIVE CLASSIFICATION APPROACH FOR DIAGNOSING CARDIOVASCULAR DISEASE
AUTHORS
KIYONG NOH
HEON GYU LEE
HO-SUN SHON
BUM JU LEE
KEUN HO RYU
In this paper [3] the authors stated that ECG is a test that measures a heart’s electrical
activity, which provides valuable clinical information about the heart’s status. In this paper, they
proposed a classification method for extracting multi-parametric features by analyzing HRV
from ECG, data preprocessing and heart disease pattern. The proposed method is an associative
classifier based on the efficient FP-growth method. Since the volume of patterns produced can be large, they offered a rule cohesion measure that enables strong pruning of patterns during the pattern-generating process. They conducted an experiment with the associative classifier, which utilizes multiple rules, pruning, and biased confidence (or cohesion measure), on a dataset consisting of 670 participants distributed into two groups, namely normal people and patients with coronary artery disease.
The most widely used signal in clinical practice is ECG (Electrocardiogram), which is
frequently recorded and widely used for the assessment of cardiac function. ECG processing
techniques have been proposed to support pattern recognition, parameter extraction, spectro-temporal techniques for the assessment of the heart’s status, denoising, baseline correction and
arrhythmia detection. Control of the heart rate is known to be affected by the sympathetic and
parasympathetic nervous system. It is reported that Heart Rate Variability (HRV) is related to
autonomic nerve activity and is used as a clinical tool to diagnose cardiac autonomic function in
both health and disease. This paper provides a classification technique that could automatically
diagnose Coronary Artery Disease (CAD) under the framework of ECG patterns and clinical
investigations. Through the ECG pattern we are able to recognize the features that could well
reflect either the existence or non-existence of CAD. Such features can be perceived through
HRV analysis based on the following knowledge:
1. In patients with CAD, reduction of the cardiac vagal activity evaluated by spectral
HRV analysis was found to correlate with the angiographic severity.
2. The reduction of variance (standard deviation of all normal RR intervals) and the low-frequency power of HRV seem related to an increase in chronic heart failure.
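The first of the two HRV markers listed above, the standard deviation of all normal RR intervals (commonly called SDNN), is straightforward to compute. The RR interval values below are invented for illustration; only the direction of the comparison matters.

```python
import math

def sdnn(rr_intervals):
    """Sample standard deviation of normal RR intervals (SDNN), in ms."""
    n = len(rr_intervals)
    mean = sum(rr_intervals) / n
    return math.sqrt(sum((x - mean) ** 2 for x in rr_intervals) / (n - 1))

healthy = [812, 790, 850, 801, 830, 795, 845]   # more beat-to-beat variability
impaired = [760, 758, 762, 759, 761, 760, 758]  # flattened variability

print(round(sdnn(healthy), 1), round(sdnn(impaired), 1))
```

A reduced SDNN, as in the second series, is the kind of feature the classifier above would pick up on.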

They concluded that they investigated the effectiveness and accuracy of a classification method for diagnosing ECG patterns. To achieve this purpose, they introduced an associative classifier that extends CMAR by using a cohesion measure to prune redundant rules.
Their classification method uses multiple rules to predict the highest probability classes
for each record. The proposed associative classifier can also relax the independence assumption
of some classifiers, such as NB (Naive Bayesian) and DT (Decision Tree). For example, the NB
makes the assumption of conditional independence, that is, given the class label of a sample, the
values of the attributes are conditionally independent of one another. When the assumption holds
true, then the NB is the most accurate in comparison with all other classifiers. In practice,
however, dependences can exist between variables of the real data. Their classifier can consider
the dependences of linear characteristics of HRV and clinical information. Finally, they
implemented their classifier and several different classifiers to validate their accuracy in
diagnosing heart disease.

2.4 INTELLIGENT HEART DISEASE PREDICTION SYSTEM USING CANFIS AND GENETIC ALGORITHM
AUTHORS
LATHA PARTHIBAN
R.SUBRAMANIAN
In this paper [4] the authors stated that Heart disease (HD) is a major cause of morbidity
and mortality in the modern society. Medical diagnosis is an important but complicated task that
should be performed accurately and efficiently and its automation would be very useful. All
doctors are unfortunately not equally skilled in every subspecialty, and in many places they are a
scarce resource. A system for automated medical diagnosis would enhance medical care and
reduce costs. In this paper, a new approach based on coactive neuro-fuzzy inference system
(CANFIS) was presented for prediction of heart disease. The proposed CANFIS model
combined the neural network adaptive capabilities and the fuzzy logic qualitative approach
which is then integrated with genetic algorithm to diagnose the presence of the disease.
The performances of the CANFIS model were evaluated in terms of training
performances and classification accuracies and the results showed that the proposed CANFIS
model has great potential in predicting the heart disease.
A major challenge facing healthcare organizations (hospitals, medical centers) is the
provision of quality services at affordable costs. Quality service implies diagnosing patients
correctly and administering treatments that are effective.
Poor clinical decisions can lead to disastrous consequences which are therefore
unacceptable. Clinical decisions are often made based on doctors’ intuition and experience rather
than on the knowledge rich data hidden in the database. This practice leads to unwanted biases,
errors and excessive medical costs, which affect the quality of service provided to patients. Wu,
et al proposed that integration of clinical decision support with computer-based patient records
could reduce medical errors, enhance patient safety, decrease unwanted practice variation, and
improve patient outcome [12].
Most hospitals today employ some sort of hospital information systems to manage their
healthcare or patient data [13]. Unfortunately, these data are rarely used to support clinical
decision making. The main objective of this research is to develop a prototype Intelligent Heart
Disease Prediction System with CANFIS and genetic algorithm using historical heart disease
databases to make intelligent clinical decisions which traditional decision support systems
cannot.
The cost of managing HD is a significant economic burden, so prevention of heart disease is a very important step in its management. Prevention of HD can be approached in
many ways including health promotion campaigns, specific protection strategies, life style
modification programs, early detection and good control of risk factors and constant vigilance of
emerging risk factors. The CANFIS model integrates adaptable fuzzy inputs with a modular
neural network to rapidly and accurately approximate complex functions. Fuzzy inference
systems are also valuable, as they combine the explanatory nature of rules (MFs) with the power
of neural networks. These kinds of networks solve problems more efficiently than neural
networks when the underlying function to model is highly variable or locally extreme [14].
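The tuning role the paper assigns to the GA can be illustrated with a bare-bones genetic algorithm. The two-element parameter vector stands in for CANFIS membership-function parameters, and the quadratic fitness function is a placeholder assumption for the real training-error objective; nothing below reproduces the authors' actual implementation.

```python
import random

rnd = random.Random(42)
TARGET = (0.3, 0.7)  # hypothetical optimal parameter values

def fitness(individual):
    """Placeholder objective: negative squared distance to TARGET."""
    return -sum((p - t) ** 2 for p, t in zip(individual, TARGET))

def mutate(individual, scale=0.1):
    """Gaussian perturbation, clamped to the unit interval."""
    return tuple(min(1.0, max(0.0, p + rnd.gauss(0, scale))) for p in individual)

def crossover(a, b):
    """Uniform crossover: each gene comes from one of the two parents."""
    return tuple(rnd.choice(pair) for pair in zip(a, b))

# Evolve a population of candidate parameter vectors.
pop = [(rnd.random(), rnd.random()) for _ in range(20)]
for _ in range(60):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:10]                          # keep the best half
    pop = elite + [mutate(crossover(rnd.choice(elite), rnd.choice(elite)))
                   for _ in range(10)]        # breed the other half

best = max(pop, key=fitness)
print(tuple(round(p, 2) for p in best))
```

In the paper's setting, evaluating `fitness` would mean training CANFIS with the candidate parameters and measuring its error.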
They concluded that, from their studies, they managed to achieve their research objectives. The available heart disease dataset from the UCI Machine Learning Repository was studied, preprocessed and cleaned to prepare it for the classification process. Coactive neuro-fuzzy modeling was proposed as a dependable and robust method for identifying the nonlinear relationships and mappings between the different attributes. It was shown that the GA is a very useful technique for auto-tuning the CANFIS parameters and selecting an optimal feature set. The fact remains that computers cannot replace humans, and by comparing the
computer-aided detection results with the pathologic findings, doctors can learn more about the
best way to evaluate areas that computer aided detection highlights.
2.5 INTELLIGENT HEART DISEASE PREDICTION SYSTEM USING DATA MINING
TECHNIQUES
AUTHORS
SELLAPPAN PALANIAPPAN
RAFIAH AWANG
In this paper [5] the authors stated that the healthcare industry collects huge amounts of
healthcare data which, unfortunately, are not “mined” to discover hidden information for
effective decision making. Discovery of hidden patterns and relationships often goes
unexploited. Advanced data mining techniques can help remedy this situation. This research has
developed a prototype Intelligent Heart Disease Prediction System (IHDPS) using data mining
techniques, namely, Decision Trees, Naïve Bayes and Neural Network.
Results show that each technique has its unique strength in realizing the objectives of the
defined mining goals. IHDPS can answer complex “what if” queries which traditional decision
support systems cannot. Using medical profiles such as age, sex, blood pressure and blood sugar
it can predict the likelihood of patients getting a heart disease. It enables significant knowledge,
e.g. patterns, relationships between medical factors related to heart disease, to be established.
IHDPS is Web-based, user-friendly, scalable, reliable and expandable. It is implemented on
the .NET platform.
A major challenge facing healthcare organizations (hospitals, medical centers) is the
provision of quality services at affordable costs. Quality service implies diagnosing patients
correctly and administering treatments that are effective. Poor clinical decisions can lead to
disastrous consequences which are therefore unacceptable. Hospitals must also minimize the cost
of clinical tests. They can achieve these results by employing appropriate computer-based
information and/or decision support systems.
Most hospitals today employ some sort of hospital information systems to manage their
healthcare or patient data [15]. These systems typically generate huge amounts of data which
take the form of numbers, text, charts and images. Unfortunately, these data are rarely used to
support clinical decision making. There is a wealth of hidden information in these data that is
largely untapped. This raises an important question: “How can we turn data into useful
information that can enable healthcare practitioners to make intelligent clinical decisions?” This
is the main motivation for this research.
Many hospital information systems are designed to support patient billing, inventory
management and generation of simple statistics. Some hospitals use decision support systems,
but they are largely limited. They can answer simple queries like “What is the average age of
patients who have heart disease?”, “How many surgeries had resulted in hospital stays longer
than 10 days?”, “Identify the female patients who are single, above 30 years old, and who have
been treated for cancer.” However, they cannot answer complex queries like “Identify the
important preoperative predictors that increase the length of hospital stay”, “Given patient
records on cancer, should treatment include chemotherapy alone, radiation alone, or both
chemotherapy and radiation?”, and “Given patient records, predict the probability of patients
getting a heart disease.”
Clinical decisions are often made based on doctors’ intuition and experience rather than
on the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors
and excessive medical costs, which affect the quality of service provided to patients. Wu, et al
proposed that integration of clinical decision support with computer-based patient records could
reduce medical errors, enhance patient safety, decrease unwanted practice variation, and improve
patient outcome [16]. This suggestion is promising as data modelling and analysis tools, e.g.,
data mining, have the potential to generate a knowledge-rich environment which can help to
significantly improve the quality of clinical decisions.
The main objective of this research is to develop a prototype Intelligent Heart Disease
Prediction System (IHDPS) using three data mining modeling techniques, namely, Decision
Trees, Naïve Bayes and Neural Network. IHDPS can discover and extract hidden knowledge
(patterns and relationships) associated with heart disease from a historical heart disease database.
It can answer complex queries for diagnosing heart disease and thus assist healthcare
practitioners to make intelligent clinical decisions which traditional decision support systems
cannot. By providing effective treatments, it also helps to reduce treatment costs. To enhance
visualization and ease of interpretation, it displays the results both in tabular and graphical forms.
IHDPS uses the CRISP-DM methodology to build the mining models. It consists of six
major phases: business understanding, data understanding, data preparation, modeling,
evaluation, and deployment. Business understanding phase focuses on understanding the
objectives and requirements from a business perspective, converting this knowledge into a data
mining problem definition, and designing a preliminary plan to achieve the objectives.
Data understanding phase uses the raw data and proceeds to understand the data,
identify its quality, gain preliminary insights, and detect interesting subsets to form hypotheses
for hidden information. Data preparation phase constructs the final dataset that will be fed into
the modeling tools. This includes table, record, and attribute selection as well as data cleaning
and transformation. The modeling phase selects and applies various techniques, and calibrates
their parameters to optimal values. The evaluation phase evaluates the model to ensure that it
achieves the business objectives. The deployment phase specifies the tasks that are needed to use
the models [17]. Data Mining Extension (DMX), a SQL-style query language for data mining, is
used for building and accessing the models’ contents. Tabular and graphical visualizations are
incorporated to enhance analysis and interpretation of results.
A total of 909 records with 15 medical attributes (factors) were obtained from the
Cleveland Heart Disease database [18]. Figure 2.1 lists the attributes. The records were split
equally into two datasets: training dataset (455 records) and testing dataset (454 records). To
avoid bias, the records for each set were selected randomly. For the sake of consistency, only
categorical attributes were used for all the three models. All the non-categorical medical
attributes were transformed to categorical data. The attribute “Diagnosis” was identified as the
predictable attribute with value “1” for patients with heart disease and value “0” for patients with
no heart disease. The attribute “PatientID” was used as the key; the rest are input attributes. It is
assumed that problems such as missing data, inconsistent data, and duplicate data have all been
resolved.
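The preparation step described above can be sketched as follows. The cholesterol cut-offs and the synthetic records are illustrative assumptions, not values from the Cleveland database, but the random 455/454 split mirrors the one reported.

```python
import random

def bin_cholesterol(mg_dl):
    """Transform a numeric attribute to categorical data (cut-offs illustrative)."""
    if mg_dl < 200:
        return "normal"
    if mg_dl < 240:
        return "borderline"
    return "high"

rnd = random.Random(7)

# 909 synthetic records standing in for the Cleveland rows.
records = [{"PatientID": i,
            "Chol": bin_cholesterol(rnd.randint(120, 320)),
            "Diagnosis": rnd.randint(0, 1)}  # 1 = heart disease, 0 = none
           for i in range(909)]

# Shuffle to avoid bias, then split into training and testing datasets.
rnd.shuffle(records)
train, test = records[:455], records[455:]
print(len(train), len(test))
```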
Mining models:
Data Mining Extension (DMX) query language was used for model creation, model
training, model prediction and model content access. All parameters were set to the default
setting except for parameters “Minimum Support = 1” for Decision Tree and “Minimum
Dependency Probability = 0.005” for Naïve Bayes [19]. The trained models were evaluated
against the test datasets for accuracy and effectiveness before they were deployed in IHDPS. The
models were validated using Lift Chart and Classification Matrix.
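A classification matrix (confusion matrix) of the kind used to validate the models tabulates predicted against actual diagnoses; accuracy is the fraction on the diagonal. The labels below are made-up examples, not results from the paper:

```python
# Made-up test labels: 1 = heart disease, 0 = no heart disease.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

def classification_matrix(actual, predicted):
    """Count true/false positives and negatives for a binary predictor."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

tp, tn, fp, fn = classification_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```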
They concluded with a prototype heart disease prediction system developed using three
data mining classification modeling techniques. The system extracts hidden knowledge from a
historical heart disease database, and the DMX query language and functions are used to build
and access the models.
Predictable attribute
1. Diagnosis (value 0: < 50% diameter narrowing (no heart disease); value 1: > 50%
diameter narrowing (has heart disease))

Key attribute
1. PatientID – Patient’s identification number

Input attributes
1. Sex (value 1: Male; value 0 : Female)
2. Chest Pain Type (value 1: typical angina; value 2: atypical angina; value 3:
non-anginal pain; value 4: asymptomatic)
3. Fasting Blood Sugar (value 1: > 120 mg/dl; value 0: <= 120 mg/dl)
4. Restecg – resting electrocardiographic results (value 0: normal; value 1: having ST-T
wave abnormality; value 2: showing probable or definite left ventricular hypertrophy)
5. Exang – exercise induced angina (value 1: yes; value 0: no)
6. Slope – the slope of the peak exercise ST segment (value 1: upsloping; value 2: flat;
value 3: downsloping)
7. CA – number of major vessels colored by fluoroscopy (value 0 – 3)
8. Thal (value 3: normal; value 6: fixed defect; value 7: reversible defect)
9. Trestbps – resting blood pressure (mm Hg on admission to the hospital)
10. Serum Cholesterol (mg/dl)
11. Thalach – maximum heart rate achieved
12. Oldpeak – ST depression induced by exercise relative to rest
13. Age in years

FIGURE 2.1 DESCRIPTION OF ATTRIBUTES

The models are trained and validated against a test dataset. Lift Chart and Classification
Matrix methods are used to evaluate the effectiveness of the models. All three models are able to
extract patterns in response to the predictable state. The most effective model to predict patients
with heart disease appears to be Naïve Bayes followed by Neural Network and Decision Trees.
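A lift chart of the kind used in the comparison orders the test cases by the model's predicted probability of heart disease and measures how many actual positives are captured relative to a random baseline. A minimal sketch, with illustrative scores rather than results from the paper:

```python
# (predicted probability of heart disease, actual label) pairs - made up.
scores = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.40, 1), (0.30, 0), (0.20, 0)]

def lift_points(scored):
    """Lift at each depth: fraction of positives captured so far divided
    by the fraction of cases examined so far (random baseline = 1.0)."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    total_pos = sum(label for _, label in ranked)
    captured, points = 0, []
    for i, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((captured / total_pos) / (i / len(ranked)))
    return points

print(lift_points(scores)[0])  # top-ranked case captures positives at lift 2.0
```

A model whose curve stays well above 1.0 in the early deciles concentrates true heart disease cases at the top of its ranking, which is the behavior the authors report for Naïve Bayes.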
Five mining goals are defined based on business intelligence and data exploration. The
goals are evaluated against the trained models. All three models could answer complex queries,
each with its own strength with respect to ease of model interpretation, access to detailed
information and accuracy. Naïve Bayes could answer four out of the five goals; Decision Trees,
three; and Neural Network, two. Although not the most effective model, Decision Trees results
are easier to read and interpret. The drill through feature to access detailed patients’ profiles is
only available in Decision Trees.
Naïve Bayes fared better than Decision Trees as it could identify all the significant
medical predictors. The relationship between attributes produced by Neural Network is more
difficult to understand. IHDPS can be further enhanced and expanded. For example, it can
incorporate other medical attributes besides the 15 listed in Figure 2.1. It can also incorporate
other data mining techniques, e.g., Time Series, Clustering and Association Rules. Continuous
data can also be used instead of just categorical data. Another area is to use Text Mining to mine
the vast amount of unstructured data available in healthcare databases. Another challenge would
be to integrate data mining and text mining [8].
2.6 INTELLIGENT AND EFFECTIVE HEART ATTACK PREDICTION SYSTEM
USING DATA MINING AND ARTIFICIAL NEURAL NETWORK
AUTHORS
B.P. SHANTHAKUMAR
Y.S.KUMARASWAMY

In this paper [6] the authors stated that the diagnosis of diseases is a vital and intricate job
in medicine. The recognition of heart disease from diverse features or signs is a multi-layered
problem that is not free from false assumptions and is frequently accompanied by unpredictable
effects. Thus, exploiting the knowledge and experience of several specialists, together with the
clinical screening data of patients collected in databases, to assist the diagnosis procedure is
regarded as a valuable option. The work extends the authors’ previous research on an intelligent
and effective heart attack prediction system using a neural network. A proficient methodology for
the extraction of significant patterns from the heart disease warehouses for heart attack
prediction has been presented.
Initially, the data warehouse is pre-processed to make it suitable for the mining process.
Once preprocessing is complete, the heart disease warehouse is clustered with the K-means
clustering algorithm, which extracts the data relevant to heart attack from the warehouse. The
frequent patterns applicable to heart disease are then mined from the extracted data with the
MAFIA algorithm. Next, the patterns vital to heart attack prediction are selected on the basis of
a computed significance weighting, and a neural network is trained with the selected significant
patterns for effective prediction of heart attack. The authors employed a Multi-Layer Perceptron
with back-propagation as the training algorithm. The results obtained illustrate that the designed
prediction system is capable of predicting heart attack effectively.
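MAFIA itself mines maximal frequent itemsets using bitmap compression and aggressive pruning; as a rough illustration of what "frequent patterns" means in this pipeline, a brute-force maximal frequent itemset miner over hypothetical symptom transactions might look like the following (the transactions and support threshold are invented for the example):

```python
from itertools import combinations

# Made-up symptom sets per patient; NOT data from the paper.
transactions = [
    {"chest_pain", "high_bp", "high_chol"},
    {"chest_pain", "high_bp"},
    {"chest_pain", "high_chol"},
    {"high_bp", "smoking"},
]
min_support = 2  # frequent = occurs in at least 2 transactions

# Enumerate every candidate itemset and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = []
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        if sum(set(cand) <= t for t in transactions) >= min_support:
            frequent.append(set(cand))

# Maximal frequent itemsets: those with no frequent proper superset.
maximal = [f for f in frequent if not any(f < g for g in frequent)]
print(maximal)  # [{'chest_pain', 'high_bp'}, {'chest_pain', 'high_chol'}]
```

Real MAFIA avoids this exponential enumeration, but the output notion is the same: the maximal patterns summarize all frequent ones, which is what makes them compact inputs for the subsequent weighting and neural network training steps.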
2.7 PERFORMANCE EVALUATION OF K-MEANS AND HIERARCHICAL
CLUSTERING IN TERMS OF ACCURACY AND RUNNING TIME
AUTHORS
NIDHI SINGH
DIVAKAR SINGH
In this paper [7] the authors observed that, with the many modern techniques now
available for scientific data collection, large amounts of data are accumulating in various
databases. Systematic data analysis methods are needed to extract useful information from these
rapidly growing databanks. Cluster analysis is one of the main analytical methods in data
mining, and the k-means algorithm is the most widely used clustering algorithm in practice.
Clustering algorithms fall into two categories: partition and hierarchical. The paper discusses
one partition clustering algorithm (k-means) and one hierarchical clustering algorithm
(agglomerative). The k-means algorithm has high efficiency and scalability and converges
quickly on large datasets. Hierarchical clustering constructs a hierarchy of clusters by either
repeatedly merging two smaller clusters into a larger one or splitting a larger cluster into smaller
ones. Using the WEKA data mining tool, the authors compared the performance of k-means and
hierarchical clustering in terms of accuracy and running time.
As a result of modern methods for scientific data collection, huge quantities of data are
accumulating in various databases. Cluster analysis [9] is one of the major data analysis
methods; it identifies the natural groupings in a set of data items. The K-means clustering
algorithm, proposed by MacQueen in 1967, is a partition-based cluster analysis method.
Clustering classifies raw data reasonably and searches for hidden patterns that may exist in
datasets [10]. It is a process of grouping data objects into disjoint clusters so that objects in the
same cluster are similar, while objects belonging to different clusters differ. K-means is a
numerical, unsupervised, non-deterministic, iterative method. It is simple and very fast, so in
many practical applications it proves to be an effective way to produce good clustering results.
The demand for organizing rapidly growing data and learning valuable information from it has
made clustering techniques widely applied in many areas, such as artificial intelligence, biology,
customer relationship management, data compression, data mining, information retrieval, image
processing, machine learning, marketing, medicine, pattern recognition, psychology and
statistics.
Clustering methods can be divided into two general classes, designated supervised and
unsupervised clustering. In this paper, we focus on unsupervised clustering which may again be
separated into two major categories: partition clustering and hierarchical clustering. There are
many algorithms in the partition clustering category, such as k-means clustering (MacQueen
1967), k-medoid clustering, the genetic k-means algorithm (GKA), the Self-Organizing Map
(SOM) and graph-theoretical methods (CLICK, CAST). Among these, k-means clustering is the
most popular because of its simple algorithm and fast execution speed. Hierarchical clustering
methods are among the first methods developed and analyzed for clustering problems. There are
two main approaches. (i) The agglomerative approach, which builds a larger cluster by merging
two smaller clusters in a bottom-up fashion. The clusters so constructed form a binary tree;
individual objects are the leaf nodes and the root node is the cluster that has all data objects. (ii)
The divisive approach, which splits a cluster into two smaller ones in a top-down fashion. All
clusters so constructed also form a binary tree.
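The agglomerative approach described in (i) can be sketched in a few lines: repeatedly merge the two closest clusters (here with single linkage on 1-D points, both choices being illustrative assumptions) until k clusters remain.

```python
# Illustrative 1-D data: two tight groups and one outlier.
points = [1.0, 1.2, 5.0, 5.1, 9.0]

def agglomerative(points, k):
    """Bottom-up clustering: start with singleton leaves, merge the
    closest pair (single-link distance) until k clusters remain."""
    clusters = [[p] for p in points]          # each object starts as a leaf
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the two closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]            # merge cluster j into cluster i
        del clusters[j]
    return clusters

print(agglomerative(points, 3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Recording the order of merges instead of stopping at k would yield the binary tree (dendrogram) the text describes, with individual objects as leaves and the all-inclusive cluster at the root.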
The process of the k-means algorithm: this part briefly describes the standard k-means
algorithm. K-means is a typical clustering algorithm in data mining, widely used for clustering
large datasets, and one of the simplest unsupervised learning algorithms for the well-known
clustering problem. As a partitioning method, it classifies the given data objects into k different
clusters iteratively, converging to a local minimum, so that the generated clusters are compact
and independent. The algorithm consists of two separate phases. The first phase selects k centers
randomly, where the value of k is fixed in advance. The second phase assigns each data object
to its nearest center, with Euclidean distance generally used to measure the distance between
each data object and the cluster centers. When all the data objects have been assigned to
clusters, the first pass is complete and an early grouping is done; the averages of the early-
formed clusters are then recalculated. This iterative process continues until the criterion
function reaches its minimum. Supposing that the target object is x and xi indicates the mean of
cluster Ci, the criterion function is defined as follows (eq. 1):

E = Σ (i = 1 to k) Σ (x ∈ Ci) |x − xi|²        (1)

where E is the sum of the squared error of all objects in the database.
The distance in the criterion function is the Euclidean distance, which is used to determine
the nearest distance between each data object and the cluster centers. For one vector
x = (x1, x2, …, xn) and another vector y = (y1, y2, …, yn), the Euclidean distance d(x, y) is
obtained as follows:

d(x, y) = sqrt( Σ (i = 1 to n) (xi − yi)² )        (2)
The k-means clustering algorithm always converges to a local minimum. Before the
algorithm converges, the calculations of distances and cluster centers are repeated in a loop
executed t times, where the positive integer t is the number of k-means iterations. The precise
value of t varies depending on the initial cluster centers, since the distribution of data points
affects the new cluster centers. The computational time complexity of the k-means algorithm is
therefore O(nkt), where n is the number of data objects, k is the number of clusters, and t is the
number of iterations, usually with k << n and t << n [1]. K-means was chosen for study because
of its popularity, for the following reasons:
Its time complexity is O(nkl), where n is the number of patterns, k is the number of
clusters, and l is the number of iterations taken by the algorithm to converge.
It is order independent: for a given initial seed set of cluster centers, it generates the same
partition of the data irrespective of the order in which the patterns are presented to the algorithm.
Its space complexity is O(n + k); it requires additional space to store the data matrix.
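The two phases described above, together with the squared-error criterion (1) and the Euclidean distance (2), can be sketched in a few lines. The 1-D data and the random initialization are illustrative assumptions:

```python
import random

def kmeans(data, k, iters=100, seed=0):
    """Minimal 1-D k-means: phase 1 picks k random centers, phase 2
    assigns each object to its nearest center and recomputes the means,
    repeating until the assignment stabilizes (a local minimum)."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)            # phase 1: k random centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:                       # phase 2: nearest-center assignment
            i = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[i].append(x)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:                   # converged to a local minimum
            break
        centers = new
    # E (eq. 1): sum of squared distances of every object to its cluster mean
    E = sum(min((x - c) ** 2 for c in centers) for x in data)
    return centers, E

data = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
centers, E = kmeans(data, 2)
print(sorted(centers), E)
```

On these well-separated points the algorithm recovers the two group means (about 1.0 and 8.0); with a less favorable initialization on harder data it can settle in a poorer local minimum, which is exactly the sensitivity to initial centers discussed above.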
REFERENCES

[1] Franck Le Duff, CristianMunteanb, Marc Cuggiaa and Philippe Mabob, “Predicting Survival
Causes After Out of Hospital Cardiac Arrest using Data Mining Method”, Studies in Health
Technology and Informatics, Vol. 107, No. 2, pp. 1256-1259, 2004.
[2] W.J. Frawley and G. Piatetsky-Shapiro, “Knowledge Discovery in Databases: An Overview”,
AI Magazine, Vol. 13, No. 3, pp. 57-70, 1996.
[3] Kiyong Noh, HeonGyu Lee, Ho-Sun Shon, Bum Ju Lee and Keun Ho Ryu, “Associative
Classification Approach for Diagnosing Cardiovascular Disease”, Intelligent Computing in
Signal Processing and Pattern Recognition, Vol. 345, pp. 721-727, 2006.
[4] Latha Parthiban and R. Subramanian, “Intelligent Heart Disease Prediction System using
CANFIS and Genetic Algorithm”, International Journal of Biological, Biomedical and Medical
Sciences, Vol. 3, No. 3, pp. 1-8, 2008.
[5] Sellappan Palaniappan and Rafiah Awang, “Intelligent Heart Disease Prediction System
using Data Mining Techniques”, International Journal of Computer Science and Network
Security, Vol. 8, No. 8, pp. 1-6, 2008
[6] Shantakumar B. Patil and Y.S. Kumaraswamy, “Intelligent and Effective Heart Attack
Prediction System using Data Mining and Artificial Neural Network”, European Journal of
Scientific Research, Vol. 31, No. 4, pp. 642-656, 2009.
[7] Nidhi Singh and Divakar Singh, “Performance Evaluation of K-Means and Hierarchical
Clustering in Terms of Accuracy and Running Time”, Ph.D. Dissertation, Department of
Computer Science and Engineering, Barkatullah University Institute of Technology, 2012.
[8] Weiguo, F., Wallace, L., Rich, S., Zhongju, Z.: “Tapping the Power of Text Mining”,
Communication of the ACM. 49(9), 77-82, 2006.
[9] Jiawei Han M. K, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, An
Imprint of Elsevier, 2006
[10] Huang Z, “Extensions to the k-means algorithm for clustering large data sets with
categorical values,” Data Mining and Knowledge Discovery, Vol.2, pp:283–304, 1998
[11] The Antiarrhythmics versus Implantable Defibrillators (AVID) Investigators, “A
Comparison of Antiarrhythmic-Drug Therapy with Implantable Defibrillators in Patients
Resuscitated from Near-Fatal Ventricular Arrhythmias”, N Engl J Med, 337(22), 1576-1583,
1997.
[12] R.Wu, W.Peters, M.W.Morgan, “The Next Generation Clinical Decision Support: Linking
Evidence to Best Practice”, Journal of Healthcare Information Management. 16(4), pp. 50-55,
2002.
[13] Mary K.Obenshain, “Application of Data Mining Techniques to Healthcare Data”, Infection
Control and Hospital Epidemiology, vol. 25, no.8, pp. 690–695, Aug. 2004.
[14] G. Camps-Valls, L. Gomez-Chova, J. Calpe-Maravilla, J.D. Martin-Guerrero, E. Soria-
Olivas, L. Alonso-Chorda and J. Moreno, “Robust Support Vector Method for Hyperspectral
Data Classification and Knowledge Discovery”, IEEE Trans. Geosci. Remote Sens., Vol. 42,
No. 7, pp. 1530-1542, July 2004.
[15] Obenshain, M.K: “Application of Data Mining Techniques to Healthcare Data”, Infection
Control and Hospital Epidemiology, 25(8), 690–695, 2004.
[16] Wu, R., Peters, W., Morgan, M.W.: “The Next Generation Clinical Decision Support:
Linking Evidence to Best Practice”, Journal Healthcare Information Management. 16(4), 50-55,
2002.
[17] Charly, K.: “Data Mining for the Enterprise”, 31st Annual Hawaii Int. Conf. on System
Sciences, IEEE Computer, 7, 295-304, 1998.
[18] Blake, C.L., Mertz, C.J.: “UCI Machine Learning Databases”,
http://mlearn.ics.uci.edu/databases/heart-disease/, 2004.
[19] Mohd, H., Mohamed, S. H. S.: “Acceptance Model of Electronic Medical Record”, Journal
of Advancing Information and Management Studies. 2(1), 75-92, 2005.
