
BATCH_13_HEART DISEASE PREDICTION

ABSTRACT:

We chose this problem because cardiovascular (heart) diseases are currently the leading
cause of death worldwide. According to a World Health Organization (WHO) report, heart
diseases cause 17.9 million deaths per year. The risk of heart disease is increased by
conditions such as high cholesterol, obesity, hypertension, and diabetes, and the disease is
accompanied by common symptoms such as sleep problems, irregular heartbeat, swollen legs,
and weight gain. Because these symptoms can also arise from other causes, it is very
difficult to diagnose heart disease accurately.
Various computer technologies can now be used to support the correct diagnosis of patients,
so that the disease can be detected early and treated. Hybrid machine learning, deep
learning, machine learning (ML) techniques such as Naive Bayes, random forest, K-nearest
neighbours (KNN), decision trees, and logistic regression, and various data mining
techniques have been used to predict heart-related diseases. These techniques have some
drawbacks: the predictions of cardiovascular disease are not always accurate, data mining
techniques do not help to provide effective decision making, and they cannot handle
enormous datasets of patient records. Our proposed system uses logistic regression, a
statistical and machine learning technique that classifies the records of a dataset based on
the values of the input fields. We used logistic regression to predict heart disease because
it is easy to implement, efficient to train, and fast at classifying unknown records. The
main aim of our project is to predict heart disease efficiently.
We evaluated the model using a confusion matrix, accuracy, cross-validation score,
classification report, log-loss, and F1 score. The confusion matrix shows 20 true positives,
28 true negatives, 4 false positives, and 8 false negatives, which is the assignment of the
four reported counts (28, 4, 8, 20) that is consistent with the reported accuracy and F1
score. We obtained a cross-validation score of 85.77%, an accuracy of 80%, an F1 score of
76.92%, a classification report figure of 78%, and a log-loss of 0.5354 (reported as
53.54%). We used the Heart Disease UCI dataset from Kaggle, on which logistic regression
reached an accuracy of 80.00%. The feature extraction steps played a crucial role in
reaching this accuracy. In our assessment, the project compares favourably with previous
research in terms of accuracy and performance measurement, and it reduces the time doctors
need to assess heart disease. As future work, we may investigate deep learning to obtain
better results.

KEYWORDS:

• Supervised
• Logistic regression
• Confusion matrix
• Accuracy
• Prediction
• Cross-validation score
• Classification report
• F1-score
• Log-loss

INTRODUCTION:
Machine learning: Machine learning (ML) is the scientific study of algorithms and
statistical models that computer systems use to perform a specific task without explicit
instructions, relying on patterns and inference instead. ML is a subset of artificial
intelligence (AI).
The heart plays an important role in our body. It is located behind and slightly to the left
of the breastbone. Heart disease is also called cardiovascular disease. Nowadays, heart
disease is among the most life-threatening diseases in the world. The risk is higher for
people with conditions such as high blood pressure, obesity, high cholesterol, and diabetes.
The heart disease dataset contains features such as age, sex, blood pressure, cholesterol,
and chest pain type.
The dataset contains 14 columns (13 features and 1 target). Such datasets are now available
for analysis and for extracting crucial information. By applying ML algorithms to this data,
heart disease can be predicted at an early stage from the features extracted from the
dataset. Machine learning techniques that have been applied to the Heart Disease UCI dataset
include logistic regression, K-nearest neighbours (KNN), Naive Bayes (NB), and support
vector machines (SVM). We use the logistic regression model for this dataset. Predicting
heart disease here means classifying whether or not a person has cardiovascular disease,
after feature extraction has been applied to the dataset. Different algorithms give
different accuracies and error rates, so they need to be compared to find the algorithm that
predicts heart disease with the highest accuracy. The main goal of the heart disease
prediction project is to predict heart disease efficiently.
Machine Learning Algorithm:
Logistic Regression:
Logistic regression (LR) is an efficient machine learning algorithm that is commonly used
for binary classification. It applies when the dependent variable is categorical. In the
binary case the dependent variable takes two values, such as 0 and 1, pass and fail, or yes
and no. If the dependent variable has more than two unordered outcomes, multinomial logistic
regression is used. If it has multiple ordered categories, ordinal regression is used.
The logistic regression function is
$$P(y = 1 \mid a) = \frac{1}{1 + e^{-wa}}$$
where e is the numerical constant known as Euler's number,
a is the input we put into the function, and
w is the weight applied to that input.
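As a minimal sketch of the function above, the following Python snippet evaluates the logistic probability for a single input. The function name `logistic_probability` and the weight value `w = 1.0` are illustrative assumptions, not values taken from the project.

```python
import math


def logistic_probability(a: float, w: float) -> float:
    """Return P(y = 1 | a) = 1 / (1 + e^(-w * a))."""
    return 1.0 / (1.0 + math.exp(-w * a))


# With w = 1.0 (a hypothetical weight), an input of 0 sits exactly on the
# decision boundary, and larger inputs push the probability towards 1.
for a in (-2.0, 0.0, 2.0):
    print(a, round(logistic_probability(a, w=1.0), 4))
```

A record would typically be assigned to class 1 when this probability is at least 0.5.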
LITERATURE SURVEY:

In [1], Shah et al. note that heart disease, also called cardiovascular disease, has been
the leading cause of death in the world over the past decades, ahead of all other diseases,
even cancer. Several different symptoms are associated with heart disease, and its
diagnosis, usually based on the signs and symptoms of the patient, is a challenging task.
Machine learning techniques can address this by offering an automated prediction of the
patient's heart condition so that further treatment can be effective. The authors predicted
heart disease using the naive Bayes classifier, decision tree, k-nearest neighbour, and
random forest, and reported accuracies of 83.49%, 71.43%, 83.16%, and 91.6%, respectively.
Random forest gave the highest accuracy (91.6%) and the decision tree the lowest (71.43%).

In [2], S. K. J. et al. proposed identifying heart disease using machine learning
techniques. The work deals with the impact of gender on heart disease through a comparison
between males and females. The dataset was collected from the UCI Machine Learning
Repository and contains 14 parameters such as age, sex, and blood pressure. The algorithms
used are the decision tree classifier and the naive Bayes classifier. The decision tree
classifier gave the highest accuracy (91%) and the naive Bayes classifier the lowest (87%).

In [3], Arumugaraj et al. deal with machine learning algorithms such as logistic regression
and naive Bayes for the prediction of heart disease. Logistic regression is built on
conditions that give a True or False (Yes or No) outcome. The other algorithms considered,
naive Bayes, random forest, support vector machine, and gradient boosting, are described as
based on vertical or horizontal split conditions depending on the dependent variables. The
reported accuracies of logistic regression, random forest, naive Bayes, gradient boosting,
and support vector machine are 91.6%, 89.5%, 90.9%, 90.7%, and 88.2%, respectively. Among
these figures, logistic regression is the highest (91.6%) and the support vector machine
the lowest (88.2%).

In [4], Archana Singh et al. note that the heart is one of the most important and vital
organs of the human body, so caring for it is essential. Many patients today die because of
heart diseases, so predicting heart disease is necessary and a comparative study of
prediction methods is needed. Machine learning, a branch of AI, is an efficient technology
based on training and testing: the system is trained to learn how to process and make use
of data, learning from experience. The study used four algorithms: decision tree, linear
regression, k-nearest neighbour, and SVM.

The support vector machine gives an accuracy of 83%, the decision tree 79%, linear
regression 78%, and k-nearest neighbour 87%. K-nearest neighbour is the best among them
with 87% accuracy.

In [5], Mohan, Senthilkumar, et al. state that heart disease is one of the most significant
causes of death in the world today and that predicting cardiovascular (heart) disease is a
critical challenge. Machine learning (ML) is effective in making decisions and predictions
from the large quantity of data produced by the healthcare industry. The paper applies
machine learning techniques to improve the accuracy of cardiovascular disease prediction.
The prediction model is introduced with different combinations of features and several
known classification techniques.

It is difficult to identify heart disease because of several risk factors such as diabetes,
high blood pressure, high cholesterol, abnormal pulse rate, and many other factors. Various
techniques exist to determine the severity of heart disease in humans, and the disease must
be handled carefully. The severity of the disease is classified using naive Bayes,
generalized linear regression, logistic regression, deep learning, decision tree, random
forest, gradient boosting tree, and support vector machine, with accuracies of 75.8%,
85.1%, 82.1%, 87.4%, 85%, 86.1%, 78.3%, and 86.1%, respectively. The highest accuracy is
87.4% using deep learning, and the lowest is 75.8% using naive Bayes.

TABLE 1: ANALYTICAL APPROACH TO EXISTING METHODOLOGIES

S.NO | AUTHOR | ALGORITHMS (ACCURACY) | MERITS | DEMERITS
1 | Shah et al | K-Nearest Neighbor (KNN) (83.16%); Decision trees (DT) (42.89%); Naive Bayes (NB) (83.49%); Random Forest (91.6%) | Increased accuracy; effective for heart patients. | The prediction of heart disease results is not accurate.
2 | S. K. J. et al | Decision trees (DT) (91%); Naive Bayes (NB) (87%) | Reduces the time complexity for doctors. | Cannot handle an enormous dataset for a patient.
3 | Arumugaraj et al | Logistic regression (86.5%); Random forest (80.89%); Naive Bayes (84.26%); Gradient boosting (84.26%); SVM (79.775%) | Handles the roughest amount of data using SVM and feature selection. | Machine learning technique does not help to provide effective decision making.
4 | Archana et al | Linear regression (78%); Support Vector Machine (83%); K-nearest Neighbour (87%); Random forest (86%); SVM (86%) | Relatively memory efficient. | Does not perform very well when the data set has more noise, i.e. target classes are overlapping.
5 | Senthilkumar et al | Decision tree (82.1%); Learning model (85%); Support vector machine (75.8%); Random forest (87.6%) | Cost-effective for the patient. | Training is relatively expensive as the complexity and time taken are more.

PROPOSED METHODOLOGY:

Fig 1: Heart Disease UCI Architecture

Dataset: A dataset is a collection of data.


Fig2: Heart Disease UCI.CSV

Dataset source: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data

The Heart Disease UCI dataset has 14 columns: age, sex, chest pain type, resting blood
pressure, cholesterol, fasting blood sugar, resting ECG, maximum heart rate,
exercise-induced angina, oldpeak, slope, vessels colored by fluoroscopy, thalassemia, and
target.

Preprocessing:

A raw dataset is usually full of errors and null values, so it cannot be used directly to
train the model. Before fitting the logistic regression model, we remove unnecessary
information and erroneous records and fill in the null values.
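The cleaning step can be sketched in plain Python as follows. The text does not say how nulls were filled or which library was used, so the median fill, the `preprocess` helper, and the column names in the toy records are all assumptions made for illustration.

```python
from statistics import median


def preprocess(rows):
    """Drop exact duplicate rows and fill missing values with the column median.

    `rows` is a list of dicts mapping column name -> number or None.
    """
    # Remove exact duplicates while preserving the original order.
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique_rows.append(dict(row))  # copy so the input is not mutated

    # Compute the median of the observed (non-null) values in each column.
    columns = list(unique_rows[0].keys()) if unique_rows else []
    fill_values = {}
    for col in columns:
        observed = [r[col] for r in unique_rows if r[col] is not None]
        fill_values[col] = median(observed) if observed else None

    # Replace nulls with the column median.
    for row in unique_rows:
        for col in columns:
            if row[col] is None:
                row[col] = fill_values[col]
    return unique_rows


# Hypothetical records; the column names are illustrative only.
raw = [
    {"age": 63, "chol": 233},
    {"age": 63, "chol": 233},   # exact duplicate
    {"age": 41, "chol": None},  # missing cholesterol value
    {"age": 57, "chol": 354},
]
clean = preprocess(raw)
print(clean)
```

The median is used here only because it is robust to outliers; a mean or a domain-specific default would fit the same structure.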

Feature Selection:

Dimensionality reduction is one form of feature extraction. Feature selection is the
process of extracting target-related information from the given feature set. Irrelevant
features not only waste compute resources but also introduce unwanted and unnecessary noise
into the data. Principal component analysis (PCA) is a commonly used method for reducing
the number of features.
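To illustrate the idea of discarding features that carry little target-related information, here is a small sketch of a correlation filter. Note that this is a simpler technique than PCA and is not claimed to be the method used in the project; the `select_features` helper, the 0.3 threshold, and the toy columns are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.3):
    """Keep the columns whose |correlation| with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


# Hypothetical toy data: "signal" tracks the target, "noise" does not.
columns = {
    "signal": [1, 2, 3, 4, 5, 6],
    "noise": [5, 1, 4, 4, 1, 5],
}
target = [0, 0, 0, 1, 1, 1]
print(select_features(columns, target))
```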

Logistic Regression:

LR stands for logistic regression. A logistic regression model analyses the relationship
between one or more independent variables and a dependent variable in order to predict the
dependent variable.
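A minimal sketch of how such a model can be fitted is shown below, using batch gradient descent on the log-loss in plain Python. The project does not state its training procedure or library, so the learning rate, the number of epochs, and the one-feature toy data are assumptions chosen only to keep the example small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=1000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # derivative of the log-loss w.r.t. z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(X, weights, bias, threshold=0.5):
    """Classify each row as 1 when the predicted probability reaches the threshold."""
    return [int(sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) >= threshold)
            for row in X]


# Hypothetical one-feature data: negative inputs are class 0, positive are class 1.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
weights, bias = train_logistic_regression(X, y)
print(predict(X, weights, bias))
```

On real data the features would normally be scaled first so that a single learning rate suits every column.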

Splitting data:
1) Training data

2) Testing data
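The split into training and testing data can be sketched as below. The split ratio is not given in the text, so the 20% test fraction, the seed, and the `train_test_split` helper are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# 100 placeholder record ids stand in for patient records.
records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set made up only of the last records in the file.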

Prediction :

The trained model is applied to the testing data to check whether it has been trained well.

Evaluating the model:

A)Cross-validation score:

We train the model on a subset of the data and evaluate it on the complementary subset,
repeating this over several splits. The cross-validation score is 85.77%.
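The following sketch shows one way to generate the complementary subsets for k-fold cross-validation and average a score over them. The number of folds used in the project is not stated, so `k = 5` and the helper names are assumptions; `evaluate` is a hypothetical callback that trains on one index set and scores on the other.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validation_score(n_samples, evaluate, k=5):
    """Average the score returned by `evaluate(train_idx, test_idx)` over k folds."""
    scores = [evaluate(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Show the held-out indices of each fold for a tiny dataset of 10 samples.
folds = list(k_fold_indices(10, k=5))
for train_idx, test_idx in folds:
    print(test_idx)
```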

B)Accuracy:

Accuracy is the ratio of correct predictions (true positives and true negatives) to all
observations. The accuracy is 80%.
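As a worked check, the snippet below computes accuracy from four confusion-matrix counts. It assumes the reported counts 28, 4, 8, and 20 are read as TN = 28, FP = 4, FN = 8, TP = 20; that reading is an assumption, chosen because it reproduces the reported 80%.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all observations that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Assumed reading of the reported counts: TN = 28, FP = 4, FN = 8, TP = 20.
print(accuracy(tp=20, tn=28, fp=4, fn=8))
```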

C)F1-score:

The F1 score is the harmonic mean of the precision and recall of our logistic regression
model. The F1 score is 76.92%.
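The F1 computation can be sketched from the confusion-matrix counts as follows. The counts TP = 20, FP = 4, FN = 8 are an assumed reading of the reported matrix, used here because they are consistent with the reported F1 score.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed counts: TP = 20, FP = 4, FN = 8.
print(round(100 * f1_score(tp=20, fp=4, fn=8), 2))
```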

D)Confusion matrix:

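The confusion matrix contains the counts 28, 4, 8, and 20. Read row by row as true
negatives, false positives, false negatives, and true positives, this gives 28 true
negatives, 4 false positives, 8 false negatives, and 20 true positives, which is the
assignment consistent with the reported accuracy of 80% and F1 score of 76.92%.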
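A confusion matrix can be tallied from true and predicted labels as in this sketch. The row-by-row layout [[TN, FP], [FN, TP]] is a common convention and is assumed here; the six labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = actual class, column = predicted class
    return matrix


# Hypothetical labels for six patients (1 = heart disease, 0 = no heart disease).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
print(tn, fp, fn, tp)
```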

E)Classification Report:

According to the classification report of the logistic regression model, 83% of the cases
without heart disease and 78% of the cases with heart disease were predicted correctly.
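A classification report lists precision, recall, and F1 for each class. The sketch below derives these from a matrix in the assumed [[TN, FP], [FN, TP]] layout, using the counts 28, 4, 8, and 20. The text does not say which metric its percentages refer to, so the output is illustrative only.

```python
def classification_report(matrix):
    """Per-class precision, recall and F1 from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = matrix
    report = {}
    # For class 0 the roles of "positive" and "negative" are swapped.
    for label, hit, false_alarm, miss in ((0, tn, fn, fp), (1, tp, fp, fn)):
        precision = hit / (hit + false_alarm)
        recall = hit / (hit + miss)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / (precision + recall),
            "support": hit + miss,  # number of true members of the class
        }
    return report


# Assumed layout of the reported counts.
report = classification_report([[28, 4], [8, 20]])
for label, metrics in report.items():
    print(label, {name: round(value, 2) for name, value in metrics.items()})
```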

F)Log-Loss:

Log-loss is the cross-entropy error between the predicted probability distribution and the
true labels. The log-loss is 0.5354 (reported as 53.54%).
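The log-loss for binary labels can be computed as in the following sketch. The clipping constant `eps` and the sample probabilities are assumptions added to keep the example safe and self-contained.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean binary cross-entropy between true labels and predicted P(y = 1)."""
    total = 0.0
    for label, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(y_true)


# Predicting 0.5 for every sample gives ln(2), the loss of an uninformed guess.
print(log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

Lower values are better: a confident correct prediction contributes almost nothing, while a confident wrong one is penalised heavily.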

RESULTS AND DISCUSSIONS:


Fig3:Distribution of population based on Gender and Age

The pie chart shows the percentage distribution of gender: 68% (blue) of the participants
were male and 32% (green) were female (N = 100). An age distribution, also called age
composition in population studies, is the proportionate number of persons in successive age
categories in a given population.

Fig4:Types of chest pain of normal and heart patients

Chest pain appears in many forms, ranging from a sharp stab to a dull ache. Sometimes chest
pain feels crushing or burning. In certain cases, the pain travels up the neck, into the
jaw, and to other parts of the body. Many different problems can cause chest pain. The graph
shows the different types of chest pain in normal persons and heart patients.
Fig5: ECG Distribution

ECG stands for “Electrocardiogram”. An electrocardiogram (ECG) is a simple test that can
be used to check your heartbeat. Sensors attached to the skin are used to detect the electrical
signals produced by your heart.

Fig 6: Distribution of Resting_bloodpressure based on Cholesterol and Age

Two main reasons people have heart disease or stroke are high blood pressure and high
cholesterol. Many adults with high cholesterol or high blood pressure do not yet have their
conditions under control. High blood pressure means at least 140/90 mmHg. High cholesterol
in this report means high LDL ("bad") cholesterol. Another reason for heart disease is age:
increasing age results in higher blood pressure, which leads to heart disease and sometimes
stroke.
Fig7:Correlation with Diabetes

High blood glucose from diabetes can damage your blood vessels and the nerves that control
your heart and blood vessels. This damage can lead to heart disease. Cardiovascular(heart)
diseases are especially common among people with diabetes.

Fig8:Cross Validation score

The model is trained on a subset of the data and evaluated on the complementary subset.
The cross-validation score is 85.77%.

Fig9:Accuracy score

Accuracy is the ratio of correct predictions (true positives and true negatives) to all
observations. The accuracy is 80%.
Fig10:Confusion matrix

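The confusion matrix shows 28 true negatives, 4 false positives, 8 false negatives, and 20
true positives, which is the assignment of the four reported counts that is consistent with
the reported accuracy of 80% and F1 score of 76.92%.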

Fig11:Result of Classification_report

According to the classification report of the logistic regression model, 83% of the cases
without heart disease and 78% of the cases with heart disease were predicted correctly.

Fig12:Log loss

Log-loss is the cross-entropy error between the predicted probability distribution and the
true labels. The log-loss is 0.5354 (reported as 53.54%).
Fig13:F1 score

The F1 score is the harmonic mean of the precision and recall of our logistic regression
model. The F1 score is 76.92%.

CONCLUSION:

For heart disease prediction, we used logistic regression, a machine learning algorithm,
on the Heart Disease UCI dataset from Kaggle and obtained an accuracy of 80.00%. The
feature extraction steps played a crucial role in reaching this accuracy. For performance
measurement, we used the classification report, accuracy, precision, log-loss, F1 score,
and confusion matrix. In our assessment, the project compares favourably with previous
research in terms of accuracy and performance measurement, and it reduces the time doctors
need to assess heart disease. As future work, we may investigate deep learning to obtain
better results.

REFERENCES:

[1] Shah, D., Patel, S. & Bharti, S.K. (2020). Heart Disease Prediction using Machine
Learning Techniques. SN Computer Science, 1, 345.
https://doi.org/10.1007/s42979-020-00365-y

[2] S. K. J. and G. S. (2019). Prediction of Heart Disease Using Machine Learning
Algorithms. 2019 1st International Conference on Innovations in Information and
Communication Technology (ICIICT), pp. 1-5. doi:10.1109/ICIICT1.2019.8741465

[3] Dinesh, Kumar G; Arumugaraj, K; Santhosh, Kumar D; Mareeswari, V (2018). Prediction of
Cardiovascular Disease Using Machine Learning Algorithms. 2018 International Conference on
Current Trends towards Converging Technologies (ICCTCT), Coimbatore, India, pp. 1-7.
doi:10.1109/ICCTCT.2018.8550857

[4] Singh, Archana; Kumar, Rakesh (2020). Heart Disease Prediction Using Machine Learning
Algorithms. 2020 International Conference on Electrical and Electronics Engineering (ICE3),
Gorakhpur, India, pp. 452-457. doi:10.1109/ICE348803.2020.9122958

[5] Mohan, Senthilkumar; Thirumalai, Chandrasegar; Srivastava, Gautam (2019). Effective
Heart Disease Prediction using Hybrid Machine Learning Techniques. IEEE Access.
doi:10.1109/ACCESS.2019.2923707
