TO PREDICT HEART-ATTACK OF A PERSON USING ML TECHNIQUES

G BELSHIA JEBAMALAR1, J A ADLIN LAYOLA2, J BIBIJA3, J SAVIJA4
1Assistant Professor, Department of Computer Science Engineering, S.A. Engineering College, Chennai, Tamil Nadu, India
2,3,4Assistant Professor, Department of Computer Science Engineering, Loyola Institute of Technology, Chennai, Tamil Nadu, India
Abstract: Heart disease is a devastating human disease that is on the rise in both industrialised and developing countries, resulting in death. In this disease, the heart fails to supply enough blood to the other regions of the body for them to perform their regular functions. A slew of risk factors is linked to heart disease, creating a pressing need for precise, dependable, and practical methods for making an early diagnosis and managing the condition. The proposed method entails developing a machine learning model capable of determining whether or not a person has cardiac disease. Two different algorithms, KNN and Logistic Regression, are compared, and the better model is used to forecast the outcome.

Keywords: Heart disease, Machine Learning model, KNN, Logistic Regression
I. INTRODUCTION

The heart is the major organ that pumps blood throughout the body and ensures that it functions properly. Heart disease is a devastating human disease that is on the rise in both industrialised and developing countries, resulting in death. In this disease, the heart fails to supply enough blood to the other regions of the body for them to perform their regular functions. A slew of risk factors is linked to heart disease, creating a pressing need for precise, dependable, and practical methods for making an early diagnosis and managing the condition. In the healthcare industry, data mining is a typical technique for analysing large amounts of data. To analyse large, complicated medical data, researchers use a variety of data mining and machine learning techniques. A machine learning (ML) algorithm, a part of AI, employs a variety of precise, probabilistic, and optimised strategies to learn from the past and recognise difficult-to-perceive patterns in large, noisy, or complex datasets, which aids healthcare providers in the prediction of heart disease. The proposed method entails developing a machine learning model capable of determining whether or not a person has cardiac disease.

The software requirements specification, a technical description of the software product's requirements, is the initial phase of requirements analysis; it lists the software system's needs, including specific libraries such as scikit-learn, pandas, NumPy, Matplotlib, and seaborn. The procedure starts with data collection, where Kaggle is used to obtain historical datasets relating to heart disease. Data analysis is carried out on the dataset with proper variable identification, i.e., both dependent and independent variables are discovered. Data cleaning and processing, missing value analysis, exploratory analysis, and model creation and evaluation were all part of the analytical process. Two different techniques, Logistic Regression (LR) and K-Nearest Neighbors (KNN), are applied to the dataset, where the data pattern is learned; the better of the two algorithms is then used to predict the outcome.

This machine learning model was created with Python's Jupyter Notebook and Flask and trained on Kaggle's standard dataset. Accuracy, Precision, Sensitivity, Specificity, F-Measure, and Error are the metrics used to evaluate the suggested model. The proposed model achieved a higher level of precision and accuracy; the experimental findings reveal that it is effective in forecasting heart attacks when compared to other current models.
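As an illustrative sketch of this first data-collection step (not code taken from the paper), the lines below load a Kaggle-style heart disease CSV with pandas and separate the dependent variable from the independent ones; the file name heart.csv and the label column target are assumptions.

import pandas as pd

# Assumed: the Kaggle heart disease dataset saved locally as "heart.csv".
df = pd.read_csv("heart.csv")

# Assumed: the dependent variable is a column named "target"
# (1 = heart disease present, 0 = absent); all other columns
# are treated as independent variables.
X = df.drop(columns=["target"])
y = df["target"]
print(X.shape, y.shape)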
II. LITERATURE SURVEY

III. DATA PRE-PROCESSING:
Machine learning validation approaches are used to estimate the error rate of the Machine Learning (ML) model as close to the true error rate on the dataset as possible. Validation approaches may not be required if the data volume is large enough to be representative of the population; in real-world circumstances, however, it is necessary to work with data samples that may not truly represent the population of a dataset. Duplicate values and the data-type description are examined to identify missing values and to determine whether a given field is a float variable or an integer variable. While tuning model hyperparameters, a sample of data is employed to offer an unbiased evaluation of a model fit on the training dataset, and machine learning specialists use this data to fine-tune the model hyperparameters. Data collection, analysis, and the process of addressing data content, quality, and organisation can be time-consuming. Understanding your data and its properties is helpful during the data identification phase; this knowledge will assist you in choosing which algorithm to use to build your model.

A variety of data cleaning tasks can be carried out with Python's Pandas library, with a focus on the most common task, missing values; this allows less time to be spent cleaning data and more time analysing and modelling it.
Some missing values are simply unintentional errors; other times, there may be a more serious reason for the missing data. From a statistical standpoint, it is critical to comprehend the various sorts of missing data, because the type of missing data determines how missing values are detected, how they are filled in, and whether simple imputation or more detailed statistical approaches are used to deal with them.

Here are some common explanations for missing data (a handling sketch follows the list):

• A field was left blank by the user.
• Data was lost during a manual transfer from a legacy database.
• A programming error occurred.
• Users declined to fill out a field because of their opinions about how the results would be utilised or interpreted.
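A minimal sketch of this missing-value analysis, under the same assumed heart.csv file; the median imputation and the final row drop are illustrative choices, not steps prescribed by the paper.

import pandas as pd

df = pd.read_csv("heart.csv")            # assumed file name
print(df.isnull().sum())                 # missing values per column

# Simple imputation: fill numeric gaps with each column's median ...
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# ... and drop any rows that are still incomplete.
df = df.dropna()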
Uni-variate, bi-variate, and multi-variate analysis are used to identify the variables. The pre-processing steps, illustrated in the sketch after this list, are:

• Import libraries for access and functionality, and read the given dataset.
• Analyse the general properties of the given dataset.
• Display the given dataset in the form of a data frame.
• Show the columns and the shape of the data frame.
• Check the data types and dataset information.
• Check for duplicate data.
• Check the data frame for missing values.
• Check the data frame's unique values.
• Check the data frame's count values.
• Rename and delete columns of the given data frame.
• Determine the type of the values.
• Add additional columns where needed.
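The checklist maps almost one-to-one onto standard pandas calls. A sketch under the same assumptions (file heart.csv; the column name "cp" in the rename step is hypothetical):

import pandas as pd

df = pd.read_csv("heart.csv")      # read the given dataset
print(df.head())                   # display it as a data frame
print(df.columns, df.shape)        # show columns and shape
df.info()                          # data types and dataset information
print(df.duplicated().sum())       # checking for duplicate data
print(df.isnull().sum())           # checking missing values
print(df.nunique())                # checking unique values
print(df.count())                  # checking count values
# Rename or delete columns (the names here are hypothetical):
df = df.rename(columns={"cp": "chest_pain_type"})
# df = df.drop(columns=["unwanted_column"])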
IV. DATA VALIDATION/CLEANING/PREPARING PROCESS:

The process begins by importing the library packages and loading the specified dataset, then identifying variables based on data shape and data type and evaluating missing and duplicate values. A validation dataset is a sample of data held back after training your model that is used to measure model skill while tweaking the model; there are established techniques for making the greatest use of validation and test datasets while evaluating your models. To support the uni-variate, bi-variate, and multi-variate analyses, data cleaning/preparation is performed by renaming the given dataset and dropping columns, among other things. The methods and techniques for cleaning data will differ depending on the dataset. The basic purpose of data cleaning is to find and fix mistakes and abnormalities so that the data may be used for analytics and decision-making.
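One common way to hold back such a validation sample is scikit-learn's train_test_split; the 80/20 split below is an assumed proportion, not one stated in the paper.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                      # assumed file name
X, y = df.drop(columns=["target"]), df["target"]   # assumed label column

# Hold back 20% of the rows as a validation sample (assumed proportion).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)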
V. EXPLORATION DATA ANALYSIS OF VISUALIZATION:

In applied statistics and machine learning, data visualisation is a crucial ability. Statistics is concerned with the description and estimation of quantitative data; data visualisation provides a valuable set of tools for acquiring a qualitative understanding of data. This can be useful for spotting patterns, faulty data, outliers, and other things when exploring and getting to know a dataset.
With a little subject knowledge, data visualisations can be used to express and demonstrate crucial relationships in plots and charts that are more visceral and meaningful to stakeholders than measurements of association or significance.

Data visualisation and exploratory data analysis are fields in and of themselves, and it is recommended that you read some of the sources indicated at the end for further information. Data may not make sense unless it is presented in a visual format, such as charts and graphs. The ability to visualise data samples and other objects quickly is a crucial talent in both applied statistics and applied machine learning. This section covers the plot types needed when visualising data in Python and how to use them to better understand your own data (a plotting sketch follows this list):

• How to use line plots to visualise time series data and bar charts to visualise categorical data.
• How to use histograms and box plots to summarise data distributions.
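A sketch of these plot types on the assumed heart.csv data; the column names "age" and "target" are assumptions about the dataset, not details given in the paper.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("heart.csv")               # assumed file name

df.hist(figsize=(12, 10))                   # histogram of each numeric column
plt.tight_layout()
plt.show()

sns.boxplot(x="target", y="age", data=df)   # box plot of age by class
plt.show()

sns.countplot(x="target", data=df)          # bar chart of the class balance
plt.show()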
VI. COMPARING ALGORITHMS WITH PREDICTION IN THE FORM OF BEST ACCURACY RESULT:

It is critical to compare the performance of multiple machine learning algorithms consistently, and this section shows how to develop a test harness in Python with scikit-learn to do so. This test harness can be used as a framework for your own machine learning tasks, with additional and different algorithms to compare. The performance characteristics of each model will vary, and you can gain an idea of how accurate each model is on unseen data using resampling approaches such as cross validation. You must be able to use these estimates to select one or two of the best models from the set you have built.

When you have a fresh dataset, it is a good idea to visualise it in a variety of ways so you can see it from multiple angles. Model selection follows the same logic: to choose the one or two models to finalise, you should look at the estimated accuracy of your machine learning algorithms in a variety of ways. Using various visualisation approaches to display the average accuracy, variance, and other features of the distribution of model accuracies is one way to accomplish this.

The sketch that follows shows one way to do this in Python using scikit-learn. The key to a fair comparison of machine learning algorithms is to ensure that each algorithm is evaluated in the same way on the same data, which can be accomplished by requiring each algorithm to be evaluated on the same test harness.
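A sketch of such a test harness, under the same assumed heart.csv file and target column; the 10-fold split, the scaling step, and k = 5 are illustrative defaults rather than settings reported in the paper.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("heart.csv")                      # assumed file name
X, y = df.drop(columns=["target"]), df["target"]   # assumed label column

# Both models are evaluated on the same folds of the same data.
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}
cv = KFold(n_splits=10, shuffle=True, random_state=7)
results = {name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
           for name, model in models.items()}

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Box plot of the distribution of fold accuracies per model.
plt.boxplot(list(results.values()), labels=list(results.keys()))
plt.ylabel("accuracy")
plt.show()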
VII. ALGORITHM AND TECHNIQUES:

Algorithm Explanation:
Classification is a supervised learning strategy in machine learning and statistics in which a computer program learns from the data supplied to it and then applies that learning to classify fresh observations. The data set may be bi-class or multi-class. Speech recognition, handwriting recognition, biometric identification, and document classification are examples of classification challenges. Algorithms in supervised learning learn from labelled data; after analysing the data, the algorithm decides which label should be given to new data by associating the observed patterns with the unlabelled new data.

Used Python Packages:

sklearn:
• Sklearn is a machine learning package for Python that includes a variety of machine learning methods.
• Some of its modules, such as train_test_split, DecisionTreeClassifier or LogisticRegression, and accuracy_score, are used here.

NumPy:
• It is a Python numeric module that provides fast maths functions for calculations.
• It is used to pull data out of NumPy arrays and manipulate it.
Pandas:
• Data frames can be used to read and write various file formats, and data manipulation is simple with them.

Matplotlib:
• Data visualisation is a helpful tool for identifying trends in a dataset, and plotting from data frames is simple.

System Architecture:
VIII. LOGISTIC REGRESSION:

Logistic regression is a statistical technique for analysing a data set in which one or more independent variables influence an outcome that is assessed with a dichotomous variable (one with only two possible values). The purpose of logistic regression is to identify the best-fitting model to represent the relationship between a set of independent (predictor or explanatory) variables and a dichotomous feature of interest (the dependent, response, or outcome variable). As a machine learning classification approach, logistic regression is used to predict the likelihood of a categorical dependent variable: the dependent variable is a binary variable coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). Thus, the logistic regression model predicts P(Y=1) as a function of X.

Assumptions:
• Binary logistic regression needs a binary dependent variable.
• In a binary regression, factor level 1 of the dependent variable should represent the intended outcome.
• Only the most crucial variables should be included.
• The independent variables must not be correlated with one another; in other words, the model should have little or no multicollinearity.
• The log odds are linearly related to the independent variables.
• Logistic regression necessitates a large sample size.

Expected output from given input:
input: data
output: obtaining accuracy
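A minimal sketch of fitting and evaluating such a model with scikit-learn, under the same assumed dataset and split; the scaling step and max_iter value are illustrative choices.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("heart.csv")                      # assumed file name
X, y = df.drop(columns=["target"]), df["target"]   # assumed label column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr.fit(X_train, y_train)

# P(Y=1) as a function of X, then the hard 0/1 class prediction.
probabilities = lr.predict_proba(X_test)[:, 1]
predictions = lr.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))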
IX. K-NEAREST NEIGHBOUR:

This algorithm is used to solve classification model challenges. To classify data, the K-nearest neighbour (K-NN) algorithm generates an imaginary boundary; when new data points are received, the algorithm tries to predict them as close to the boundary line as possible. As a result, a larger k value gives smoother separation curves and thus simpler models, while smaller k values are more likely to overfit the data, resulting in more complicated models.

Note:
• When evaluating a dataset, it is critical to use an appropriate k value to avoid overfitting and underfitting.
• With the k-nearest neighbour approach, we fit the previous data (train the model) and use it to predict future observations.

KNN accuracy is determined by the distance measure and the value of K. Cosine and Euclidean distance are two methods of determining the distance between two instances. To evaluate a new, unknown sample, KNN computes its K nearest neighbours and assigns a class by majority voting.
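A sketch of KNN with a small cross-validated search over k, which addresses the over/underfitting trade-off described above; the range of k from 1 to 20 and the default Euclidean metric are assumptions, not settings from the paper.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("heart.csv")                      # assumed file name
X, y = df.drop(columns=["target"]), df["target"]   # assumed label column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Try k = 1..20 and keep the k with the best cross-validated accuracy.
best_k, best_acc = 1, 0.0
for k in range(1, 21):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    acc = cross_val_score(knn, X_train, y_train, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc

# Final model: majority vote among best_k neighbours (Euclidean by default).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
knn.fit(X_train, y_train)
print("best k:", best_k, "test accuracy:", knn.score(X_test, y_test))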
X. DEPLOYMENT:

Flask (Web Framework):

Flask is a Python-based micro web framework. It is categorised as a micro-framework because it does not require any particular tools or libraries: it lacks a database abstraction layer, form validation, or any other components that rely on third-party libraries for common functions. Extensions, on the other hand, can be used to extend the functionality of an application as if it were built into Flask itself. Extensions are available for object-relational mappers, form validation, upload handling, several open authentication protocols, and other framework-related features. Flask was invented by Armin Ronacher of Pocoo, a worldwide association of Python aficionados founded in 2004. According to Ronacher, the concept started as an April Fool's joke that grew popular enough to be turned into a legitimate application.
The name is a pun on the earlier Bottle framework. Flask is a BSD-licensed Python framework built on Werkzeug and Jinja2. Although Flask is relatively new compared to the majority of Python frameworks, it has already garnered popularity among Python web developers, so it is worth a deeper look at this so-called "micro" Python framework.

Expected output from input:
input: data values
output: forecasting results
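A minimal sketch of such a Flask deployment, assuming the trained pipeline was saved to model.pkl with pickle and accepts a flat feature list; the route name and JSON shape are assumptions for illustration.

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Assumed: the trained pipeline was saved earlier with pickle.dump(...).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Assumed JSON body: {"features": [age, sex, ...]} in training order.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"heart_disease": int(prediction)})

if __name__ == "__main__":
    app.run(debug=True)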
XI. PREDICTION RESULT BY ACCURACY:

The logistic regression process also uses a linear equation with independent predictors to predict a value; since the predicted value can range from negative infinity to positive infinity, the algorithm's output must be converted into a class variable. By comparing the best accuracy, the logistic regression model achieves higher accuracy in predicting the outcome.

The adjustments we apply to our data before feeding it to the algorithm are referred to as pre-processing. Data preparation converts unclean data into a clean data set: whenever data is received from various sources, it arrives in raw format, which makes analysis impossible. To get better results from the model used in machine learning, the data must be organised properly, and null values must be handled in the original raw data collection, since a given machine learning model requires information in a specific format. Another consideration is that the data set should be prepared so that multiple machine learning and deep learning methods can be applied to the same dataset.

False Positives (FP): A payer is incorrectly predicted to be a defaulter; the predicted class is yes, but the actual class is no. For example, the actual class says the passenger did not survive, but the predicted class says the passenger did.

False Negatives (FN): A defaulter is incorrectly predicted to be a payer; the predicted class is no, but the actual class is yes. For example, the actual class says the passenger survived, while the predicted class says the passenger perished.

True Positives (TP): These are correctly predicted positive values: the actual class is yes and the predicted class is also yes. For example, the actual class says the passenger survived and the predicted class also says the passenger survived.

True Negatives (TN): These are correctly predicted negative values: the actual class is no and the predicted class is also no. For example, the actual class says the passenger did not survive and the predicted class says the same.

True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)

Accuracy: the percentage of total predictions that are correct; in other words, how often the model correctly predicts defaulters and non-defaulters.
Accuracy = (TN + TP) / (TP + TN + FN + FP)

Precision: the percentage of positive predictions that are correct.
Precision = TP / (TP + FP)

Recall: the percentage of actual positive values that were correctly predicted (the percentage of real defaulters the model identifies).
Recall = TP / (TP + FN)

The F1 Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. Although it is less intuitive, F1 is frequently more useful than accuracy, especially if the class distribution is uneven. Accuracy works best when the costs of false positives and false negatives are similar; if they differ significantly, it is better to examine both Precision and Recall.

General Formula:
F-Measure = 2TP / (2TP + FN + FP)

F1-Score Formula:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
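These formulas can be checked directly from a confusion matrix; below is a small sketch with toy labels, used for illustration only and not results from the paper.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy actual labels, for illustration
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tn + tp) / (tp + tn + fn + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # also the true positive rate
f_measure = 2 * tp / (2 * tp + fn + fp)
print(accuracy, precision, recall, f_measure)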
XII. CONCLUSION:

The analytical process included data cleaning and processing, missing value analysis, exploratory analysis, and model creation and evaluation. The best accuracy on the public test set will be identified, along with the highest accuracy score. This program can assist in determining the likelihood of a heart attack.
REFERENCES

[1] A. S. Abdullah and R. R. Rajalaxmi, "A data mining model for predicting the coronary heart disease using random forest classifier," in Proc. Int. Conf. Recent Trends Comput. Methods, Commun. Controls, Apr. 2012, pp. 22–25.
[2] A. Hazra, S. K. Mandal, A. Gupta, et al., "Heart disease diagnosis and prediction using machine learning and data mining techniques," 2017.
[3] S. A. Akhter, A. Badami, M. Murray, et al., "Hospital readmissions after continuous-flow left ventricular assist device implantation: incidence, causes, and cost analysis," 2015.
[4] A. Golande, T. Pavan Kumar, et al., "Heart disease prediction using effective machine learning techniques," 2019.
[5] G. Jignesh Chowdary, G. Suganya, Premalatha, et al., "Effective prediction of cardiovascular disease using cluster of machine learning algorithms," 2020.
[6] S. A. Akhter, A. Badami, et al., "Hospital readmissions after continuous-flow left ventricular assist device implantation: incidence, causes, and cost analysis," 2020.
[7] M. R. Hodkiewicz, M. P. Norton, et al., "The effect of change in flow rate on the vibration of double-suction centrifugal pumps," 2002.
[8] R. Chitra, V. Seenivasagam, et al., "Review of heart disease prediction system using data mining and hybrid intelligent techniques," 2013.
[9] G. L. Yost, T. J. Royston, G. Bhat, A. J. Tatooles, et al., "Acoustic characterization of axial flow left ventricular assist device operation in vitro and in vivo," 2016.
[10] T. Schmidt, N. Reiss, et al., "Telemonitoring and medical care of heart failure patients supported by left ventricular assist devices - the Medolution project," 2017.