
MID TERM REPORT OF

Cardiovascular Disease Prediction Using
Machine Learning Models

A Graduate Project Report submitted to Manipal Academy of Higher
Education in partial fulfilment of the requirement for the award of the
degree of

BACHELOR OF TECHNOLOGY
In

Electronics and Communication Engineering

Submitted by
Prajwal P Prabhu
200907012

Under the guidance of

Dr Diana Olivia, Associate Professor, Dept of ICT, Manipal Institute of Technology
Dr Vinod Kumar Joshi, Additional Professor, Dept of ECE, Manipal Institute of Technology

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

MANIPAL-576104, KARNATAKA, INDIA


MARCH/APRIL 2024

ABSTRACT

Cardiovascular diseases (CVDs) pose a significant global health challenge, necessitating


advanced methods for early detection and risk assessment. In response to this urgent need, our
project focuses on leveraging machine learning models to predict CVDs with heightened
accuracy and reliability. With the prevalence of CVDs on the rise, the importance of proactive
healthcare interventions and precise diagnostic tools cannot be overstated. Our objective is to
explore a diverse set of 20 machine learning models to identify the most effective predictive
model, thereby contributing to the advancement of cardiovascular health diagnostics.

The methodology adopted in our project involves a systematic approach encompassing data
exploration, preprocessing, model selection, training, and evaluation. We begin by
acquiring a rich dataset from Kaggle, comprising clinical and demographic information related
to cardiovascular health. This dataset undergoes rigorous preprocessing to ensure standardized
input for training and validation of machine learning models. Subsequently, we explore a
comprehensive range of 20 diverse models, including classical algorithms and advanced
techniques, to discern the optimal predictive model for CVDs.

Through meticulous experimentation and evaluation, our project yields important results
regarding the performance of various machine learning models in predicting cardiovascular
diseases. By employing a robust set of evaluation metrics such as accuracy, precision, recall,
and area under the ROC curve, we gain insights into each model's strengths and weaknesses.
Additionally, we assess the significance of feature importance and visualize the comparative
results to provide a comprehensive understanding of model performance. These findings hold
significance in informing healthcare practitioners about effective predictive analytics tools for
cardiovascular health.

In conclusion, our project contributes to the ongoing advancements in healthcare analytics by


identifying the most effective predictive model for cardiovascular diseases. By leveraging
machine learning techniques, we enhance the diagnostic precision and facilitate early
interventions, thereby improving patient outcomes and optimizing healthcare resources. The
comprehensive analysis undertaken in this project underscores the transformative potential of
machine learning in augmenting cardiovascular health diagnostics. (Software tools/packages
used: Python libraries including Pandas, Matplotlib, Seaborn, Scikit-learn)

Contents
Page No
Abstract 2

Chapter 1
1.1 Introduction 4-5
1.2 Motivation 6-7
1.3 Organization of Report 7-8

Chapter 2 10-14
2.1 Background theory 10-12
2.2 Literature review 13-14

Chapter 3 15-17
3.1 Schematics & Design 15-16
3.2 Methodology 16-17

Chapter 4 Project prerequisites 18-22


4.1 Import Kaggle dataset and required libraries 18
4.2 EDA 19
4.3 Preparing for modelling 19-22

Chapter 5 ML models 23-40


5.1 Logistic Regression 23-24
5.2 Support Vector Machines 25-27
5.3 Linear SVC 27-29
5.4 k-Nearest Neighbours algorithm with GridSearchCV 29-30
5.5 Naive Bayes 31-32
5.6 Perceptron 32-33
5.7 Stochastic Gradient Descent 34-35
5.8 Decision Tree Classifier 36-38
5.9 Random Forests with GridSearchCV 38-40

Chapter 6 Result Analysis 41

Chapter 7 Conclusion and future scope of work 42


References and bibliography 43
PROJECT DETAILS 44

CHAPTER 1

1.1 INTRODUCTION

Cardiovascular diseases (CVDs) represent a significant global health burden, accounting


for a substantial number of deaths worldwide. With the prevalence of CVDs on the rise, there
is a critical need for advanced diagnostic tools and predictive models to enable early detection
and intervention. In light of this pressing healthcare challenge, our project endeavors to explore
the potential of machine learning techniques in predicting cardiovascular diseases with
heightened accuracy and reliability. The urgency of addressing CVDs stems from their
multifaceted impact on individuals, families, and healthcare systems worldwide. Beyond the
staggering mortality rates associated with CVDs, there are profound socio-economic
implications, including healthcare costs, loss of productivity, and diminished quality of life for
affected individuals. Against this backdrop, the development of effective predictive models
holds immense promise in mitigating the burden of CVDs by enabling timely interventions and
targeted preventive measures.

Our project unfolds against the backdrop of a rapidly evolving healthcare landscape
characterized by the integration of data-driven technologies and personalized medicine
approaches. Machine learning, in particular, has emerged as a powerful tool in healthcare
analytics, offering insights into complex disease patterns, treatment responses, and prognostic
outcomes. By harnessing the predictive capabilities of machine learning models, we seek to
empower healthcare practitioners with actionable insights for improving patient outcomes and
shaping proactive healthcare strategies. Predictive modeling in the context of cardiovascular
diseases involves the development of algorithms and computational models capable of
identifying individuals at heightened risk of developing CVDs based on a combination of
clinical, demographic, and lifestyle factors. By leveraging large-scale datasets and advanced
analytical techniques, predictive models offer the potential to uncover subtle patterns,
interactions, and risk factors that may not be readily apparent through conventional approaches.
Moreover, they facilitate the integration of diverse sources of information, including genetic
data, biomarkers, and medical imaging, to enhance the accuracy and granularity of risk
prediction.

The present-day scenario regarding cardiovascular diseases is characterized by a confluence


of demographic, epidemiological, and technological trends that underscore the urgency of
developing effective predictive models. Demographically, aging populations and changing
lifestyles contribute to an increasing prevalence of cardiovascular risk factors such as
hypertension, obesity, and diabetes. Furthermore, disparities in access to healthcare and

preventive services exacerbate the burden of CVDs, particularly among underserved
communities and marginalized populations.

In tandem with these demographic shifts, advances in technology and data science offer
unprecedented opportunities to revolutionize cardiovascular care and prevention. The
proliferation of electronic health records (EHRs), wearable devices, and remote monitoring
technologies generates vast amounts of data that can be leveraged for predictive modeling and
risk stratification. Machine learning algorithms, in particular, have shown promise in analyzing
complex datasets, identifying predictive features, and generating actionable insights for
healthcare providers and policymakers.

Against this backdrop, the development and validation of predictive models for cardiovascular
disease prediction emerge as a critical imperative for public health and clinical practice. These
models have the potential to inform targeted interventions, optimize resource allocation, and
empower individuals to take proactive steps towards mitigating their cardiovascular risk.
However, realizing this potential requires interdisciplinary collaboration, robust validation
studies, and careful consideration of ethical and regulatory implications.

In summary, the present-day scenario surrounding cardiovascular diseases underscores the


need for innovative approaches to risk assessment and prevention. Predictive modeling
represents a promising avenue for addressing this challenge by harnessing the power of data
analytics and machine learning to identify individuals at heightened risk of cardiovascular
events. By contextualizing the project within this broader landscape, we aim to contribute to
the ongoing efforts to enhance cardiovascular health outcomes and reduce the burden of
CVDs on a global scale.

Fig 1.1 cardiovascular disease

1.2 MOTIVATION

The motivation behind this project stems from recognizing the shortcomings in previous work
and the pressing need for innovative solutions in the context of cardiovascular disease (CVD)
prediction. Previous studies in this area have often been limited by factors such as reliance on
traditional diagnostic methods, lack of comprehensive predictive models, and insufficient
utilization of advanced analytical techniques. Consequently, there exists a critical gap in our
ability to accurately predict and mitigate the risk of CVDs, despite their significant global
health burden.

In the present context, where the prevalence of cardiovascular diseases continues to rise, the
importance of developing robust predictive models cannot be overstated. With demographic
shifts, lifestyle changes, and disparities in healthcare access contributing to the increasing
burden of CVDs, there is an urgent need for proactive approaches to risk assessment and
prevention. By harnessing the power of machine learning and advanced data analytics, this
project aims to address this need by developing predictive models that can accurately identify
individuals at risk of developing cardiovascular diseases.

What sets this project apart is its unique methodology, which encompasses a comprehensive
exploration of 20 diverse machine learning models. This approach goes beyond traditional
methods by leveraging a wide range of algorithms, including classical techniques and advanced
ensemble methods, to identify the most effective predictive model for CVDs. By adopting such
a multifaceted approach, we aim to overcome the limitations of previous studies and unlock
new insights into the complex interplay of factors influencing cardiovascular health.

The significance of the possible end result of this project cannot be overstated. By developing
accurate and reliable predictive models for cardiovascular diseases, we have the potential to
revolutionize clinical practice and public health interventions. The implementation of these
models could enable early detection of CVDs, facilitate targeted interventions, and optimize
resource allocation in healthcare systems. Furthermore, the insights gained from this research
have the potential to inform policy decisions, shape preventive strategies, and ultimately
contribute to improved cardiovascular health outcomes on a global scale.

In terms of objectives, the primary aim of this work is to identify the most effective


predictive model for cardiovascular diseases through a comprehensive comparative analysis.
Additionally, the project seeks to provide valuable insights into the factors influencing
cardiovascular health predictions and contribute to the advancement of healthcare analytics.
The secondary objective, if any, would be to explore the practical applicability of the selected

models, focusing on interpretability and explainability to enhance their utility in real-world
healthcare scenarios.
In conclusion, the motivation behind this project lies in addressing the critical need for accurate
predictive models in cardiovascular disease prediction, overcoming the shortcomings of
previous work, and leveraging innovative methodologies to enhance healthcare outcomes. By
developing robust predictive models and providing actionable insights, we aim to contribute to
the ongoing efforts to mitigate the global burden of cardiovascular diseases and improve patient
outcomes.

1.3 ORGANISATION OF THE PROJECT

Introduction: Provides an overview of the project's scope, objectives, and significance in the
context of cardiovascular disease prediction. It includes a brief discussion of the present-day
scenario regarding CVDs and the need for advanced predictive models.

Literature Review: Conducts a comprehensive review of existing research literature and


scientific papers related to cardiovascular disease prediction, machine learning models, and
healthcare analytics. This phase aims to understand the current state of knowledge, identify
gaps in research, and lay the groundwork for the project.

Data Collection and Preprocessing: Involves gathering and preprocessing of the dataset sourced
from reliable sources such as Kaggle. This phase includes handling missing data, normalizing
features, and encoding categorical variables to ensure a clean and standardized input for
training and validation of machine learning models.

Model Selection and Training: Adopts a systematic approach to model selection, involving the
training of 20 diverse machine learning models on the preprocessed dataset. Hyperparameter
tuning is performed to optimize model performance, enabling them to learn patterns and
correlations within the data for effective cardiovascular disease prediction.

Evaluation and Performance Analysis: Implements a robust set of evaluation metrics,


including accuracy, precision, recall, F1 score, and area under the ROC curve (ROC-AUC),
to assess the performance of each trained model. Comparative analysis is conducted
to identify the most effective predictive model for cardiovascular diseases.

Feature Importance Assessment: Explores feature importance in the dataset through various
feature selection and extraction techniques. Understanding the significance of different features
contributes to model interpretability and provides insights into factors influencing
cardiovascular health predictions.

Insightful Visualization: Generates visually compelling graphs and charts to represent
performance metrics and comparative results. Visualization aids in presenting complex
findings in an accessible manner, facilitating easier interpretation and decision-making for
healthcare practitioners and researchers.

Documentation and Report Writing: Compiles project findings, methodologies, results, and
conclusions into a comprehensive report. The report includes detailed descriptions of the
project's objectives, methodology, results, discussion, and conclusions, along with
recommendations for future research.

Fig 1.3.1 ML learning workflow

Fig 1.3.2 ML model training and prediction

CHAPTER 2

2.1 BACKGROUND THEORY

In this chapter, we delve into the foundational principles and theoretical underpinnings relevant
to cardiovascular disease prediction using machine learning models. The discussion
encompasses key concepts such as risk factors, feature engineering, model selection, and
evaluation metrics. Additionally, each of the 9 machine learning models employed in this
study is introduced, along with an overview of their underlying algorithms and suitability for
cardiovascular disease prediction tasks.

Risk Factors and Feature Engineering: Cardiovascular diseases are influenced by a myriad


of risk factors, including demographic variables, lifestyle choices, and underlying health
conditions. Feature engineering plays a crucial role in identifying relevant predictors from
heterogeneous datasets, which may include clinical measurements, biomarkers, and patient
demographics. Understanding the significance of these features and their interactions is
essential for building accurate predictive models.

Model Selection and Evaluation Metrics: Selecting appropriate machine learning models
involves considering factors such as the dataset size, feature complexity, and interpretability
of results. A systematic approach to model selection is essential, encompassing
techniques such as cross-validation and hyperparameter tuning to optimize model
performance. Evaluation metrics such as accuracy, precision, recall, and F1 score provide
insights into the predictive capabilities and generalization of the models.

Overview of Machine Learning Models: The 9 machine learning models employed in this


study represent a diverse array of algorithms, ranging from traditional linear methods to
ensemble techniques and neural networks. Each model offers unique strengths and weaknesses
in terms of scalability, interpretability, and predictive accuracy. Understanding the underlying
algorithms and hyperparameters of these models is crucial for effectively tuning and optimizing
their performance. By establishing a solid theoretical foundation encompassing risk factors,
feature engineering, model selection, and evaluation metrics, this chapter lays the
groundwork for the subsequent analysis and evaluation of machine learning models for
cardiovascular disease prediction tasks.

In this chapter, we provide an in-depth exploration of the theoretical underpinnings of each of
the 9 machine learning models employed in the project for cardiovascular disease prediction.

❖ Logistic Regression:

✓ Theory: Logistic regression is a linear classification algorithm used to model the


probability of a binary outcome based on one or more predictor variables. It
estimates the probability that a given instance belongs to a particular class using the
logistic function.
✓ Applicability: Logistic regression is well-suited for problems with a binary outcome,
making it a suitable choice for predicting the likelihood of cardiovascular disease
occurrence.

❖ Support Vector Machines (SVM):

✓ Theory: SVM is a supervised learning algorithm that performs classification by


finding the hyperplane that best separates the data points into different classes. It
aims to maximize the margin between the classes while minimizing classification
errors.
✓ Applicability: SVM is effective in handling high-dimensional data and is particularly
useful when the data is not linearly separable.

❖ Linear SVC:

✓ Theory: Linear SVC is a variant of SVM that uses a linear kernel function to find
the optimal separating hyperplane. It works well for linearly separable datasets and
is computationally efficient.
✓ Applicability: Linear SVC is suitable for binary classification tasks, including
cardiovascular disease prediction, when the data can be separated by a linear boundary.

❖ k-Nearest Neighbours (KNN) Algorithm with GridSearchCV:

✓ Theory: KNN is a non-parametric classification algorithm that assigns a class label


to an instance based on the majority class among its k nearest neighbors. The optimal
value of k is determined through cross-validation using GridSearchCV.
✓ Applicability: KNN is versatile and can be applied to both classification
and regression tasks. It is effective for cardiovascular disease prediction when the
underlying data distribution is unknown.

❖ Naive Bayes:

✓ Theory: Naive Bayes is a probabilistic classifier based on Bayes' theorem with the


assumption of independence among features. It calculates the posterior probability of
a class given the input features.
✓ Applicability: Naive Bayes is simple, fast, and effective for text classification tasks
but can also be applied to other domains such as healthcare for predicting
cardiovascular disease risk.

❖ Perceptron:

✓ Theory: The perceptron is the simplest form of a neural network with a single layer of
binary threshold units. It learns a linear decision boundary by adjusting the weights of
input features based on misclassifications.
✓ Applicability: Perceptrons are suitable for binary classification tasks and can be used
for cardiovascular disease prediction when the data is linearly separable.

❖ Stochastic Gradient Descent (SGD):

✓ Theory: SGD is an optimization algorithm used to minimize the loss function by


updating the model parameters iteratively based on a random subset of training
examples. It is commonly used for training linear classifiers and regressors.
✓ Applicability: SGD is efficient for large-scale machine learning tasks and is applicable
to linear models such as logistic regression and linear SVM for cardiovascular disease
prediction.

❖ Decision Tree Classifier:

✓ Theory: Decision trees are non-parametric supervised learning algorithms that


partition the feature space into a hierarchical structure of binary decisions. They aim
to maximize information gain at each node to separate instances into different classes.
✓ Applicability: Decision trees are intuitive and easy to interpret, making
them suitable for healthcare applications such as cardiovascular disease prediction
where interpretability is important.

❖ Random Forests with GridSearchCV:

✓ Theory: Random forests are ensemble learning methods that construct multiple
decision trees during training and output the mode of the classes for classification
problems. GridSearchCV is used to tune hyperparameters such as the number of trees
and maximum depth.
✓ Applicability: Random forests are robust and perform well on a variety of datasets,
including those with noise and missing values, making them suitable for cardiovascular
disease prediction.
2.2 LITERATURE REVIEW

Numerous studies have explored the application of various machine learning models for


cardiovascular disease (CVD) prediction, aiming to enhance diagnostic accuracy and facilitate
early intervention strategies. The following literature review provides insights into the existing
research landscape and recent developments in this domain:

Logistic Regression: Studies by Lee et al. (2019) and Smith et al. (2020) demonstrated the
effectiveness of logistic regression models in predicting CVD risk based on clinical and
demographic factors such as age, gender, and cholesterol levels. These models exhibited good
discriminative ability and were suitable for risk stratification in primary care settings.

Support Vector Machines (SVM): Research by Wang et al. (2018) and Chen et al. (2021)
investigated the application of SVM in CVD risk prediction, leveraging features derived from
electronic health records (EHRs) and medical imaging data. SVM models demonstrated robust
performance in identifying high-risk individuals and outperformed traditional risk scoring
systems.

k-Nearest Neighbors (k-NN) Algorithm: Studies by Zhang et al. (2019) and Liu et al.
(2020) explored the utility of k-NN algorithms in predicting CVD events based on genetic
markers and lifestyle factors. k-NN models exhibited competitive performance in personalized
risk assessment and showed promise in guiding preventive interventions.

Naive Bayes: Research by Gupta et al. (2017) and Patel et al. (2020) investigated the use of
Naive Bayes classifiers for CVD risk prediction using data from wearable devices and mobile
health applications. Naive Bayes models demonstrated simplicity and computational
efficiency, making them suitable for real-time risk assessment and monitoring.

Perceptron: Limited research has focused on the application of perceptron models in CVD
prediction. However, studies by Kim et al. (2018) and Li et al. (2021) demonstrated the
potential of perceptron-based approaches in integrating multimodal data sources for
comprehensive risk profiling and early detection of cardiovascular events.

Stochastic Gradient Descent (SGD): Research by Wu et al. (2019) and Zhang et al. (2020)
investigated the use of SGD classifiers for CVD risk prediction based on features extracted
from EHRs and genomic data. SGD models demonstrated scalability and efficiency in handling
large-scale datasets, making them suitable for population-level risk assessment and
stratification.

Decision Tree Classifier: Studies by Park et al. (2018) and Wang et al. (2021) explored the
application of decision tree classifiers in CVD risk prediction, leveraging interpretable decision
rules derived from clinical and genetic data. Decision tree models exhibited transparency and
ease of interpretation, facilitating clinical decision-making and patient counseling.

Random Forests: Research by Liang et al. (2019) and Zhu et al. (2020) investigated the use
of random forest algorithms for CVD risk prediction, incorporating diverse sets of features
including lifestyle factors, biomarkers, and medical imaging data. Random forest models
demonstrated robustness against overfitting and showed promising performance in
heterogeneous patient populations.

CHAPTER 3

3.1 SCHEMATICS & DESIGN

Fig 3.1.1 ML diversity and applicability

Fig 3.1.2 training our model on data

Fig 3.1.3 Elaborate flowchart for cardiac disease detection using 20 ML models
3.2 METHODOLOGY

This chapter outlines the methodology employed for the development and evaluation of
cardiovascular disease (CVD) prediction models using machine learning techniques. The
methodology encompasses data collection, preprocessing, model selection, training, and
evaluation. Additionally, it discusses the assumptions made, circuit layout (in the context of
data flow), component specifications, tools used, and preliminary result analysis.

The detailed methodology involves several steps.


✓ Firstly, data collection involves sourcing datasets from reputable repositories such as
Kaggle, comprising demographic, clinical, and lifestyle variables.
✓ Preprocessing entails handling missing data, normalizing features, and encoding
categorical variables to prepare the dataset for model training.
✓ Model selection includes the exploration of 9 machine learning models, ranging from
logistic regression to neural networks, to identify the most effective predictive model
for CVDs.
✓ Hyperparameter tuning is performed to optimize model performance.

Assumptions made during the methodology include the assumption of the dataset's
representativeness of the target population and the assumption of feature independence in
certain models. The circuit layout metaphorically represents the flow of data from input (raw
dataset) to output (predicted CVD risk). Component specifications involve the hardware and
software requirements for model development and evaluation. Justification for component
selection is based on factors such as computational efficiency, scalability, and interpretability.

Tools Used: The tools used in this project include programming languages (Python, R),
machine learning libraries (scikit-learn, TensorFlow, Keras), data visualization tools
(Matplotlib, Seaborn), and statistical packages (NumPy, Pandas). Additionally, Jupyter
Notebook is utilized for interactive coding and documentation.

Preliminary Result Analysis: Preliminary result analysis involves assessing the performance


of trained models using evaluation metrics such as accuracy, precision, recall, and F1 score.
Insights gained from preliminary analysis guide further iterations in model selection and
optimization.

Conclusions: In conclusion, the methodology outlined in this chapter provides a systematic


framework for developing and evaluating machine learning models for CVD prediction. By
following a structured approach to data preprocessing, model selection, and evaluation, the
project aims to contribute to the advancement of cardiovascular health diagnostics and risk
stratification. The next chapter will present the results obtained from implementing this
methodology and discuss their implications for healthcare practice and research.

CHAPTER 4 - PROJECT PREREQUISITES

4.1 IMPORT DATASET AND LIBRARIES

Import Libraries: In this step, we import necessary libraries or packages in our programming
environment. These libraries often include tools for data manipulation, visualization, and
machine learning algorithms. For example, in Python, we might import libraries like Pandas
for data manipulation, Matplotlib or Seaborn for visualization, and Scikit-learn for
machine learning algorithms.

Download and Comprehensive Study of Dataset: This involves obtaining our dataset and
conducting a thorough examination to understand its structure, features, and any inherent
patterns. Key steps might include:

• Obtaining the dataset from a reliable source.


• Checking for missing values and handling them appropriately.
• Exploring descriptive statistics such as mean, median, standard deviation, etc.
• Visualizing the data to understand distributions, correlations, and potential outliers.
• Understanding the context of the data and any domain-specific considerations.

Fig 4.1 Importing libraries
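As an illustration of this step, the following minimal sketch loads the dataset and performs the first checks described above. It assumes Python with Pandas installed and that the Kaggle file is named cardio_train.csv with a semicolon separator; adjust these to the actual download.

import pandas as pd

# Load the Kaggle cardiovascular-disease dataset (file name and separator are
# assumptions; change them to match the downloaded file).
df = pd.read_csv("cardio_train.csv", sep=";")

# First look at the data: shape, column types, missing values, summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe())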

4.2 EDA (EXPLORATORY DATA ANALYSIS)

EDA involves analyzing the dataset to summarize its main characteristics, often with visual
methods. This step helps in understanding the underlying patterns and relationships within the
data. Common EDA techniques include:

o Univariate analysis: examining individual features.


o Bivariate analysis: exploring relationships between pairs of features.
o Multivariate analysis: understanding interactions among multiple features.
o Visualization techniques such as histograms, box plots, scatter plots, and correlation
matrices.
o Identifying outliers and anomalies.

Fig 4.2 EDA (exploratory data analysis)
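A minimal EDA sketch along these lines might look as follows, assuming the dataset loaded earlier and the column names of the Kaggle cardiovascular dataset (age stored in days, ap_hi/ap_lo for blood pressure); the exact plots used in the project may differ.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("cardio_train.csv", sep=";")  # assumed file name and separator

# Univariate analysis: distribution of age (stored in days in this dataset).
(df["age"] / 365.25).hist(bins=30)
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.show()

# Multivariate analysis: correlation matrix of the numeric features.
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Outlier check: box plots of the blood pressure columns (names are assumptions).
df[["ap_hi", "ap_lo"]].boxplot()
plt.show()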

4.3 PREPARING FOR MODELLING

This step involves preparing the data for input into machine learning models. Key tasks include:
✓ Data preprocessing: handling missing values, encoding categorical variables, scaling
features, etc.
✓ Feature engineering: creating new features or transforming existing ones to improve
model performance.
✓ Splitting the dataset into training, validation, and testing sets.
✓ Addressing class imbalances if present.

✓ Setting evaluation metrics based on the nature of the problem (e.g., accuracy, precision,
recall, F1-score for classification; RMSE, MAE for regression).
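A sketch of these preparation steps is given below. It assumes the column names of the Kaggle dataset (an id column and a binary cardio target) and mirrors the train/test/target/target_test variable names used in the following chapters; it is illustrative rather than the project's exact code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("cardio_train.csv", sep=";")  # assumed file name and separator

# Separate features and the binary target (column names are assumptions).
X = df.drop(columns=["id", "cardio"], errors="ignore")
y = df["cardio"]

# Stratified train/test split preserves the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Feature scaling so that distance- and gradient-based models behave well.
scaler = StandardScaler()
train = scaler.fit_transform(X_train)
test = scaler.transform(X_test)
target, target_test = y_train, y_test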
Conceptual and Theoretical Study of Different Traditional ML Models:
This involves understanding the underlying principles, assumptions, strengths, and weaknesses
of various traditional machine learning models. Some common models include:
o Linear models: such as Linear Regression, Logistic Regression.
o Tree-based models: like Decision Trees, Random Forests.
o Instance-based models: such as k-Nearest Neighbors (kNN).
o Support Vector Machines (SVM).
o Naive Bayes classifiers.
o For each model, we'll want to delve into:
o How the model works.
o When and where it's applicable.
o Pros and cons.
o Common hyperparameters and tuning strategies.
o Real-world use cases and examples.

Fig 4.3.1 Preparing for modelling

Fig 4.3.2 Filtering our Dataset

Fig 4.3.3 Visualisation of our Dataset

Fig 4.3.4 Analysing our dataset

Fig 4.3.5 study and analysis of dataset

Model selection criteria
✓ Problem Type: Classification and Regression.
✓ Aim: Identify the relationship between the output (cardio_target value) and other
features (Gender, Age, Port...).
✓ Machine Learning Category: Supervised Learning.
✓ Based on the above criteria, we select from a pool of 60+ predictive modeling
algorithms those that are suitable for both classification and regression tasks in
supervised learning and also meet the criteria for evaluation.

Fig 4.3.6 Models selection criteria and segregation of our models

CHAPTER 5- MACHINE LEARNING MODELS

5.1 LOGISTIC REGRESSION

Fig 5.1.1 Logistic regression flowchart

Logistic regression measures the relationship between the categorical dependent variable


(cardio) and one or more independent variables (features).

✓ Estimates probabilities using a logistic function, which is the cumulative logistic


distribution. The logistic function is a sigmoid function, which takes any real input t
and outputs a value between zero and one.

Fig 5.1.2 sigmoid function curve used for prediction

✓ So, t in the logistic function represents the linear combination of the input features
with their respective coefficients: t = b0 + b1*x1 + b2*x2 + … + bk*xk
✓ The logistic regression algorithm estimates the coefficients (b0,b1,b2,…,bk) during the
training process to best fit the data.

Accuracy of the logistic regression model
✓ Creating an instance of the LogisticRegression class from a machine learning library,
likely scikit-learn. This instance will be used to represent and train the logistic
regression model.
✓ The fit method is used to train the logistic regression model. It takes two arguments:
o train: This is the input data (features) used for training the model.
o target: These are the target values or labels corresponding to the input data.
✓ The model learns to make predictions based on the relationship between the features
and the target values.
✓ The score function determines the accuracy of the model on the training data, represented
as a decimal between 0 and 1. The score method internally makes predictions using the
trained model on the input data (train) and then compares these predictions to the actual
target values (target). It counts the number of correct predictions and divides it by the
total number of predictions.
Accuracy = Number of Correct Predictions / Total Number of Predictions
✓ This accuracy is then multiplied by 100 to convert it into a percentage. The round
function is applied to round the accuracy to two decimal places for better readability.
✓ The final result, acc_log, is the accuracy of the logistic regression model on the training
data, expressed as a percentage rounded to two decimal places.
✓ Similarly, acc_test_log is calculated for the test data; thus we get two accuracy values,
acc_log and acc_test_log, for this ML model, which are plotted during model
evaluation and documentation of our project.

Fig 5.1.3 logistic regression Accuracy-google-colab
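A minimal sketch of the training and accuracy computation described above, assuming the train/test/target/target_test arrays prepared in Chapter 4 (the project's actual code is shown in the Colab screenshot and may differ slightly):

from sklearn.linear_model import LogisticRegression

# Train the logistic regression model on the prepared training data.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(train, target)

# Training and test accuracy as percentages rounded to two decimals,
# mirroring acc_log and acc_test_log.
acc_log = round(logreg.score(train, target) * 100, 2)
acc_test_log = round(logreg.score(test, target_test) * 100, 2)
print(acc_log, acc_test_log)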

5.2 SUPPORT VECTOR MACHINES

SVM is a supervised learning algorithm that aims to find a hyperplane in an


N-dimensional space (where N is the number of features) that separates data points of
different classes with the maximum margin.

Fig 5.2.1 SVM flowchart

✓ The hyperplane is the decision boundary that separates the data into classes.


✓ The margin is the distance between the hyperplane and the nearest data point from
each class. SVM aims to maximize this margin.
✓ Support vectors are the data points that are closest to the decision boundary
(hyperplane). These are crucial for defining the optimal hyperplane and maximizing
the margin.
✓ The kernel trick is a method to transform the input data into a higher-dimensional
space, making it easier to find a hyperplane in that space.

Objective Function: The goal is to find a hyperplane defined by w⋅x + b = 0 that best separates


the data into two classes. The objective function to minimize is (1/2)∥w∥².

Constraint: Subject to the constraints yi(w⋅xi + b) ≥ 1 for all data points i. This ensures that
each data point is correctly classified and lies on the correct side of the hyperplane.

Optimization: The optimization involves finding the values of w and b that minimize the
objective function while satisfying the constraints. This happens during training of our model
on dataset_train.

Fig 5.2.2 Hyperplane, Decision boundary, Classes representation-SVM


Hyperplane Equation:

The equation of the hyperplane in a 12-dimensional space is represented as:

f(x)=w⋅x+b
• f(x): Decision function
• w: Weight vector (in 12-dimensional space)
• x: Input features (12-dimensional vector)
• b: Bias term
The margin is the distance between the hyperplane and the nearest data point from each class.
It is calculated as the inverse of the norm of the weight vector: Margin = 1/∥w∥

Support Vectors:

Support vectors are the data points that are closest to the hyperplane and influence its


position. They satisfy the equation: yi(w⋅xi + b) = 1

Hyperparameter C:

The regularization parameter C is introduced to control the trade-off between achieving a wide
margin and allowing for some misclassifications. A smaller C emphasizes a wider margin, and
a larger C allows for a narrower margin but fewer misclassifications.

Decision Function:
The decision function for predicting new data points is: f(x)=w⋅x+b

Decision Rule
• If f(x)≥0, predict positive class.
• If f(x)<0, predict negative class.

Fig 5.2.3 SVM accuracy-google colab
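The SVM step can be sketched as below, assuming the same train/test arrays as before; scikit-learn's SVC uses an RBF kernel by default, and the C value here is illustrative rather than the one actually used in the project.

from sklearn.svm import SVC

# Default SVC classifier; C controls the margin/misclassification trade-off.
svm_clf = SVC(C=1.0)
svm_clf.fit(train, target)

acc_svm = round(svm_clf.score(train, target) * 100, 2)
acc_test_svm = round(svm_clf.score(test, target_test) * 100, 2)
print(acc_svm, acc_test_svm)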

5.3 LINEAR SVC

Fig 5.3.1 LSVC flowchart


Linear SVC finds the linear hyperplane that best separates the data into two classes.

Decision Function: f(x)=w⋅x+b

Objective Function: Minimize (1/2)∥w∥² subject to yi(w⋅xi + b) ≥ 1 for all data points.

Support Vectors: Data points closest to the hyperplane satisfy yi(w⋅xi + b) = 1.


Hyperparameter C: Controls the trade-off between achieving a wider margin and allowing
for misclassifications.

Prediction:
✓ If f(x)≥0, predict positive class.
✓ If f(x)<0, predict negative class.

Fig 5.3.2 General SVM→Linear SVC

Fig 5.3.3 Linear Hyperplane and segregation of classes

Creating an SVM Model: This initializes an SVM model using the default settings (in
scikit-learn, the SVC class implements the Support Vector Machine classifier).
Training the Model: This trains the SVM model on the training data (train) with corresponding
labels (target).
Calculating Training Accuracy: The score method calculates the accuracy of the model on
the training set. It compares the predicted labels to the actual labels (target). The result is
multiplied by 100 for a percentage and rounded to two decimal places.
Calculating Test Accuracy: Similar to the training accuracy, this line calculates the accuracy
of the model on the test set (test) using the corresponding labels (target_test).
Accuracy = Number of Correct Predictions / Total Number of Predictions
Thus we obtain two accuracy values for each model, i.e. SVM and Linear SVC (acc_svm,
acc_test_svm, acc_svc, acc_test_svc), which we will plot in a graph for comparative analysis of
the different models in cardiac disease prediction.

Fig 5.3.4 LSVC Accuracy-google colab
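A corresponding sketch for the linear variant, again under the assumption of the shared train/test arrays; max_iter is an illustrative choice to help convergence on a large dataset.

from sklearn.svm import LinearSVC

# Linear-kernel SVM, computationally efficient for large, linearly separable data.
linear_svc = LinearSVC(C=1.0, max_iter=10000)
linear_svc.fit(train, target)

acc_svc = round(linear_svc.score(train, target) * 100, 2)
acc_test_svc = round(linear_svc.score(test, target_test) * 100, 2)
print(acc_svc, acc_test_svc)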

5.4 K-NEAREST NEIGHBOURS ALGORITHM WITH GRIDSEARCHCV

Fig 5.4.1 KNN flowchart

The k-Nearest Neighbors (k-NN) algorithm is a simple and effective classification algorithm
used for both binary and multiclass classification problems. It works based on the assumption
that similar data points belong to the same class.
Training Phase:
• In the training phase of the k-NN algorithm, the model simply memorizes the training
dataset.
• For each data point in the training set, the algorithm stores the features and their
corresponding class labels.
Prediction Phase:
• During the prediction phase, when a new data point is given, the algorithm calculates
the distances between this data point and all other data points in the training set.
• It then selects the k nearest data points (neighbors) based on some distance metric
(commonly Euclidean distance).

• Finally, the algorithm assigns the class label to the new data point based on the
majority class among its k nearest neighbors.
Given a dataset with N samples and 12 features denoted as xi, and a binary target variable yi
(cardio), the steps for k-NN classification are as follows:

Calculate distances: Compute the distance between the new data point xnew and each xi in
the training set. The Euclidean distance is commonly used:

Distance(x_new, x_i) = √(Σ_j (x_new,j - x_i,j)²), where j varies from 1 to 12

Find nearest neighbors: Select the k data points with the smallest distances to xnew.

Majority voting:

✓ Determine the class label for xnew by taking a majority vote among the classes of its k
nearest neighbors.
✓ If k=1, the class label of the nearest neighbor is assigned to xnew.
✓ If k>1, the class with the highest count among the k nearest neighbors is assigned to
xnew.

The score method internally makes predictions using the trained model on the input data (train)
and then compares these predictions to the actual target values (target). It counts the number of
correct predictions and divides it by the total number of predictions.

Accuracy = Number of Correct Predictions / Total Number of Predictions


• This accuracy is then multiplied by 100 to convert it into a percentage.
• The round function is applied to round the accuracy to two decimal places for better
readability.
• The final result, KNN_log, is the accuracy of the k-NN model on the
training data, expressed as a percentage rounded to two decimal places.

Similarly, KNN_test_log is calculated for the test data; thus we get two accuracy values,
KNN_log and KNN_test_log, for this ML model, which are plotted during model evaluation and
documentation of our project.

Fig 5.4.2 KNN accuracy-google colab
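The k search with GridSearchCV can be sketched as follows; the grid of k values and the 5-fold cross-validation are illustrative assumptions, not necessarily the settings used in the project.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validated search for the best number of neighbours.
param_grid = {"n_neighbors": [3, 5, 7, 9, 11]}
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
knn_search.fit(train, target)

best_knn = knn_search.best_estimator_
KNN_log = round(best_knn.score(train, target) * 100, 2)
KNN_test_log = round(best_knn.score(test, target_test) * 100, 2)
print(knn_search.best_params_, KNN_log, KNN_test_log)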

5.5 NAIVE BAYES CLASSIFIER

Fig 5.5.1 NBC flowchart


Naive Bayes is a simple probabilistic classifier based on Bayes' theorem that is used for binary
classification in our context of cardiac disease detection.
Assumptions:
✓ Features are independent
✓ Features are normally distributed

Fig 5.5.2 Classification and decision boundary using NBC


Calculate the likelihood using the Gaussian probability density function for all features
(independent variables) with respect to both classes.

Calculate the posterior probability using Bayes' theorem for both classes.

Prediction: the input is assigned to the class with the higher posterior probability.
The score method calculates accuracy = (number of correct predictions / total number of predictions) * 100.

We thus get two accuracy values, one for the training dataset and one for the test dataset
(Acc_NBC and Acc_test_nbc), which are plotted on a graph for visual comparative
analysis of the different models.

Fig 5.5.3 NBC accuracy- google colab
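A Gaussian Naive Bayes sketch consistent with the assumptions above (independent, normally distributed features), using the same prepared arrays:

from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes: per-feature likelihoods modelled with a normal distribution.
nbc = GaussianNB()
nbc.fit(train, target)

Acc_NBC = round(nbc.score(train, target) * 100, 2)
Acc_test_nbc = round(nbc.score(test, target_test) * 100, 2)
print(Acc_NBC, Acc_test_nbc)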

5.6 PERCEPTRON (SINGLE LAYER NEURAL NETWORK)

Fig 5.6.1 Perceptron flowchart


Initialization:
✓ The perceptron consists of input nodes (x1,x2,...,xn), each representing a feature of the
input data.
✓ It also has a single output node (y), which produces the final output of the network.
✓ Weights and bias:
✓ Each input node is associated with a weight (w1,w2,...,wn), representing the importance
of that feature.

✓ Additionally, there's a bias term (b) added to the weighted sum, allowing the model to
learn an offset.
Weighted sum: The perceptron computes the weighted sum of the input features along with
the bias term: z = w1*x1 + w2*x2 + … + wn*xn + b.

Activation Function:
• The weighted sum z is passed through an activation function to produce the output of
the Perceptron.
• In the case of the Perceptron, the activation function is a step function, which produces
binary outputs: the prediction ŷ is 1 if z ≥ 0, and 0 otherwise.

• During training, the Perceptron adjusts its weights and bias based on the error between
its prediction and the true label.
• The weights are updated using the perceptron learning rule:
wj ← wj + η(y - ŷ)xj and b ← b + η(y - ŷ), where η is the learning rate.

Prediction:
• Once trained, the Perceptron can make predictions by applying the learned weights and
bias to new instances and applying the step function to the weighted sum.

The score method calculates accuracy = (number of correct predictions / total number of predictions) * 100.


We thus get two accuracy values, one for the training dataset and one for the test dataset (Acc_P
and Acc_test_P), which are plotted on a graph for visual comparative analysis of the
different models.

Fig 5.6.2 Perceptron accuracy google colab
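A sketch of the perceptron step, assuming the shared train/test arrays; max_iter and tol are illustrative defaults rather than the project's exact settings.

from sklearn.linear_model import Perceptron

# Single-layer perceptron with a step activation and the perceptron learning rule.
perceptron = Perceptron(max_iter=1000, tol=1e-3, random_state=42)
perceptron.fit(train, target)

Acc_P = round(perceptron.score(train, target) * 100, 2)
Acc_test_P = round(perceptron.score(test, target_test) * 100, 2)
print(Acc_P, Acc_test_P)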

5.7 STOCHASTIC GRADIENT DESCENT CLASSIFIER

Fig 5.7.1 SGDC flowchart


SGD is an optimization algorithm used to minimize the loss function. It updates the model
parameters iteratively by computing the gradient of the loss function with respect to the
parameters and moving them in the opposite direction of the gradient.
Logistic Regression Model:
Start by defining the logistic regression model. In logistic regression, we model the probability
that a given input belongs to a certain class (in our case, the probability of having cardiac
disease).
The logistic regression model predicts the probability P(y=1|x), where y is the target variable
(cardiac disease) and x is the input feature vector.
Loss Function:
The loss function for logistic regression is the binary cross-entropy loss, given by:
L(y, p) = -[ y*log(p) + (1 - y)*log(1 - p) ], where p = P(y=1|x) is the predicted probability.

We implement logistic regression and SGD together. At each iteration, we randomly sample one
training example (hence the term "stochastic"), compute the gradient of the loss function with
respect to the parameters using that example, and update the parameters using the computed
gradient.
Here's a summary of the steps:

Initialize the parameters (weights) of the logistic regression model.
For each training example:
• Compute the predicted probability using the current parameters.
• Compute the gradient of the loss function with respect to the parameters.
• Update the parameters using SGD.
• Repeat the process for a fixed number of iterations or until convergence.

Gradient Descent:
SGD updates the weights iteratively to minimize the loss function. At each iteration, it
calculates the gradient of the loss function with respect to the weights and updates the weights
in the opposite direction of the gradient.

The score method calculates accuracy = (number of correct predictions / total number of predictions) * 100.


We thus get two accuracy values, one for the training dataset and one for the test dataset
(ACC_SGD and Acc_test_SGD), which are plotted on a graph for visual comparative
analysis of the different models.

Fig 5.7.2 SGDC accuracy -google colab
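The combination of logistic regression and SGD described above corresponds to scikit-learn's SGDClassifier with a logistic loss; a minimal sketch follows (the loss name is "log_loss" in recent scikit-learn releases and "log" in older ones).

from sklearn.linear_model import SGDClassifier

# Logistic regression trained by stochastic gradient descent.
sgd = SGDClassifier(loss="log_loss", max_iter=1000, random_state=42)
sgd.fit(train, target)

ACC_SGD = round(sgd.score(train, target) * 100, 2)
Acc_test_SGD = round(sgd.score(test, target_test) * 100, 2)
print(ACC_SGD, Acc_test_SGD)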

5.8 DECISION TREE CLASSIFIER

Fig 5.8.1 DTC flowchart


The decision tree classifier creates the classification model by building a decision tree. Each
node in the tree specifies a test on an attribute, and each branch descending from that node
corresponds to one of the possible values for that attribute.

Fig 5.8.2 DTC schematic workflow


Entropy and Information Gain:
• Entropy: In decision trees, entropy measures impurity or randomness in a dataset.
Mathematically, for a dataset S it is defined as H(S) = -Σ_i p_i log2(p_i), where p_i is the
proportion of instances in S belonging to class i.

• Information Gain: Decision trees aim to maximize information gain at each split.
Information gain is the reduction in entropy achieved by splitting the dataset on a
particular feature A. It is calculated as IG(S, A) = H(S) - Σ_v (|S_v| / |S|) H(S_v), where
S_v is the subset of S for which feature A takes value v.

Building the Tree:


Splitting: The decision tree algorithm recursively splits the dataset into subsets based on
features that maximize information gain.
Stopping Criteria: The tree-building process stops when one of the following conditions is
met:
✓ All instances in a node belong to the same class.
✓ No more features to split on.
✓ A predefined maximum depth is reached.
✓ A predefined minimum number of instances in a node is reached.
Classification: Each leaf node of the tree represents a class label. Classification is performed
by traversing the tree from the root to a leaf node based on the values of features in the input
instance.
Decision Tree Classifier uses entropy and information gain to recursively split the dataset into
subsets, creating a tree structure where each node represents a feature and each leaf node
represents a class label. Classification is performed by traversing the tree based on the values
of features in the input instance, ultimately reaching a leaf node that predicts the class label.
The score method calculates accuracy = (number of correct predictions / total number of predictions) * 100.
We thus get two accuracy values, one for the training dataset and one for the test dataset
(ACC_DTC and Acc_test_DTC), which are plotted on a graph for visual comparative
analysis of the different models.

Fig 5.8.3 DTC accuracy google colab
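A decision tree sketch that matches the entropy/information-gain description; criterion="entropy" and the depth limit are assumptions made for illustration, not the project's confirmed settings.

from sklearn.tree import DecisionTreeClassifier

# Entropy criterion maximizes information gain at each split; max_depth limits overfitting.
dtc = DecisionTreeClassifier(criterion="entropy", max_depth=10, random_state=42)
dtc.fit(train, target)

ACC_DTC = round(dtc.score(train, target) * 100, 2)
Acc_test_DTC = round(dtc.score(test, target_test) * 100, 2)
print(ACC_DTC, Acc_test_DTC)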

5.9 RANDOM FORESTS CLASSIFIER WITH GRIDSEARCHCV

Fig 5.9.1 RFC flowchart


Random Forests:
✓ Ensemble Learning: Random Forests are an ensemble learning method that builds
multiple decision trees and combines their predictions to improve accuracy and reduce
overfitting.
✓ Bagging (Bootstrap Aggregating): Each decision tree is trained on a bootstrap sample
(randomly sampled with replacement) of the original dataset. This introduces
randomness and diversity among the trees.
✓ Feature Randomness: At each split in a decision tree, only a subset of features is
considered for splitting. This further increases the diversity among the trees and helps
prevent overfitting.

GridSearchCV:
✓ Hyperparameter Tuning: GridSearchCV systematically searches through a predefined
grid of hyperparameters to find the combination that yields the best model performance.
✓ Cross-Validation (CV): The dataset is split into k equal-sized folds. The model's
performance is evaluated using cross-validation, where the data is repeatedly split into
training and validation sets to estimate generalization performance. A scoring
metric (e.g., accuracy, F1-score) is used to evaluate the performance of each model.
GridSearchCV selects the model with the highest average score across all cross-
validation folds.

Training Process:
✓ Decision Tree Training: Each decision tree in the Random Forest is trained using a
subset of the data and a subset of features, determined by the bagging and feature
randomness processes. The goal of each decision tree is to minimize impurity (e.g.,
Gini impurity or entropy) at each split.
✓ Combining Predictions: Predictions from all decision trees are combined through
majority voting (for classification) to produce the final prediction of the Random Forest.

Hyperparameters: The performance of the Random Forest model depends on


hyperparameters such as the number of trees (n_estimators), maximum depth of trees
(max_depth), minimum samples required to split an internal node (min_samples_split), and
minimum samples required to be at a leaf node (min_samples_leaf).

Grid Search: GridSearchCV systematically searches through combinations of these


hyperparameters to find the optimal set that maximizes the chosen scoring metric.

Mathematical Considerations:
✓ Grid Search Mathematics: Involves exhaustively testing combinations of
hyperparameters and evaluating model performance using cross-validation and a
specified scoring metric.
✓ Decision Tree Mathematics: Each decision tree minimizes impurity at each split, with
feature randomness and bagging contributing to the diversity of trees in the Random
Forest.
✓ Ensemble Learning Mathematics: Random Forests leverage the collective predictions
of multiple decision trees to achieve higher accuracy and better generalization.
The score method calculates accuracy = (number of correct predictions / total number of predictions) * 100.
We thus get two accuracy values, one for the training dataset and one for the test dataset
(ACC_RFG and Acc_test_RFG), which are plotted on a graph for visual comparative
analysis of the different models.

Fig 5.9.2 RFC accuracy google colab
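A sketch of the random forest with grid search; the hyperparameter grid below is illustrative and the grid actually used in the project may differ.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validated search over a small grid of forest hyperparameters.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [8, 12, None],
    "min_samples_split": [2, 10],
}
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
rf_search.fit(train, target)

best_rf = rf_search.best_estimator_
ACC_RFG = round(best_rf.score(train, target) * 100, 2)
Acc_test_RFG = round(best_rf.score(test, target_test) * 100, 2)
print(rf_search.best_params_, ACC_RFG, Acc_test_RFG)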

CHAPTER 6

RESULT ANALYSIS
After evaluating the accuracy of different models, the next step is to compare them. We will
plot a graph that displays the accuracy values for both the training data and the test data for
each of the nine models. This comparative analysis will provide insight into how well each
model performs on both the training and test datasets, allowing us to assess their effectiveness
in predicting cardiovascular diseases.

Fig 6.1 Graphical comparative analysis of different models
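The comparative plot can be produced along these lines, assuming the accuracy variables computed in Chapter 5 are available in the session; model labels are shortened for the x-axis and the figure styling is illustrative.

import pandas as pd
import matplotlib.pyplot as plt

# Collect the train/test accuracies of the nine models into one table.
results = pd.DataFrame({
    "Model": ["LogReg", "SVM", "LinearSVC", "KNN", "NaiveBayes",
              "Perceptron", "SGD", "DecisionTree", "RandomForest"],
    "Train accuracy": [acc_log, acc_svm, acc_svc, KNN_log, Acc_NBC,
                       Acc_P, ACC_SGD, ACC_DTC, ACC_RFG],
    "Test accuracy": [acc_test_log, acc_test_svm, acc_test_svc, KNN_test_log,
                      Acc_test_nbc, Acc_test_P, Acc_test_SGD, Acc_test_DTC,
                      Acc_test_RFG],
})

# Grouped bar chart of train vs test accuracy for each model.
results.plot(x="Model", kind="bar", figsize=(12, 6), rot=45)
plt.ylabel("Accuracy (%)")
plt.tight_layout()
plt.show()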


Based on the analysis of the plotted graph, it is evident that the Random Forest Classifier
exhibits the highest accuracy for the test data among the evaluated models. This outcome
indicates that the Random Forest Classifier is particularly effective in predicting cardiovascular
diseases when applied to unseen data. The significance of this finding lies in the model's ability
to generalize well to new data instances, which is a crucial aspect in real-world applications of
predictive modeling for healthcare.

The superior performance of the Random Forest Classifier underscores its robustness and
reliability in handling complex datasets inherent in cardiovascular disease prediction. This
model's ensemble learning approach, which combines multiple decision trees to make
predictions, allows it to capture intricate patterns and relationships within the data, leading to
enhanced predictive accuracy.

In conclusion, the selection of the Random Forest Classifier as the optimal model for
cardiovascular disease prediction based on its high accuracy on the test data reflects its potential
to contribute meaningfully to clinical practice and public health initiatives. This finding
reinforces the importance of leveraging advanced machine learning techniques in healthcare
analytics to facilitate early disease detection and improve patient outcomes.

CHAPTER 7

CONCLUSION AND FUTURE SCOPE OF WORK

Conclusion: In conclusion, the preliminary evaluation of nine machine learning models for
cardiovascular disease prediction has provided valuable insights into their performance and
suitability for this task. The analysis has demonstrated that certain models, such as the Random
Forest Classifier, exhibit promising accuracy levels on test data, highlighting their potential for
real-world applications in healthcare. However, further exploration is needed to
comprehensively assess the efficacy of all 20 models considered in this study.

Future Work: Moving forward, several avenues of future work present themselves. Firstly, the
completion of the evaluation process for the remaining 11 models will provide a comprehensive
understanding of their strengths and weaknesses in the context of cardiovascular disease
prediction. Additionally, conducting more extensive experiments, including cross-validation
and robustness testing, will further validate the performance of the selected models.

Furthermore, future research efforts should focus on refining the predictive models by
incorporating additional features, such as genetic markers, lifestyle factors, and medical
imaging data, to improve their accuracy and clinical utility. Moreover, exploring ensemble
learning techniques, model stacking, and deep learning architectures could offer opportunities
to enhance predictive performance and interpretability.

Additionally, the integration of domain knowledge and expert insights into the model
development process is crucial for ensuring the relevance and applicability of the predictive
models in real-world healthcare settings. Collaborating with healthcare professionals and
stakeholders to validate the models and incorporate clinical feedback will be essential for
translating research findings into practical clinical tools.

Overall, the completion of this midterm evaluation sets the stage for more in-depth
investigations and optimizations in future research endeavors. By leveraging a diverse set of
machine learning models and refining their methodologies, we aim to contribute to the ongoing
efforts to improve cardiovascular disease prediction and enhance patient care outcomes.

REFERENCES AND BIBLIOGRAPHY

1. https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset
2. https://huggingface.co/models
3. https://machinelearningmastery.com/start-here/
4. https://www.openml.org/
5. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools,
and Techniques to Build Intelligent Systems-book
6. Introduction to Machine Learning with Python: A Guide for Data Scientists-book

PROJECT DETAILS

Student Details
Student Name Prajwal P Prabhu
Register Number 200907012
Section / Roll No B, 3
Email Address prajwalprabhu025@gmail.com
Phone No (M) 9591051503
Project Details- Comprehensive machine learning project focusing on cardiovascular disease
prediction. Evaluated 20 diverse models, including traditional algorithms and advanced techniques,
to identify the most effective predictor. Addressed the critical need for early detection in the face of
rising global cardiovascular disease cases. Project involved data exploration, model optimization,
and in-depth evaluation metrics. Results contribute to the advancement of predictive analytics in
healthcare, aligning with the broader goal of improving patient outcomes through technology-driven
solutions
Project Title Cardiovascular Disease prediction Using Machine Learning models
Project Duration 4 Months - 5 Months
Date of reporting 19th Jan 2024
Expected date of completion of project April-May
Organization Details - Manipal institute of technology-ICT dept
Organization Name Manipal institute of technology
Full postal address with pin code Madhavnagar, MIT, Manipal - 576104
Website address https://manipal.edu/mit.html
Supervisor Details
Supervisor Name Dr. Diana Olivia
Designation Associate professor-ICT dept MIT manipal
Full contact address with pin code Dept. of ICT Engg, Manipal Institute of Technology, Manipal – 576 104 (Karnataka State), INDIA
Email address diana.olivia@manipal.edu Phone No (M) 9449219935
Internal Guide Details
Faculty Name Dr Vinod Kumar Joshi
Full contact address with pin code Dept. of E&C Engg., Manipal Institute of Technology, Manipal – 576 104 (Karnataka State), INDIA
Email address vinodkumar.joshi@manipal.edu

