A Framework for Cardiovascular Risks Prediction in T2DM using Logistic Regression and Score System

Content
• Introduction
• Challenges
• Literature Survey
• Motivation
• Proposed methodology
• Experimental result
• Conclusion
• Future scope
• Reference

Introduction
• Type 2 diabetes mellitus (T2DM) is a serious chronic disease that accounts for about 90% of
diabetes cases worldwide
• T2DM develops from a combination of lifestyle and genetic factors
• Cardiovascular disease (CVD) is a serious long-term complication of diabetes
• The risk of CVD is 2 to 4 times higher in T2DM patients
• The mortality rate of CVD in Indian T2DM patients is 70%

Challenges (open problems in this field)

• Detecting CVD with traditional procedures is cumbersome and costly

• Analysing the CVD risk factors involved in T2DM is a laborious process

• Assessing the probability of CVD in T2DM patients

Related Work
• The European Association for the Study of Diabetes recommends the Framingham model [] for
CVD risk prediction
Limitations
  • Developed for the general population
  • Underestimates the risk of CVD in T2DM patients

• The International Diabetes Federation guidelines recommend the UKPDS risk engine [] for
CVD risk prediction
Limitations
  • The discriminative performance of the model varies and its calibration is poor
  • The association between T2DM and CVD is not addressed

Related Work
Year | Author | Data Mining Algorithms | Accuracy
2007 | Yanwei X et al. | ANN, DT | 91%, 89.6%
2012 | Peter et al. | Naïve Bayes, DT, ANN | 83.70%, 76.66%, 78.14%
2012 | Maishowman et al. | SVM with Bagging algorithm | 84.1%
2013 | Sellappan et al. | Naïve Bayes, DT, ANN | 85.68%, 86.12%, 80.4%
2013 | Bahadur et al. | DT, Naïve Bayes | 99.1%, 96.5%, 88.3%
2017 | Heryadi et al. | DT | 97.3%
2017 | Zarkogianni et al. | Hybrid Wavelet Neural Network and Self-Organizing Maps | 71.48%

Table 1: Accuracy comparison of Data Mining algorithms used in CVD prediction

Motivation (flaws in the related work)
• Discovering the associations between T2DM and CVD
• Preventing the adverse impact of delayed diagnosis of CVD risks
• Predicting the prevalence of atherogenic CVD risk in T2DM patients
• Providing cost-effective treatment for the CVD risks involved in T2DM patients
• Implementing a score system that helps to diagnose and ascertain CVD risks in T2DM,
preventing costly medical complications

Proposed methodology(1/)

[Block diagram. Labels recoverable from the figure: 420 samples of T2DM patients; data
processing; 294 samples of T2DM patients; sample labelling (162 cases, 132 controls); feature
construction; original feature set; feature selection; selected features; training data;
testing data; classification algorithm; final evaluation; estimated accuracy]
Figure 1: Block diagram of the proposed framework

Proposed methodology(2/)
Data Collection
• Information on 420 T2DM patients is gathered.
• 15 significant traits of CVD are used to construct the dataset.
• The 420 samples of T2DM patients are labelled as
  • Type 2 diabetic patients with known CVD
  • Type 2 diabetic patients without CVD

Proposed methodology(3/)
Data Preparation
• Samples with missing attribute values are discarded.
• 294 samples of T2DM patients remain.
• The dataset is divided into two sets
  • The first set is used for model training.
  • The second set is used for hold-out model validation.

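The data preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the toy records, the random shuffle and the helper names `drop_incomplete` and `holdout_split` are hypothetical, since the slides do not show the dataset or say how the split was drawn.

```python
import random

def drop_incomplete(samples):
    """Keep only samples in which no attribute value is missing (None)."""
    return [s for s in samples if all(v is not None for v in s.values())]

def holdout_split(samples, n_holdout, seed=0):
    """Shuffle a copy of the samples and reserve n_holdout of them for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy records with hypothetical field names (assumption, not the study data).
records = [
    {"age": 54, "fbs": 180, "cvd": 1},
    {"age": 61, "fbs": None, "cvd": 0},   # missing FBS value, so this record is discarded
    {"age": 47, "fbs": 140, "cvd": 0},
    {"age": 66, "fbs": 210, "cvd": 1},
]
complete = drop_incomplete(records)
train, holdout = holdout_split(complete, n_holdout=1)
print(len(complete), len(train), len(holdout))
```

Fixing the seed keeps the split reproducible, which matters when the hold-out set is reused to compare models.
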
Proposed methodology(4/)
Feature Description
• A feature is an individual measurable property of a phenomenon.
• Informative features raise the performance of a classifier.
• 13 features are built from the samples of the dataset.
• The feature set is divided into two groups
  • Categorical features
  • Numerical features

Proposed methodology(4/)

Name of feature | Category | Description
Age | Numerical | Age of the patient in years
Gender | Categorical | Sex of the patient. Possible values are Male and Female
Obesity | Categorical | Whether the patient is obese. Possible values are Normal and Obese
Family history | Categorical | Family history of CVD. Possible values are Yes and No
Foot Infection | Categorical | Presence of foot infection. Possible values are Yes and No
Smoking | Categorical | Smoking habit of the patient. Possible values are Yes and No
Alcohol | Categorical | Alcohol habit of the patient. Possible values are Yes and No
Duration of T2DM | Numerical | Duration of T2DM in years

Table 2: Feature Description

Proposed methodology(5/)

Name of feature | Category | Description
Hypertension | Categorical | Whether the patient has hypertension. Possible values are Yes and No
PPBS value | Numerical | Postprandial blood sugar value of the patient
FBS value | Numerical | Fasting blood sugar value of the patient
Dyslipidemia | Categorical | Presence of dyslipidemia. Possible values are Yes and No
Cholesterol | Categorical | Presence of cholesterol. Possible values are Yes and No

Table 2: Feature Description (continued)

Proposed methodology(6/)
Classification model
• A classification model is the component of an ML algorithm that extracts patterns from data
• Several classification algorithms are used in CVD risk prediction
• Logistic regression can be used to develop predictive models
• To improve the accuracy of the classifier, the following operations are performed
  • Outlier detection
  • Multicollinearity check
  • Feature selection

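To make the role of logistic regression concrete, here is a minimal sketch that fits a binary logistic model by batch gradient descent on synthetic data. It assumes NumPy is available. The data, learning rate, iteration count and function names are illustrative assumptions; the slides do not say which software or fitting procedure the study used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

def predict_proba(w, X):
    """Predicted probability of the positive class for each row of X."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return sigmoid(Xb @ w)

# Synthetic data (assumption): two numerical predictors with known true coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(true_w[0] + X @ true_w[1:])).astype(float)

w = fit_logistic(X, y)
accuracy = np.mean((predict_proba(w, X) >= 0.5) == y)
print(w, accuracy)
```

The 0.5 cut-off used here is only a default; choosing the decision threshold from the ROC curve is a separate step.
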
Proposed methodology(7/)
Outlier detection
• Detecting outliers is a crucial task for classification models.
• An outlier is a point that is numerically distant from the other data points.
• The following methods are used to identify outliers
  • Cook's Distance
  • DFFITS

Proposed methodology(8/)
Cook's distance
• Detects outliers in multivariate data.
• Cook's Distance is defined as

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,\sigma^2}$$

where
$\hat{y}_j$ is the predicted value for observation j,
$\hat{y}_{j(i)}$ is the predicted value for observation j when the model is fitted without point i,
$\sigma^2$ is the estimate of the error variance,
$p$ is the number of independent variables in the model.

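As a minimal sketch of the Cook's distance definition given above, the code below refits a model once per left-out observation and sums the squared changes in the fitted values. It assumes NumPy and an ordinary least-squares fit rather than the logistic model of the study. The function name `cooks_distance`, the synthetic data and the choice of p as the number of fitted coefficients (intercept included, the usual convention) are illustrative assumptions.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for a least-squares fit, by explicit leave-one-out refits."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    sigma2 = np.sum((y - fitted) ** 2) / (n - p)       # error variance estimate
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # drop observation i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * sigma2)
    return d

# Synthetic data (assumption): a straight line with one injected gross outlier.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=30)
y[5] += 10.0
X = np.column_stack([np.ones(30), x])                  # intercept + one predictor

D = cooks_distance(X, y)
print(int(np.argmax(D)))
```

The refit loop mirrors the formula term by term; for large datasets the same values can be obtained without refitting, from the residuals and leverages.
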
Proposed methodology(9/)
DFFITS method
• Measures the impact of an observation on its fitted response value
• DFFITS is defined as

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{S_{(i)}\sqrt{h_{ii}}}$$

where
$\hat{y}_i$ is the predicted value for observation i,
$\hat{y}_{i(i)}$ is the predicted value for observation i when the model is fitted without point i,
$S_{(i)}$ is the estimate of the standard error computed without point i,
$h_{ii}$ is the leverage of observation i.

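The DFFITS definition can be sketched in the same way. The code assumes NumPy and an ordinary least-squares fit, not the study's logistic model. The cutoff 2*sqrt(p/n) is a commonly quoted rule of thumb and is an assumption here: the slides do not state which cutoff was applied. All names and data are illustrative.

```python
import numpy as np

def dffits(X, y):
    """DFFITS for a least-squares fit, computed by leave-one-out refits."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
    h = np.diag(H)                                     # leverages h_ii
    fitted = H @ y
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid_i = y[keep] - X[keep] @ beta_i
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p)) # standard error without point i
        out[i] = (fitted[i] - X[i] @ beta_i) / (s_i * np.sqrt(h[i]))
    return out

# Synthetic data (assumption): a straight line with one injected gross outlier.
rng = np.random.default_rng(4)
x = rng.normal(size=40)
y = 0.5 + 1.5 * x + rng.normal(scale=0.4, size=40)
y[7] -= 10.0
X = np.column_stack([np.ones(40), x])

values = dffits(X, y)
cutoff = 2 * np.sqrt(X.shape[1] / X.shape[0])          # rule-of-thumb cutoff (assumption)
suspicious = np.flatnonzero(np.abs(values) > cutoff)
print(suspicious)
```
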
Proposed methodology(10/)
Multicollinearity Check
• Multicollinearity is a significant problem for classification algorithms.
• It inflates the variance of the coefficient estimates.
• The multicollinearity check is performed using the Generalized Variance Inflation Factor
(GVIF).
• The accepted GVIF standard for detecting multicollinearity is applied.

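For a term with a single degree of freedom, the GVIF reduces to the ordinary variance inflation factor, 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it with NumPy on synthetic predictors. It is an illustration under stated assumptions: it works from the predictors alone, whereas software that derives GVIF from a fitted model's coefficient covariance may give somewhat different numbers, and the function name `vif` is hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (predictors only, no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (assumption): a and b are strongly correlated, c is independent.
rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = 0.9 * a + 0.3 * rng.normal(size=300)
c = rng.normal(size=300)
X = np.column_stack([a, b, c])

v = vif(X)
print(np.round(v, 2))
```

A value close to 1 means the predictor is nearly uncorrelated with the rest; the two correlated columns show clearly inflated values.
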
Proposed methodology(11/)
Feature Selection
• The process of identifying a subset of features
• Removes unnecessary features and increases the quality of prediction
• Finds the most parsimonious model in terms of performance and computational cost
• We adopt three wrapper-model methods for Feature Selection
  • Forward Selection
  • Backward Deletion
  • Stepwise Regression

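Backward Deletion can be sketched as a loop that starts from the full feature set and keeps dropping the feature whose removal improves a selection criterion most. The slides do not name the criterion, so this sketch assumes AIC computed from a least-squares fit with NumPy; a logistic-likelihood criterion would be the natural replacement for the study's classifier. The function names and data are illustrative.

```python
import numpy as np

def aic_linear(X, y, cols):
    """AIC of a least-squares fit on the chosen columns plus an intercept (lower is better)."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return n * np.log(rss / n) + 2 * A.shape[1]

def backward_deletion(X, y, score=aic_linear):
    """Start from all features; repeatedly drop the one whose removal lowers the score most."""
    selected = list(range(X.shape[1]))
    best = score(X, y, selected)
    while len(selected) > 1:
        trials = [(score(X, y, [c for c in selected if c != d]), d) for d in selected]
        trial_score, drop = min(trials)
        if trial_score >= best:          # no removal improves the criterion, so stop
            break
        best = trial_score
        selected.remove(drop)
    return selected

# Synthetic data (assumption): only columns 0 and 2 carry signal.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

kept = backward_deletion(X, y)
print(kept)
```

Forward Selection runs the same loop in the opposite direction, and Stepwise Regression allows both additions and removals at each step.
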
Proposed methodology(12/)

[Diagram. Labels recoverable from the figure: training set; feature generation; features;
feature selection; label information; learning algorithm; classifier]
Figure 2: General Feature Selection framework for Classification model

Proposed methodology(13/)
ROC Curve
• Demonstrates the diagnostic ability of a binary classifier
• Each point represents a sensitivity/specificity pair corresponding to a particular decision
threshold
• Identifies the optimum decision threshold for the proposed classifier

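One way to read an optimum threshold off the ROC curve is to scan the candidate thresholds and keep the one that maximises Youden's J (sensitivity + specificity - 1). The slides do not say which optimality criterion was used, so Youden's J is an assumption here, as are the toy scores and the function names.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are predicted positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 0)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(scores, labels):
    """Scan every observed score as a candidate threshold and maximise Youden's J."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens, spec = sens_spec(scores, labels, t)
        j = sens + spec - 1.0
        if j > best_j:                    # ties keep the lower threshold
            best_t, best_j = t, j
    return best_t, best_j

# Toy predicted probabilities and true labels (assumption).
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
t, j = best_threshold(scores, labels)
print(t, j)   # 0.4 0.75
```

On this toy data, thresholds 0.40 and 0.60 tie at J = 0.75; the scan keeps the lower one, which favours sensitivity over specificity.
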
Proposed methodology(14/)
Model Validation
• The process of determining how closely the model resembles the real data.
• It can be performed using two types of dataset
  • Hold-out dataset
  • Out-of-sample dataset
• Hold-out model validation uses 120 samples of T2DM patients
• Out-of-sample model validation uses 90 samples from another dataset

Proposed methodology(15/)
Current scenario
• A marked rise in CVD risks among T2DM patients is observed in India
• The majority of such cases remain undetected and untreated.
• The cost of traditional CVD risk investigation is exorbitant
• This increases the societal and economic burden.

Proposed methodology(16/)
Name of test | Average price | Purpose
TMT | Rs. 1800 | To assess the functionality of heart and blood vessels
Holter Monitoring | Rs. 2000 | To assess heart abnormalities
Echocardiogram (ECHO) | Rs. 1600 | To monitor heart rate and heart valves
Electrocardiogram (ECG) | Rs. 400 | To detect irregularities in heart rhythm
Coronary angiography | Rs. 6000 | To check blood vessel problems and heart abnormalities
Carotid ultrasound | Rs. 1000 | To assess the risk of stroke

Table 3: Economic burden (macro level): Cost of Different Tests in CVD Complications

Proposed methodology(17/17)
Score System
• Categorizes the likelihood of CVD complications in T2DM
• The score for a numerical feature is determined using the Logistic regression curve
• The score for a categorical feature is determined by observing the associations between
features
• The total score is calculated by adding the scores of each feature
• A cost-effective tool for screening CVD risks in T2DM patients

Experimental Analysis(1/)
Outlier Detection analysis
Cook's distance method flags the following observations as suspicious
28, 172, 67, 265, 102, 58, 70, 215

Figure 3: Outlier detection using Cook's Distance method

Experimental Analysis(2/)
DFFITS method flags the following observations as suspicious
28, 172, 67, 265, 102, 58, 70, 215

Figure 4: Outlier detection using DFFITS method

Experimental Analysis(3/)
Multicollinearity Check analysis
Features | GVIF | Df | GVIF^(1/(2*Df))
Age | 1.6569 | 1 | 1.2872
Gender | 3.7622 | 1 | 1.9396
Family history | 1.1565 | 1 | 1.0754
Foot infection | 1.1257 | 1 | 1.0610
Smoking | 3.7245 | 1 | 1.9299
Alcohol | 1.8232 | 1 | 1.3502
Duration of diabetes | 1.1666 | 1 | 1.0801
Obesity | 1.5729 | 1 | 1.0582
Hypertension | 1.1694 | 1 | 1.0814
FBS Value | 2.0692 | 1 | 1.4385
Dyslipidemia | 1.3333 | 1 | 1.1546

Table 4: Multicollinearity Check analysis
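
Because every feature in Table 4 has Df = 1, the last column is simply the square root of the GVIF column, which makes the table easy to re-check. The short script below recomputes it from the values listed in Table 4. The tolerance of 5e-4 is an assumption chosen to allow for four-decimal rounding. All rows reproduce except Obesity, where the square root of 1.5729 is about 1.2542 rather than 1.0582, so one of the two Obesity entries may have been mistranscribed.

```python
# (feature, GVIF, Df, reported GVIF^(1/(2*Df))) as listed in Table 4
rows = [
    ("Age", 1.6569, 1, 1.2872),
    ("Gender", 3.7622, 1, 1.9396),
    ("Family history", 1.1565, 1, 1.0754),
    ("Foot infection", 1.1257, 1, 1.0610),
    ("Smoking", 3.7245, 1, 1.9299),
    ("Alcohol", 1.8232, 1, 1.3502),
    ("Duration of diabetes", 1.1666, 1, 1.0801),
    ("Obesity", 1.5729, 1, 1.0582),
    ("Hypertension", 1.1694, 1, 1.0814),
    ("FBS Value", 2.0692, 1, 1.4385),
    ("Dyslipidemia", 1.3333, 1, 1.1546),
]

def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)), the df-adjusted value shown in the last column."""
    return gvif ** (1.0 / (2 * df))

TOLERANCE = 5e-4   # allows for rounding to four decimals (assumption)
mismatches = []
for name, gvif, df, reported in rows:
    recomputed = adjusted_gvif(gvif, df)
    if abs(recomputed - reported) >= TOLERANCE:
        mismatches.append(name)
    print(f"{name:22s} recomputed={recomputed:.4f} reported={reported:.4f}")
print("rows that do not reproduce:", mismatches)
```
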


Experimental Analysis(4/)
• The Backward Deletion method selects the following features from the original set of
features
  • Age
  • Gender
  • Family history
  • Foot infection
  • Smoking
  • Duration of diabetes
  • Obesity
  • Hypertension
  • FBS Value
  • Dyslipidemia
  • Cholesterol

Experimental Analysis(5/)

Table 5: Summary of Backward Deletion method

Table 6: Performance comparison of different Feature Selection methods

Experimental Analysis(6/)

Figure 5: Gender (T2DM patient) vs. Model Response

Experimental Analysis(7/)

Figure 6: Smoking habit (T2DM patient) vs. Model Response

Experimental Analysis(8/)

Figure 7: Family history (T2DM sample) vs. Model Response

Experimental Analysis(9/)

Figure 8: Dyslipidemia level (T2DM sample) vs. Model Response

Experimental Analysis(10/)

Figure 9: Cholesterol level (T2DM sample) vs. Model Response

Experimental Analysis(11/)
ROC Curve Analysis

Figure 10: ROC Curve Analysis

Experimental Analysis(12/)
• Sensitivity and specificity table (to be added)

Experimental Analysis(13/)
Score System analysis
• The Score system is developed using the features identified by the Backward Deletion or
Stepwise Regression method.

• The score for each individual risk factor is assigned with utmost care

• The efficacy of the Score system is evaluated using the in-sample dataset and the
out-of-sample dataset.

Experimental Analysis(14/)

Name of features | Score
Age |
Age < 41 years | 1
Age < 51 years | 3
Age < 61 years | 5.5
Age < 71 years | 7.5
Age < 100 years | 8.6
Gender |
Male | 7
Female | 8

Table 7: Score Assignment for Individual Factors

Experimental Analysis(15/)

Name of features | Score
Family history |
If Family history is present and Gender of the patient is Male | 5.5
If Family history is present and Gender of the patient is Female | 6
Smoking | 5.5
If the Gender of patient is Female and Smoking habit is level 1 | 8
If the Gender of patient is Male and Smoking habit is level 1 | 7
Duration of diabetes |
Duration < 11 years | 2.75
Duration < 21 years | 6.75

Table 7: Score Assignment for Individual Factors (continued)

Experimental Analysis(16/)
Name of features | Score
FBS value |
FBS value < 201 | 2.5
FBS value < 351 | 6.5
FBS value < 601 | 8.75
Dyslipidemia |
If Dyslipidemia is present and Family history is level 1 | 6
If Dyslipidemia is present and Family history is level 0 | 5
Foot Infection |
If Foot Infection is present and Gender of the patient is Female |
If Foot Infection is present and Gender of the patient is Male |

Table 7: Score Assignment for Individual Factors (continued)

Experimental Analysis(17/)
Name of features | Score
Cholesterol |
If Cholesterol is present and Smoking habit is level 1 and Gender of the patient is Male and Age is < 44 years | 7
If Cholesterol is present and Smoking habit is level 1 and Gender of the patient is Male and Age is >= 45 years | 8
If Cholesterol is present and Smoking habit is level 0 and Gender of the patient is Male and Age is < 44 years | 6
If Cholesterol is present and Smoking habit is level 0 and Gender of the patient is Male and Age is >= 45 years | 7
If Cholesterol is present and Smoking habit is level 1 and Gender of the patient is Female and Age is < 54 years | 7
If Cholesterol is present and Smoking habit is level 1 and Gender of the patient is Female and Age is >= 55 years | 8

Table 7: Score Assignment for Individual Factors (continued)

Experimental Analysis(18/)
Name of features | Score
If Cholesterol is present and Smoking habit is level 0 and Gender of the patient is Female and Age is < 54 years | 6
If Cholesterol is present and Smoking habit is level 0 and Gender of the patient is Female and Age is >= 55 years | 7
Obesity |
If Obesity is present and Gender of the patient is Male |
If Obesity is present and Gender of the patient is Female |
Hypertension |
If Hypertension is present and Gender of the patient is Female |
If Hypertension is present and Gender of the patient is Female |

Table 7: Score Assignment for Individual Factors (continued)
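
The additive scoring of Table 7 can be sketched as a band lookup followed by a sum. This is a partial illustration only: it covers the four factors whose scores do not depend on other features (Age, Gender, Duration of diabetes, FBS value), and it assumes the bands are read as successive upper bounds, so that "Age < 51 years" means ages 41 to 50. The function names and the example patient are hypothetical.

```python
# Exclusive upper bounds and scores copied from the numerical rows of Table 7.
AGE_BANDS = [(41, 1.0), (51, 3.0), (61, 5.5), (71, 7.5), (100, 8.6)]
DURATION_BANDS = [(11, 2.75), (21, 6.75)]
FBS_BANDS = [(201, 2.5), (351, 6.5), (601, 8.75)]
GENDER_SCORE = {"Male": 7.0, "Female": 8.0}

def band_score(value, bands):
    """Return the score of the first band whose upper bound exceeds value."""
    for upper, score in bands:
        if value < upper:
            return score
    raise ValueError(f"{value} is outside the tabulated bands")

def partial_score(age, gender, duration_years, fbs):
    """Sum of four factor scores; the remaining Table 7 factors are not included."""
    return (band_score(age, AGE_BANDS)
            + GENDER_SCORE[gender]
            + band_score(duration_years, DURATION_BANDS)
            + band_score(fbs, FBS_BANDS))

# Hypothetical patient: 58-year-old woman, 12 years of diabetes, FBS of 220.
example = partial_score(age=58, gender="Female", duration_years=12, fbs=220)
print(example)   # 26.75
```

The example adds 5.5 + 8 + 6.75 + 6.5. A referral cut-off such as the score >= 50 mentioned in the discussion would apply to the full total over all factors, not to this partial sum.
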


Experimental Analysis(19/)

Model | Validation dataset | Accuracy | Sensitivity | Specificity
Classification Model | Hold-out dataset | | |
Classification Model | Out-of-sample dataset | | |
Score system | In-sample dataset | | |
Score system | Out-of-sample dataset | | |

Table 8: Performance Summary for Classification Model and Score System

Experimental Analysis(20/)

• A machine learning-based framework is developed to diagnose and ascertain CVD risks in
T2DM and so prevent costly medical complications. Logistic regression is explored as a
classifier that predicts the presence or absence of an outcome from the values of a set of
predictor variables. The proposed Score system supports cost-effective screening for CVD
risk in T2DM patients because it uses simple, safe and inexpensive measures. At an average
TMT cost of Rs. 1800 per patient, testing the whole cohort of 294 T2DM patients costs
Rs. 5,29,200. If the proposed Score system is used to screen the same population, only 51%
of the sample (150 T2DM patients), those whose score is >= 50, are referred for further
medical investigation. If these 150 patients undergo the TMT test, the cost comes to
Rs. 2,70,000. Thus, a cost reduction of almost 49% is achieved using the proposed Score
system. The study is conducted on a sample of T2DM patients in a metropolitan city of India.
Assuming that the demographics of this sample are similar to those of the rest of India, the
results of the Score system can also be extrapolated to other Indian cities.

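The cost figures quoted above can be re-derived with a few lines of arithmetic. The constants are taken from the text (TMT price, cohort size, number of referred patients); the variable names are illustrative.

```python
TMT_COST_RS = 1800      # average TMT price quoted in the text
COHORT = 294            # T2DM patients in the cohort
REFERRED = 150          # patients with a score >= 50

cost_all = COHORT * TMT_COST_RS            # everyone takes the TMT
cost_screened = REFERRED * TMT_COST_RS     # only referred patients take the TMT
referred_share = REFERRED / COHORT
saving = 1 - cost_screened / cost_all

print(cost_all, cost_screened, f"{referred_share:.1%}", f"{saving:.1%}")
# 529200 270000 51.0% 49.0%
```

Because a single test price is assumed, the saving depends only on the share of patients referred: 1 - 150/294, which is about 49%.
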
Conclusion
• Discovering the associations between T2DM and cardiac risks is the first step in building a
framework for analysing the likelihood of cardiac risks in T2DM patients. In this work, we
propose an efficient machine learning-based framework as a pilot study to predict the
probability of cardiac risks in T2DM patients without any surgical intervention. Our
framework leverages machine learning to identify the dominant traits of cardiac risk
that are associated with T2DM. We use the Logistic regression algorithm as a classifier, and
its prediction accuracy is enhanced by two statistical operations, namely Outlier Detection
and Multicollinearity check. To increase the numerical stability and generalizability of the
classification model, a Feature Selection process is used. The proposed framework is
evaluated using a hold-out dataset and an out-of-sample dataset. The proposed classification
model achieves a prediction accuracy of 88.76% on the hold-out dataset and 89.89% on the
out-of-sample dataset. The Score system achieves 93.19% prediction accuracy on the in-sample
dataset and 92.04% on the out-of-sample dataset. The obtained results indicate that the
proposed framework can be a beneficial tool for assessing the possibility of cardiac risks
in T2DM patients and providing timely treatment to T2DM patients with cardiac
symptoms.

Future Scope
• Firstly, the number of samples in the dataset needs to be increased. Although the proposed
framework achieves high prediction accuracy with 131 Cases and 162 Controls, more samples
are needed to confirm the scalability of the model.
• Secondly, the proposed framework relies on human effort to identify the significant traits
that cardiac risks produce in T2DM patients and to design features from those traits. A
substantial amount of time is spent on identifying unconventional and significant traits of
cardiac risk, but we believe that our 13 selected features can be used in other related
studies as well.
• Thirdly, in comparison with the existing literature, our classification model uses some
unconventional but significant features related to cardiac complications. Thus, a
performance-based comparison between our proposed classification model and existing
classification models is not possible. If a study focuses only on the conventional
atherogenic cardiac risks in T2DM, the existing algorithms would be a better choice. The
proposed framework would be a better choice if the quality and uniqueness of features have
higher priority.
• Finally, Obesity and Hypertension, which are associated with T2DM, are significant traits
of CVD. In our proposed framework, however, Obesity and Hypertension are removed from the
feature set by the Backward Deletion method during the Feature Selection process. The
quality of the Feature Selection process can be improved by including Obesity and
Hypertension in the feature set.

Reference
