Presentation 4

A Framework for Cardiovascular Risks Prediction in T2DM using Logistic Regression and Score System
Content
• Introduction
• Challenges
• Literature Survey
• Motivation
• Proposed methodology
• Experimental result
• Conclusion
• Future scope
• Reference
Introduction
• Type 2 diabetes mellitus (T2DM) is a life-threatening disease that accounts for about 90% of diabetes cases worldwide
• T2DM develops from a combination of lifestyle and genetic factors
• Cardiovascular disease (CVD) is a serious long-term complication of diabetes
• The risk of CVD is 2 to 4 times higher in T2DM patients
• The mortality rate of CVD in Indian T2DM patients is 70%
Challenges (open problems in this field)
• Detecting CVD with traditional procedures is cumbersome and costly
• Analysing the CVD risk factors involved in T2DM is a laborious process
• Assessing the probability of CVD in T2DM patients remains difficult
Related Work
• The European Association for the Study of Diabetes recommends FRAMINGHAM [] as a CVD risk prediction model
Limitations
  - Applicable to the general population rather than to T2DM patients specifically
  - The risk of CVD in T2DM patients is underestimated
• The guidelines of the International Diabetes Federation recommend the UKPDS risk engine [] as a CVD risk prediction model
Limitations
  - The discriminating performance of the model suffers from variation and poor calibration
  - The association between T2DM and CVD is not addressed
Related Work
Year | Author | Data Mining Algorithms | Accuracy
2007 | Yanwei X et al. | ANN, DT | 91%, 89.6%
2012 | Peter et al. | Naïve Bayes, DT, ANN | 83.70%, 76.66%, 78.14%
2012 | Maishowman et al. | SVM with Bagging algorithm | 84.1%
2013 | Sellappan et al. | Naïve Bayes, DT, ANN | 85.68%, 86.12%, 80.4%
2013 | Bahadur et al. | DT, Naïve Bayes | 99.1%, 96.5%, 88.3%
2017 | Heryadi et al. | DT | 97.3%
2017 | Zarkogianni et al. | Hybrid Wavelet Neural Network and Self-Organizing Maps | 71.48%
Table 1: Accuracy comparison of Data Mining algorithms used in CVD prediction
Motivation (flaws in the related work)
• Discovering the associations between T2DM and CVD
• Preventing the adverse impact of delayed diagnosis of CVD risks
• Predicting the prevalence of atherogenic CVD risk in T2DM patients
• Providing cost-effective treatment for the CVD risks involved in T2DM
• Implementing a score system that helps to diagnose and ascertain CVD risks in T2DM and so prevents costly medical complications
Proposed methodology(1/)
[Figure 1: Block diagram of the proposed framework. Stages shown: 420 samples of T2DM patients; data processing; 294 samples of T2DM patients; sample labelling (162 cases, 132 controls); feature construction; original feature set; feature selection; selected features; training data and testing data; classification algorithm; final evaluation; estimated accuracy]
Data Collection
• Information on 420 T2DM patients is gathered.
• 15 significant traits of CVD are used to construct the dataset.
• The 420 samples of T2DM patients are labelled as
  - Type 2 diabetic patients with known CVD
  - Type 2 diabetic patients without CVD
Data Preparation
• Samples with missing attribute values are discarded.
• 294 samples of T2DM patients remain.
• The dataset is divided into two sets
  - The first set is used for model training.
  - The second set is used for hold-out model validation.
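The two preparation steps above can be sketched in a few lines of Python. This is only an illustration: the record fields, the hold-out fraction, and the random seed are assumptions, not the values used in the study.

```python
import random

def drop_incomplete(rows):
    """Keep only the rows in which every attribute has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]

def train_holdout_split(rows, holdout_fraction, seed=0):
    """Shuffle a copy of the rows and split it into training and hold-out sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy records; the field names are illustrative only.
patients = [
    {"age": 54, "fbs": 180, "cvd": 1},
    {"age": 61, "fbs": None, "cvd": 0},   # missing FBS value, will be discarded
    {"age": 47, "fbs": 150, "cvd": 0},
    {"age": 66, "fbs": 240, "cvd": 1},
    {"age": 39, "fbs": 130, "cvd": 0},
]
complete = drop_incomplete(patients)
train, holdout = train_holdout_split(complete, holdout_fraction=0.25)
print(len(complete), len(train), len(holdout))
```

Fixing the seed keeps the split reproducible, which matters when the same hold-out set is reused to compare models.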
Feature Description
• A feature is a unique measurable property of a phenomenon.
• Informative features raise the performance level of a classifier.
• 13 features are built from the samples of the dataset.
• The entire feature set is divided into two groups
  - Categorical features
  - Numerical features
Name of feature | Category | Description
Age | Numerical | Age of the patient in years
Gender | Categorical | Sex of the patient. Possible values are Male and Female
Obesity | Categorical | Whether the patient is obese. Possible values are Normal and Obese
Family history | Categorical | Family history of CVD. Possible values are Yes and No
Foot Infection | Categorical | Presence of foot infection. Possible values are Yes and No
Smoking | Categorical | Smoking habit of the patient. Possible values are Yes and No
Alcohol | Categorical | Alcohol habit of the patient. Possible values are Yes and No
Duration of T2DM | Numerical | Duration of T2DM in years
Hypertension | Categorical | Whether the patient has hypertension. Possible values are Yes and No
PPBS value | Numerical | Postprandial blood sugar value of the patient
FBS value | Numerical | Fasting blood sugar value of the patient
Dyslipidemia | Categorical | Presence of dyslipidemia. Possible values are Yes and No
Cholesterol | Categorical | Presence of cholesterol. Possible values are Yes and No
Table 2: Feature Description
Classification model
• A classification model is the component of an ML algorithm that extracts patterns from data
• Several classification algorithms are used in CVD risk prediction
• Logistic regression can be used to develop predictive models
• To improve the accuracy of the classifier, the following operations are performed
  - Outlier detection
  - Multicollinearity check
  - Feature selection
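As a reminder of how a fitted logistic regression turns feature values into a class label, here is a minimal sketch. The coefficients, the two features, and the 0.5 cut-off are hypothetical and are not taken from the fitted model in this work.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Logistic regression: P(y = 1 | x) = 1 / (1 + exp(-(b0 + sum_k b_k * x_k)))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for two features (age in years, smoking as 0/1).
intercept = -4.0
coefficients = [0.05, 0.8]

p = predict_probability(intercept, coefficients, [60, 1])
label = 1 if p >= 0.5 else 0   # assumed decision threshold
print(round(p, 3), label)
```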
Outlier detection
• Detecting outliers is a crucial task for classification models.
• An outlier is a point that is numerically distant from the other data points.
• The following methods are used to identify outliers
  - Cook's Distance
  - DFFITS
Cook's distance
• Detects outliers in multivariate data.
• Cook's distance for observation i is defined as
$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,\hat{\sigma}^2}$$
where
$\hat{y}_j$ is the predicted value for observation j,
$\hat{y}_{j(i)}$ is the predicted value for observation j when the model is fitted without point i,
$\hat{\sigma}^2$ is the estimate of the error variance,
$p$ is the number of independent variables in the model,
$n$ is the number of observations.
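The leave-one-out definition of Cook's distance can be computed directly by refitting the model once per observation. The sketch below does this for an ordinary least-squares line with one predictor, not for the logistic model of this work, and it takes p as the number of fitted coefficients (intercept and slope), which is a common convention and an assumption here. The data are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def cooks_distances(xs, ys):
    """Cook's distance of every point, computed by refitting without it."""
    n, p = len(xs), 2  # p = fitted coefficients (intercept and slope), an assumption
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    sigma2 = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    distances = []
    for i in range(n):
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = sum((f - (a_i + b_i * x)) ** 2 for f, x in zip(fitted, xs))
        distances.append(shift / (p * sigma2))
    return distances

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]   # the last point is far off the trend
d = cooks_distances(xs, ys)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, [round(v, 3) for v in d])
```

Refitting n times is wasteful for large datasets, where the equivalent closed form based on residuals and leverages is normally used, but it mirrors the definition term by term.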
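DFFITS method
• Measures the impact of an observation on its own fitted response value
• The DFFITS value for observation i is defined as
$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_{ii}}}$$
where
$\hat{y}_i$ is the predicted value for observation i,
$\hat{y}_{i(i)}$ is the predicted value for observation i when the model is fitted without point i,
$s_{(i)}$ is the estimate of the standard error computed without point i,
$h_{ii}$ is the leverage of observation i.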
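DFFITS can be computed with the same leave-one-out refit. The sketch below again uses an ordinary least-squares line with one predictor and made-up data, so it illustrates the statistic rather than the exact procedure applied to the logistic model. The cut-off 2·sqrt(p/n) is a widely used rule of thumb and is an assumption here.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def dffits(xs, ys):
    """DFFITS of every point: scaled change in its own fitted value when it is left out."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a, b = fit_line(xs, ys)
    values = []
    for i in range(n):
        xs_i, ys_i = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        a_i, b_i = fit_line(xs_i, ys_i)
        # Residual standard error of the fit without point i (n - 1 points, 2 coefficients).
        sse_i = sum((y - (a_i + b_i * x)) ** 2 for x, y in zip(xs_i, ys_i))
        s_i = math.sqrt(sse_i / (n - 1 - 2))
        leverage = 1.0 / n + (xs[i] - mean_x) ** 2 / sxx
        change = (a + b * xs[i]) - (a_i + b_i * xs[i])
        values.append(change / (s_i * math.sqrt(leverage)))
    return values

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]   # the last point is far off the trend
values = dffits(xs, ys)
cutoff = 2 * math.sqrt(2 / len(xs))    # rule-of-thumb threshold, assumed
flagged = [i for i, v in enumerate(values) if abs(v) > cutoff]
print(flagged)
```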
Multicollinearity Check
• Multicollinearity is a significant problem in classification algorithms.
• It increases the variance of the coefficient estimates.
• The multicollinearity check is performed using the Generalized Variance Inflation Factor (GVIF).
• The accepted standard threshold for GVIF is used to detect multicollinearity.
Feature Selection
• The process of identifying a useful subset of features
• Removes unnecessary features and increases the quality of prediction
• Finds the most parsimonious model, balancing performance against computational cost
• We adopt three wrapper methods for Feature Selection
  - Forward Selection
  - Backward Deletion
  - Stepwise Regression
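A wrapper method treats the classifier as a black box and searches over feature subsets using its score. The sketch below shows greedy Forward Selection. The scoring function is a hypothetical stand-in for "train the classifier on this subset and measure its validation performance", and the feature weights are invented for the example.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedy wrapper: repeatedly add the feature that improves score() the most."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidate_score, candidate = max((score(selected + [f]), f) for f in remaining)
        if candidate_score - best <= min_gain:
            break  # no remaining feature improves the model
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best

# Hypothetical contribution of each feature to the score; "alcohol" adds nothing.
USEFUL = {"age": 0.10, "smoking": 0.06, "fbs": 0.04}

def toy_score(subset):
    """Stand-in score: rewards useful features, charges a small penalty per feature."""
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

chosen, final_score = forward_selection(["age", "alcohol", "smoking", "fbs"], toy_score)
print(chosen, round(final_score, 2))
```

Backward Deletion runs the same loop in reverse, starting from the full set and removing the feature whose removal hurts the score least. Stepwise Regression alternates the two moves.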
[Figure 2: General Feature Selection framework for Classification model. Elements shown: training set, label information, feature generation, feature selection, learning algorithm, features, classifier]
ROC Curve
• Demonstrates the diagnostic ability of a binary classifier
• Each point on the curve represents a sensitivity/specificity pair corresponding to a particular decision threshold
• Identifies the optimum decision threshold for the proposed classifier
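The sketch below computes the sensitivity/specificity pair at each candidate threshold and then picks one threshold. The selection rule used here, maximising Youden's J (sensitivity + specificity - 1), is an assumption, since the slides do not state which criterion was applied. The labels and scores are made up.

```python
def roc_points(labels, scores):
    """Return (threshold, sensitivity, specificity) for every distinct score used as a cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores)):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        tn = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 0)
        points.append((threshold, tp / positives, tn / negatives))
    return points

def best_threshold(points):
    """Pick the cut-off that maximises Youden's J = sensitivity + specificity - 1."""
    return max(points, key=lambda point: point[1] + point[2] - 1)

labels = [0, 0, 0, 1, 1, 0, 1, 1]                      # 1 = CVD, 0 = no CVD
scores = [0.1, 0.2, 0.3, 0.45, 0.5, 0.6, 0.8, 0.9]     # predicted probabilities
threshold, sensitivity, specificity = best_threshold(roc_points(labels, scores))
print(threshold, sensitivity, specificity)
```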
Model Validation
• The process of determining how closely the model resembles the real data.
• It can be performed using two types of dataset
  - Hold-out dataset
  - Out-of-sample dataset
• Hold-out model validation uses 120 samples of T2DM patients
• Out-of-sample model validation uses 90 samples from another dataset
Current scenario
• A marked rise in CVD risks among T2DM patients is observed in India
• The majority of such cases remain undetected and untreated.
• The cost of traditional CVD risk investigation is exorbitant
• This increases the societal and economic burden.
Name of Test | Average price | Purpose
TMT | Rs. 1800 | To assess the functionality of the heart and blood vessels
Holter Monitoring | Rs. 2000 | To assess heart abnormalities
Echocardiogram (ECHO) | Rs. 1600 | To monitor heart rate and heart valves
Electrocardiogram (ECG) | Rs. 400 | To detect irregularities in heart rhythm
Coronary angiography | Rs. 6000 | To check blood vessel problems and heart abnormalities
Carotid ultrasound | Rs. 1000 | To assess the risk of stroke
Table 3: Economic burden (macro level): Cost of Different Tests in CVD Complications
Proposed methodology(17/17)
Score System
• Categorizes the likelihood of CVD complications in T2DM
• The score for a numerical feature is determined using the logistic regression curve
• The score for a categorical feature is determined by observing the associations between features
• The total score is calculated by adding the scores of all features
• A cost-effective tool for screening CVD risks in T2DM patients (conclusion)
Experimental Analysis(1/
Outlier Detection analysis
Cook's distance method flags the following observations as suspicious:
28, 172, 67, 265, 102, 58, 70, 215
Figure 3: Outlier detection using Cook's Distance method
DFFITS method flags the following observations as suspicious:
28, 172, 67, 265, 102, 58, 70, 215
Figure 4: Outlier detection using DFFITS method

Multicollinearity Check analysis
Features | GVIF | Df | GVIF^(1/(2*Df))
Age | 1.6569 | 1 | 1.2872
Gender | 3.7622 | 1 | 1.9396
Family history | 1.1565 | 1 | 1.0754
Foot infection | 1.1257 | 1 | 1.0610
Smoking | 3.7245 | 1 | 1.9299
Alcohol | 1.8232 | 1 | 1.3502
Duration of diabetes | 1.1666 | 1 | 1.0801
Obesity | 1.5729 | 1 | 1.0582
Hypertension | 1.1694 | 1 | 1.0814
FBS Value | 2.0692 | 1 | 1.4385
Dyslipidemia | 1.3333 | 1 | 1.1546
Table 4: Multicollinearity Check analysis
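The last column of Table 4 is the GVIF raised to the power 1/(2·Df), which reduces to a square root when Df = 1. A small sketch of that arithmetic, using a few (GVIF, Df) pairs copied from the table, follows. The function name is illustrative.

```python
def adjusted_gvif(gvif, df):
    """Return GVIF^(1 / (2 * Df)), which makes GVIF comparable across terms with different Df."""
    return gvif ** (1.0 / (2 * df))

# (GVIF, Df) pairs copied from Table 4.
table = {
    "Age": (1.6569, 1),
    "Gender": (3.7622, 1),
    "Smoking": (3.7245, 1),
    "FBS Value": (2.0692, 1),
}
adjusted = {name: round(adjusted_gvif(gvif, df), 4) for name, (gvif, df) in table.items()}
print(adjusted)
```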

• The Backward Deletion method selects the following features from the original feature set
  - Age
  - Gender
  - Family history
  - Foot infection
  - Smoking
  - Duration of diabetes
  - Obesity
  - Hypertension
  - FBS Value
  - Dyslipidemia
  - Cholesterol
Table 5: Summary of Backward Deletion method
Table 6: Performance comparison of different Feature Selection methods
Figure 5: Gender (T2DM patient) vs. Model Response
Figure 6: Smoking habit (T2DM patient) vs. Model Response
Figure 7: Family history (T2DM sample) vs. Model Response
Figure 8: Dyslipidemia level (T2DM sample) vs. Model Response
Figure 9: Cholesterol level (T2DM sample) vs. Model Response
ROC Curve Analysis
Figure 10: ROC Curve Analysis
• Sensitivity and specificity table to be added
Score System analysis
• The score system is developed using the features identified by the Backward Deletion or Stepwise Regression method.
• The score for each individual risk factor is assigned with utmost care
• The efficacy of the score system is evaluated using the in-sample dataset and the out-of-sample dataset.
Name of features | Score
Age
Age < 41 years | 1
Age < 51 years | 3
Age < 61 years | 5.5
Age < 71 years | 7.5
Age < 100 years | 8.6
Gender
Male | 7
Female | 8
Family history
If Family history is present and Gender of the patient is Male | 5.5
If Family history is present and Gender of the patient is Female | 6
Smoking | 5.5
If the Gender of patient is Female and Smoking habit is level 1 | 8
If the Gender of patient is Male and Smoking habit is level 1 | 7
Duration of diabetes
Duration < 11 years | 2.75
Duration < 21 years | 6.75
FBS value
FBS value < 201 | 2.5
FBS value < 351 | 6.5
FBS value < 601 | 8.75
Dyslipidemia
If Dyslipidemia is present and Family history is level 1 | 6
If Dyslipidemia is present and Family history is level 0 | 5
Foot Infection
If Foot Infection is present and Gender of the patient is Female
If Foot Infection is present and Gender of the patient is Male
Cholesterol
If Cholesterol is present and Smoking habit is level 1 and Gender of the patient is Male and Age is < 44 years | 7
If Cholesterol is present and Smoking habit is level 1 and Gender of the patient is Male and Age is >= 45 years | 8
Male and Age is < 44 years
Male and Age is >= 45 years
Female and Age is < 54 years
Female and Age is >= 55 years
Female and Age is < 54 years
If Cholesterol is present and Smoking habit is level 0 and Gender of the patient is Female and Age is >= 55 years | 7
Obesity
If Obesity is present and Gender of the patient is Male
If Obesity is present and Gender of the patient is Female
Hypertension
If Hypertension is present and Gender of the patient is Female
If Hypertension is present and Gender of the patient is Female
Table 7: Score Assignment for Individual Factors
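The additive scoring can be sketched for the three numerical features, whose bins and scores are copied from Table 7. The lookup structure, function names, and the example patient are assumptions for illustration. The categorical rules would be added to the total in the same way.

```python
import bisect

# Exclusive upper bounds and the matching scores for the numerical features, from Table 7.
BINS = {
    "age": ([41, 51, 61, 71, 100], [1, 3, 5.5, 7.5, 8.6]),
    "duration": ([11, 21], [2.75, 6.75]),
    "fbs": ([201, 351, 601], [2.5, 6.5, 8.75]),
}

def bin_score(feature, value):
    """Score of the first bin whose upper bound exceeds the value."""
    bounds, scores = BINS[feature]
    index = bisect.bisect_right(bounds, value)
    if index == len(bounds):
        raise ValueError(f"{feature}={value} is outside the ranges listed in Table 7")
    return scores[index]

def numerical_score(patient):
    """Sum of the per-feature scores for the numerical features only."""
    return sum(bin_score(feature, patient[feature]) for feature in BINS)

patient = {"age": 58, "duration": 12, "fbs": 180}   # hypothetical patient
print(numerical_score(patient))
```

Because the bounds are exclusive, a patient aged exactly 41 falls into the "Age < 51 years" bin rather than the "Age < 41 years" bin.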

Model | Validation dataset | Accuracy | Sensitivity | Specificity
Classification Model | Hold-out dataset | 88.76% | |
Classification Model | Out-of-sample dataset | 89.89% | |
Score system | In-sample dataset | 93.19% | |
Score system | Out-of-sample dataset | 92.04% | |
Table 8: Performance Summary for Classification Model and Score System
• A machine learning-based framework is developed to diagnose and ascertain CVD risks in
T2DM and so prevent costly medical complications. In addition, the predictive power of
logistic regression is explored as a classifier that predicts the presence or absence of an
outcome from the values of a set of predictor variables. The proposed Score system supports
cost-effective screening for CVD risk in T2DM patients because it uses simple, safe and
inexpensive measures. Taking the average TMT cost of Rs. 1800 per T2DM patient with CVD
risks, testing the whole cohort of 294 T2DM patients costs Rs. 5,29,200. If the proposed
Score system is used to screen the same population for CVD risk, only 51% of the sample
(150 T2DM patients), those whose score is >= 50, are referred for further medical
investigation. If these 150 T2DM patients undergo the TMT test, the cost falls to
Rs. 2,70,000. Thus, a cost reduction of almost 49% is achieved using the proposed Score
system. The study is conducted on a representative sample of T2DM patients in a
metropolitan city of India, and we assume that the demographics of this sample are similar
to those of the rest of India. On that assumption, the results of the Score system can be
extrapolated to other cities of India.
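The cost figures quoted above follow from simple arithmetic, reproduced here as a check. All numbers come from the text and Table 3. The variable names are illustrative.

```python
TMT_COST_RS = 1800        # average TMT price from Table 3
cohort_size = 294         # T2DM patients in the study
referred = 150            # patients whose score is >= 50

cost_without_screening = cohort_size * TMT_COST_RS
cost_with_screening = referred * TMT_COST_RS
referred_share = referred / cohort_size
saving_share = 1 - cost_with_screening / cost_without_screening

print(cost_without_screening, cost_with_screening)
print(round(100 * referred_share), round(100 * saving_share))
```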
Conclusion
• Discovering the associations between T2DM and cardiac risks is the first step towards a
framework for analysing the likelihood of cardiac risks in T2DM patients. In this work, we
propose a machine learning based framework, as a pilot study, to predict the probability of
cardiac risks in T2DM patients without any surgical intervention. The framework uses
machine learning to identify the dominant traits of cardiac risk that are associated with
T2DM. We use logistic regression as the classifier, and its prediction accuracy is enhanced
by two statistical operations, namely Outlier Detection and the Multicollinearity check. A
Feature Selection process is used to increase the numerical stability and generalizability
of the classification model. The proposed framework is evaluated on a hold-out dataset and
an out-of-sample dataset. The classification model achieves a prediction accuracy of 88.76%
on the hold-out dataset and 89.89% on the out-of-sample dataset. The Score system achieves
93.19% prediction accuracy on the in-sample dataset and 92.04% on the out-of-sample
dataset. These results indicate that the proposed framework can be a beneficial tool for
assessing the possibility of cardiac risks in T2DM patients and for providing timely
treatment to T2DM patients with cardiac symptoms.
Future Scope
• Firstly, the number of samples in the dataset needs to be increased. Although the proposed
framework achieves high prediction accuracy with 131 Cases and 162 Controls, more samples
are needed to confirm that the model scales.
• Secondly, the proposed framework relies on human effort to identify the significant traits
of cardiac risk in T2DM patients and to design features from those traits. Substantial time
was spent identifying unconventional but significant traits, and we believe that the 13
selected features can be used in other related studies as well.
• Thirdly, compared with the existing literature, our classification model uses some
unconventional but significant features related to cardiac complications. A
performance-based comparison between our model and existing classification models is
therefore not possible. If a study focuses only on the conventional atherogenic cardiac
risks in T2DM, the existing algorithms would be a better choice. The proposed framework is
the better choice when the quality and uniqueness of features have higher priority.
• Finally, Obesity and Hypertension, which are associated with T2DM, are significant traits
of CVD. In our proposed framework, however, both are removed from the feature set by the
Backward Deletion method during Feature Selection. The Feature Selection process could be
improved by retaining Obesity and Hypertension in the feature set.
Reference