
MTH 3901 RESEARCH PROCESS IN MATHEMATICS

PREDICTION OF HEART DISEASE USING MULTIPLE LOGISTIC REGRESSION

Chin, H.Y.¹, Yeoh, R.L.W.¹, Ridzuwan, N.H.N.¹, Ramanatham, L.¹ & Mohamad Afindi, A.S.¹

¹ Universiti Putra Malaysia, Department of Mathematics

29 January 2021

Table of Contents

1 Introduction

2 Objective

3 Literature Review

4 Methodology

5 Results and Discussion

6 Future Work

7 References

1.0 Introduction

Deaths associated with cardiovascular disease have risen rapidly worldwide in recent years, making it one of the leading causes of death.

In 2016, the World Health Organisation (WHO) reported that approximately 17.9 million deaths globally were due to cardiovascular diseases, of which 15.2 million were directly attributable to heart attacks and strokes (WHO, 2017).
In 2018, ischaemic heart disease accounted for 15.9% of deaths in urban areas and 15% of deaths in rural areas of Malaysia (Department of Statistics Malaysia, 2019).

Many factors contribute to heart disease.
For instance, contributing factors include obesity, poor nutritional intake, excessive cholesterol and a lack of physical activity.
Furthermore, a range of health conditions can raise the probability of a heart attack.
Examining and identifying the primary factors behind heart disease is therefore of great importance.

Early detection of heart disease is essential to prevent unexpected deaths.
With advances in technology and data collection, heart disease can be predicted efficiently using machine learning algorithms.
Many researchers have applied common machine learning algorithms such as logistic regression, Naïve Bayes, decision trees, random forest, K-means and support vector machines to predict the occurrence of heart disease.
In this research, multiple logistic regression was chosen to predict the occurrence of heart disease.

2.0 Objective

To predict the occurrence of heart disease using multiple logistic regression.

To study which factors significantly affect heart disease.

To find the best-fitting model using diagnostic tests and forward stepwise regression.

3.0 Literature Review

No. Author(s): Findings

1. Bhatti et al. (2006): Applied logistic regression analysis to investigate the factors that most increase the risk of ischemic heart disease. Overall, 98.63% of the total cases were correctly classified.

2. Nishadi (2019): Used logistic regression with 14 independent variables to identify the most significant predictors of heart disease by evaluating the 10-year risk of Coronary Heart Disease (CHD). The accuracy of the model is about 87%.

3. Saw et al. (2019): Used a logistic regression model and found that men are more susceptible to heart disease. Age, number of cigarettes smoked and systolic blood pressure increase the risk of heart disease. The model achieved 87% accuracy.

4. Prasad et al. (2019): Predicted heart disease with various machine learning approaches (logistic regression, k-nearest neighbours, Naïve Bayes and decision tree) and compared them on model accuracy. Logistic regression achieved the highest accuracy among those models, 86.89%.

5. Mothukuri et al. (2020): Predicted heart disease using different models, including support vector machine and random forest. Logistic regression achieved the highest precision in this research, about 63.93%.

6. Lashmanarao, Swathi and Sundareswar (2019): Predicted heart disease using three sampling techniques: random oversampling, Synthetic Minority Oversampling and the Adaptive Synthetic Sampling approach. Across the three sampling techniques, logistic regression achieved only 67.5%, 68.8% and 65.7% accuracy.

7. Rajesh et al. (2018): Used the Naïve Bayes algorithm, decision trees and a combination of algorithms to forecast heart disease. Naïve Bayes performed better when the dataset was small, while decision trees gave accurate results on a large dataset.

8. Krishnan and Geetha (2019): Used decision trees and the Naïve Bayes algorithm to predict heart disease. The decision tree model achieved 91% accuracy and the Naïve Bayes model achieved 87% accuracy.

9. Yang et al. (2020): Built cardiovascular disease prediction models with several methods: multivariate regression, classification and regression tree (CART), Naïve Bayes, bagged trees, AdaBoost and random forest. Random forest achieved the highest accuracy, 78.72%.

10. Chicco and Jurman (2020): Used 10 machine learning classifiers to predict the survival of patients with heart failure. Random forests achieved the highest accuracy (74.0%).

4.0 Methodology

1 Multiple Logistic Regression Model
2 Diagnostic Test
3 Forward Stepwise Regression
4 Model Validation and Performance Analysis

4.1 Multiple Logistic Regression Model

In multiple logistic regression, let a collection of $p$ explanatory variables be denoted by the vector

$$X' = (X_1, X_2, \ldots, X_p)$$

Let the conditional probability that the outcome is present be denoted as

$$P(Y = 1 \mid X) = \pi$$

Thus, the logistic regression model can be written in terms of the log-odds of the event occurring:

$$\ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \tag{1}$$

Maximum Likelihood Estimation

Let $(y_1, y_2, \ldots, y_n)$ be $n$ independent observations corresponding to the random variables $(Y_1, Y_2, \ldots, Y_n)$.

Since each $Y_i$ is a Bernoulli random variable, the probability function of $Y_i$ is

$$f_i(Y_i) = \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}, \quad Y_i = 0 \text{ or } 1, \quad i = 1, 2, \ldots, n \tag{2}$$



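As the $Y_i$ are assumed to be independent, the likelihood function is the product of the $f_i(Y_i)$, and the log-likelihood function is given by

$$\ln L(\beta) = \sum_{i=1}^{n} Y_i X_i'\beta - \sum_{i=1}^{n} \ln\left(1 + \exp(X_i'\beta)\right) \tag{3}$$

where

$$\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_p)$$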
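The step from the Bernoulli probability function (2) to the log-likelihood can be spelled out as follows. This is the standard derivation, written in the notation of this section, using $\ln\frac{\pi_i}{1-\pi_i} = X_i'\beta$ and $1 - \pi_i = [1 + \exp(X_i'\beta)]^{-1}$.

```latex
\begin{aligned}
\ln \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}
  &= \sum_{i=1}^{n} \left[ Y_i \ln \pi_i + (1 - Y_i) \ln (1 - \pi_i) \right] \\
  &= \sum_{i=1}^{n} \left[ Y_i \ln \frac{\pi_i}{1 - \pi_i} + \ln (1 - \pi_i) \right] \\
  &= \sum_{i=1}^{n} Y_i X_i'\beta - \sum_{i=1}^{n} \ln \left( 1 + \exp(X_i'\beta) \right)
\end{aligned}
```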

Thus, the fitted logistic response function can be expressed as follows:

$$\hat{\pi} = \frac{\exp(X'\beta)}{1 + \exp(X'\beta)} = \left[1 + \exp(-X'\beta)\right]^{-1} \tag{4}$$

where

$$X'\beta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \tag{5}$$
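The logistic response function and the log-likelihood can be evaluated directly. The sketch below is a minimal illustration using only the Python standard library; the toy data and the coefficient values are hypothetical and are not fitted values from this research.

```python
import math

def linear_predictor(x, beta):
    """Return X'beta = beta_0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def logistic_response(x, beta):
    """Return pi = 1 / (1 + exp(-X'beta))."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(x, beta)))

def log_likelihood(X, y, beta):
    """Return sum_i y_i * X_i'beta - sum_i ln(1 + exp(X_i'beta))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = linear_predictor(x_i, beta)
        total += y_i * eta - math.log1p(math.exp(eta))
    return total

# Hypothetical toy data: two explanatory variables, four observations.
X = [[1.0, 0.5], [2.0, 1.5], [0.5, 0.2], [3.0, 2.5]]
y = [0, 1, 0, 1]
beta = [-2.0, 0.8, 0.4]  # hypothetical coefficients, not fitted values

probs = [logistic_response(x_i, beta) for x_i in X]
print(probs)
print(log_likelihood(X, y, beta))
```

Maximum likelihood estimation searches for the coefficient vector that makes `log_likelihood` as large as possible; in practice this is done by statistical software rather than by hand.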

4.2 Diagnostic Test

1. Test linearity in the logit for continuous independent variables.

Box-Tidwell Transformation
- An interaction term between each continuous independent variable and its natural logarithm is added to the logistic regression model.
- If at least one interaction term is significant, the linearity assumption is violated.
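Constructing the Box-Tidwell interaction terms can be sketched as below. This only builds the extra $x \ln(x)$ columns; refitting the logistic model with them and testing their significance is left to the statistical software. The function name, column naming scheme and data values are hypothetical.

```python
import math

def add_box_tidwell_terms(rows, continuous):
    """Return copies of the rows with one extra x*ln(x) column per continuous variable."""
    augmented = []
    for row in rows:
        new_row = dict(row)
        for name in continuous:
            x = row[name]
            if x <= 0:
                # ln(x) is undefined for x <= 0, so such variables need shifting first.
                raise ValueError(f"{name} must be positive to take ln(x)")
            new_row[f"{name}_x_ln_{name}"] = x * math.log(x)
        augmented.append(new_row)
    return augmented

# Hypothetical observations with two continuous predictors.
rows = [{"age": 52, "ldl": 5.73}, {"age": 63, "ldl": 4.41}]
augmented = add_box_tidwell_terms(rows, ["age", "ldl"])
print(augmented[0])
```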

2. Test independence of errors

Plot the residuals against the order of observation
- If the plot shows no trend or pattern and the residuals fluctuate around the baseline 0, the error terms are independent.

3. Test multicollinearity among independent variables

Pearson's correlation coefficient

$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$

where $\operatorname{cov}(x, y)$ is the covariance of $x$ and $y$, $\sigma_x$ is the standard deviation of $x$ and $\sigma_y$ is the standard deviation of $y$.
*If the absolute value of Pearson's correlation coefficient between two regressors is greater than 0.8, multicollinearity is a serious problem.
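The pairwise check can be sketched as follows, assuming the regressors are available as plain lists. The variable names `x1`, `x2`, `x3` and their values are hypothetical; only the 0.8 threshold comes from the rule above.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors in cov, sigma_x and sigma_y cancel.
    return cov / math.sqrt(var_x * var_y)

def flag_collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(columns[a], columns[b])
            if abs(r) > threshold:
                flagged.append((a, b, r))
    return flagged

# Hypothetical regressors.
columns = {
    "x1": [23.1, 28.6, 32.3, 38.0, 27.8, 36.2],
    "x2": [25.3, 28.9, 29.1, 32.0, 26.0, 30.8],
    "x3": [12.0, 2.1, 45.0, 24.3, 8.0, 30.5],
}
flagged = flag_collinear_pairs(columns)
print(flagged)
```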

Variance Inflation Factor (VIF)

$$VIF = \frac{1}{1 - R^2} \tag{6}$$

where $R^2$ is the coefficient of determination for the regression of that explanatory variable on all remaining independent variables.

*If $VIF > 10$, a multicollinearity problem is present.
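As a small worked illustration of equation (6), the sketch below converts $R^2$ values into VIFs and applies the cut-off of 10. The $R^2$ values are hypothetical, chosen only to show one variable above and two below the cut-off.

```python
def vif(r_squared):
    """Variance inflation factor 1 / (1 - R^2) for one explanatory variable."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)

# Hypothetical R^2 from regressing each variable on all the others.
r_squared_by_variable = {"age": 0.45, "tobacco": 0.20, "ldl": 0.93}

vifs = {name: vif(r2) for name, r2 in r_squared_by_variable.items()}
problematic = [name for name, value in vifs.items() if value > 10]
print(vifs)
print(problematic)
```

A VIF of 10 corresponds to $R^2 = 0.9$, so the rule flags any variable that is at least 90% explained by the others.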



4. Finding influential outliers

Studentized Pearson residuals

$$r_{sp_i} = \frac{r_{p_i}}{\sqrt{1 - h_{ii}}} \tag{7}$$

$r_{p_i}$: Pearson residual
$h_{ii}$: leverage value

*If a studentized residual is less than -3 or greater than +3, the case is a possible outlier and deserves closer attention.

Leverage Value

The average leverage is

$$\bar{h} = \frac{k + 1}{n} \tag{8}$$

$k$: number of independent variables
$n$: sample size

*A case whose leverage exceeds two times the average leverage is declared influential.



Cook's distance

$$D_i = \frac{r_{p_i}^2 \, h_{ii}}{p \, (1 - h_{ii})^2} \tag{9}$$

$r_{p_i}$: Pearson residual
$h_{ii}$: leverage value
$p$: number of parameters estimated

*If Cook's distance > 1 for an individual case, that case has an effect on the estimated coefficients.
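The three outlier rules in this subsection (studentized Pearson residual, leverage and Cook's distance) can be applied together as in the sketch below. It assumes the Pearson residuals and leverage values have already been produced by the fitting software and takes $p = k + 1$; all numbers are hypothetical.

```python
import math

def studentized_pearson(r_p, h):
    """Studentized Pearson residual r_p / sqrt(1 - h)."""
    return r_p / math.sqrt(1.0 - h)

def cooks_distance(r_p, h, p):
    """Cook's distance r_p^2 * h / (p * (1 - h)^2)."""
    return (r_p ** 2) * h / (p * (1.0 - h) ** 2)

def flag_cases(pearson_residuals, leverages, k):
    """Apply the three cut-offs to each case; k is the number of independent variables."""
    n = len(pearson_residuals)
    p = k + 1  # assumption: intercept plus k slopes
    leverage_cutoff = 2.0 * (k + 1) / n  # twice the average leverage
    flags = []
    for i, (r_p, h) in enumerate(zip(pearson_residuals, leverages), start=1):
        flags.append({
            "case": i,
            "outlier": abs(studentized_pearson(r_p, h)) > 3,
            "high_leverage": h > leverage_cutoff,
            "influential": cooks_distance(r_p, h, p) > 1,
        })
    return flags

# Hypothetical diagnostics for n = 8 cases and k = 1 independent variable.
pearson_residuals = [0.4, -0.7, 3.2, -0.3, 0.9, -1.1, 0.2, 0.6]
leverages = [0.15, 0.20, 0.10, 0.60, 0.25, 0.20, 0.30, 0.20]

flags = flag_cases(pearson_residuals, leverages, k=1)
for flag in flags:
    print(flag)
```

In this toy data one case is flagged as an outlier and another as high leverage, yet neither reaches a Cook's distance of 1, which illustrates that the three rules measure different things.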

4.3 Forward Stepwise Regression

Forward stepwise regression fits a regression model by adding, one at a time, the predictor variables that are significant to the response.

Akaike Information Criterion (AIC)

$$AIC_p = -2 \log_e L(b) + 2p \tag{10}$$

where $p$ is the number of parameters and $\log_e L(b)$ is the log-likelihood.

*The model with the minimum AIC value is chosen.
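The selection loop can be sketched as follows. This is a generic illustration, not the exact procedure run by the software used in this research: `fit_loglik` is a hypothetical callable that returns the maximised log-likelihood for a given set of variables, and here it is replaced by a made-up lookup so the example runs on its own.

```python
def aic(log_likelihood, n_params):
    """AIC = -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

def forward_stepwise(candidates, fit_loglik):
    """Add one variable at a time while doing so lowers the AIC."""
    selected = []
    best_aic = aic(fit_loglik(tuple(selected)), 1)  # intercept-only model
    while True:
        trials = []
        for var in candidates:
            if var in selected:
                continue
            trial = tuple(selected + [var])
            # Parameters: one slope per variable plus the intercept.
            trials.append((aic(fit_loglik(trial), len(trial) + 1), var))
        if not trials:
            break
        trial_aic, var = min(trials)
        if trial_aic >= best_aic:
            break  # no candidate improves the AIC
        selected.append(var)
        best_aic = trial_aic
    return selected, best_aic

# Hypothetical stand-in for model fitting: each variable adds a fixed
# amount to a baseline log-likelihood of -100.
gain = {"a": 10.0, "b": 4.0, "c": 0.5}

def fit_loglik(variables):
    return -100.0 + sum(gain[v] for v in variables)

selected, best_aic = forward_stepwise(["a", "b", "c"], fit_loglik)
print(selected, best_aic)
```

Variable `c` is left out because its small gain in log-likelihood does not offset the penalty of 2 for the extra parameter.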



4.4 Model Validation and Performance Analysis

Hosmer-Lemeshow goodness-of-fit
The Hosmer-Lemeshow test statistic:

$$\hat{C} = \sum_{j=1}^{g} \frac{(O_j - E_j)^2}{n_j \bar{\pi}_j (1 - \bar{\pi}_j)} \tag{11}$$

$O_j = \sum y_j$ is the number of positive responses among the covariate patterns falling in the $j$th decile.
$E_j$ is the estimate of the expected value of $O_j$ under the assumption that the fitted model is correct.
$n_j$ is the number of observations in the $j$th decile and $\bar{\pi}_j$ is their average estimated probability.
*If the p-value is smaller than the significance level, conclude that the model is not a good fit.
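The statistic in equation (11) can be computed as in the sketch below, assuming $E_j = n_j \bar{\pi}_j$ and groups formed by sorting the cases on their estimated probability. Only the statistic is computed; the p-value comes from comparing it with a chi-square distribution on $g - 2$ degrees of freedom, which is left to statistical software. The data are hypothetical.

```python
def hosmer_lemeshow_statistic(y, probs, g=10):
    """Sum over g groups of (O_j - E_j)^2 / (n_j * pi_bar_j * (1 - pi_bar_j))."""
    pairs = sorted(zip(probs, y))  # order cases by estimated probability
    n = len(pairs)
    statistic = 0.0
    for j in range(g):
        group = pairs[j * n // g:(j + 1) * n // g]
        n_j = len(group)
        if n_j == 0:
            continue
        o_j = sum(yi for _, yi in group)        # observed positives
        pi_bar = sum(p for p, _ in group) / n_j  # mean estimated probability
        e_j = n_j * pi_bar                       # expected positives
        # Assumes 0 < pi_bar < 1; otherwise the denominator is zero.
        statistic += (o_j - e_j) ** 2 / (n_j * pi_bar * (1.0 - pi_bar))
    return statistic

# Hypothetical outcomes and estimated probabilities, split into g = 2 groups.
y = [0, 1, 1, 1]
probs = [0.2, 0.2, 0.8, 0.8]
statistic = hosmer_lemeshow_statistic(y, probs, g=2)
print(statistic)
```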

Confusion Matrix
The confusion matrix is a contingency table that shows the
number of instances assigned to each class.

Table 1 : Layout of 2x2 Confusion Matrix



Table 2: Classification measures from the confusion matrix.

Formula                              Term
(TP + TN) / (TP + TN + FP + FN)      Accuracy of the model
(FP + FN) / (TP + TN + FP + FN)      Misclassification Rate
TP / (TP + FN)                       Sensitivity or True Positive Rate
TN / (TN + FP)                       Specificity or True Negative Rate
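The four measures in Table 2 can be computed from the cell counts as sketched below. The counts are hypothetical: they were chosen only so that the rates come out close to the sensitivity, specificity and accuracy reported in the conclusion, and they are not taken from the actual confusion matrix.

```python
def classification_measures(tp, tn, fp, fn):
    """Accuracy, misclassification rate, sensitivity and specificity from a 2x2 matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts (assumption, not reported in the slides).
measures = classification_measures(tp=29, tn=67, fp=23, fn=19)
for name, value in measures.items():
    print(f"{name}: {value:.2%}")
```

Accuracy and misclassification rate always sum to 1, so reporting one of them determines the other.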

Receiver operating characteristic (ROC) curve

The ROC curve is a graph that visualises the performance of a classification model at all classification thresholds.
The curve plots the true positive rate (TPR) against the false positive rate (FPR).

Area Under the ROC curve (AUC)

- The maximum value of AUC is 1.
- The minimum value of AUC can be taken as 0.5 instead of 0, since 0.5 corresponds to random guessing.
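The construction of the ROC curve and its area can be sketched with the standard library alone: each distinct score is used as a threshold, the resulting (FPR, TPR) points are joined, and the area is obtained by the trapezoidal rule. The outcomes and scores below are hypothetical.

```python
def roc_points(y, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    thresholds = sorted(set(scores), reverse=True)
    positives = sum(y)
    negatives = len(y) - positives
    points = [(0.0, 0.0)]
    for t in thresholds:
        # Classify as positive every case whose score is at least t.
        tp = sum(1 for yi, s in zip(y, scores) if s >= t and yi == 1)
        fp = sum(1 for yi, s in zip(y, scores) if s >= t and yi == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical outcomes and predicted probabilities.
y = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y, scores)
area = auc(points)
print(points)
print(area)
```

The area equals the proportion of (positive, negative) pairs in which the positive case receives the higher score, which is why a model that guesses at random scores about 0.5.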

5.0 Results and Discussion

1 Descriptive Statistics
2 Diagnostic Test
3 Multiple Logistic Regression
4 Forward Stepwise Regression
5 Model Validation and Performance Analysis
6 Conclusion

There are 13 results in total in this research.

The results were obtained separately from:

IBM SPSS software
R software

5.1 Descriptive Statistics

Figure 1: Summary of the Descriptive Statistic.



5.2 Diagnostic tests


Assumption 1: Linearity of independent variables and log odds
Figure 2: The Box-Tidwell Test result from the IBM-SPSS.

Assumption 2: Independence of errors

Figure 3: Residual plot (residual versus order of observation).

Assumption 3: Multicollinearity among independent variables
Figure 4: Collinearity Matrix between the independent variables.

Figure 5: Variance Inflation Factor (VIF) of 9 independent variables.



Assumption 4: Influential outliers

Figure 6: Actual number of observations with y=1, number of observations, predicted number of observations with y=1, Pearson residual, deviance residual, leverage, studentized Pearson residual, standardized deviance residual, change in Pearson chi-square, change in deviance statistic, change in Bhat observation.

Figure 7: Studentized residual against observation number, leverage against observation number and Cook's distance against observation number. Labelled points: studentized residual beyond ±3; leverage > 0.065 (observations 17, 27, 82, 106, 115, 155, 231, 334, 346, 372, 375, 398, 414 and 462); Cook's distance > 1.

Figure 8: Studentized residual against observation number.



5.3 Logistic Regression (Full Model)

Figure 9: Result when Training Set is fitted with Logistic Model



5.4 Forward Stepwise Logistic Regression


Figure 10: Step 1 and 2 of Forward Stepwise Regression

Figure 11: Step 3 and 4 of Forward Stepwise Regression



Figure 12: Step 5 and 6 of Forward Stepwise Regression



Figure 13: Summary of Forward Stepwise Regression

$$\log\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -6.52744 + 0.06435 \cdot \text{age} + 0.98935 \cdot \text{famhist} + 0.08795 \cdot \text{tobacco} + 0.03135 \cdot \text{typea} + 0.10832 \cdot \text{ldl}$$
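To show how the fitted equation is used, the sketch below converts the log-odds into a predicted probability for one patient and computes the odds ratio $e^{\beta}$ for each predictor. The coefficients are those of the fitted model above; the patient's values, the 0/1 coding of `famhist` and the measurement units are assumptions made for illustration.

```python
import math

# Coefficients of the fitted forward stepwise model.
COEFFICIENTS = {
    "intercept": -6.52744,
    "age": 0.06435,
    "famhist": 0.98935,
    "tobacco": 0.08795,
    "typea": 0.03135,
    "ldl": 0.10832,
}

def predicted_probability(patient):
    """Convert the fitted log-odds into a probability of heart disease."""
    logit = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * patient[name]
        for name in COEFFICIENTS
        if name != "intercept"
    )
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical patient; famhist assumed coded 1 = present, 0 = absent.
patient = {"age": 50, "famhist": 1, "tobacco": 5.0, "typea": 55, "ldl": 5.0}
probability = predicted_probability(patient)

# Odds ratio per one-unit increase in each predictor.
odds_ratios = {
    name: math.exp(b) for name, b in COEFFICIENTS.items() if name != "intercept"
}
print(round(probability, 4))
print(odds_ratios)
```

Because every coefficient is positive, each odds ratio is greater than 1, so an increase in any of the five predictors raises the estimated odds of heart disease.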

5.5 Model Validation and Performance Analysis

A. Hosmer-Lemeshow goodness-of-fit test:

B. Confusion Matrix:

C. Receiver Operating Characteristic (ROC) curve

(ROC curve plot: sensitivity against specificity.)

Area Under Curve = 0.6743



5.6 Conclusion

Age, family history of heart disease, cumulative tobacco use, Type A characteristic and low-density lipoprotein are the essential factors contributing to heart disease.
The model achieved 60.42% sensitivity (the percentage of people with heart disease who were correctly identified).
The model achieved 74.44% specificity (the percentage of people without heart disease who were correctly identified).
The model achieved 69.57% accuracy.
The AUC value of 0.6743 indicates that this model is satisfactory.

6.0 Future Work

Involve other machine learning techniques, such as decision tree, the Naïve Bayes algorithm and random forest.
Compare different machine learning techniques.
Involve sampling techniques, such as Synthetic Minority Oversampling and the Adaptive Synthetic Sampling approach.

7.0 References