MTH3901 MINI PROJECT SLIDE 2021 Latest Copy

MTH 3901 RESEARCH PROCESS IN MATHEMATICS
PREDICTION OF HEART DISEASE USING MULTIPLE
LOGISTIC REGRESSION
Chin, H.Y.1, Yeoh, R.L.W.1, Ridzuwan, N.H.N.1, Ramanatham, L.1 & Mohamad Afindi, A.S.1
1 Universiti Putra Malaysia
Department of Mathematics
29 January 2021
Table of Contents
1 Introduction
2 Objective
3 Literature Review
4 Methodology
5 Results and Discussion
6 Future Work
7 References
1.0 Introduction
Deaths associated with cardiovascular disease have risen rapidly worldwide in recent years, making it one of the leading causes of death.
In 2016, the World Health Organization (WHO) reported that approximately 17.9 million deaths globally were due to cardiovascular diseases, of which 15.2 million were directly attributable to heart attacks and strokes (WHO, 2017).
In 2018, deaths caused by ischaemic heart disease were recorded at 15.9% in urban areas and 15% in rural areas of Malaysia (Department of Statistics Malaysia, 2019).
Many factors contribute to heart disease.
Contributing factors include obesity, poor nutritional intake, excessive cholesterol in the body, and lack of physical activity.
Furthermore, a range of health conditions can raise the probability of a heart attack.
Examining and identifying the primary factors behind the event of heart disease is therefore of great importance.
Early detection of heart disease, before an unexpected death occurs, is essential.
With advances in technology and data collection, heart disease can be predicted efficiently using machine learning algorithms.
Many researchers have applied common machine learning algorithms such as logistic regression, Naïve Bayes, decision trees, random forest, K-means and support vector machines to predict the event of heart disease.
In this research, multiple logistic regression was chosen to predict the event of heart disease.
2.0 Objective
To predict the event of heart disease using multiple logistic regression.
To study which factors significantly affect heart disease.
To find the best-fitting model using diagnostic tests and forward stepwise regression.
3.0 Literature Review
No. Author(s): Findings
1. Bhatti et al. (2006): Logistic regression analysis was applied to investigate the factors that contribute most to the risk of ischaemic heart disease. Overall, 98.63% of the total cases were correctly classified.
2. Nishadi (2019): Logistic regression was used to identify the most significant predictors of heart disease by evaluating the 10-year risk of coronary heart disease (CHD) with 14 independent variables. The accuracy of the model was about 87%.
3. Saw et al. (2019): A logistic regression model was used and found that men are more susceptible to heart disease. Age, number of cigarettes smoked and systolic blood pressure increase the risk of heart disease. The model achieved 87% accuracy.
4. Prasad et al. (2019): Heart disease was predicted with various machine learning approaches (logistic regression, the k-nearest neighbours algorithm, Naïve Bayes and decision tree), which were then compared on model accuracy. Logistic regression achieved the highest accuracy among those models, 86.89%.
5. Mothukuri et al. (2020): Heart disease was predicted using different models such as support vector machine and random forest. The precision of the logistic regression model was the highest in that study, about 63.93%.
6. Lashmanarao, Swathi and Sundareswar (2019): Heart disease was predicted under three different sampling techniques: random oversampling, synthetic minority oversampling and the adaptive synthetic sampling approach. Across the three sampling techniques, logistic regression achieved only 67.5%, 68.8% and 65.7% accuracy.
7. Rajesh et al. (2018): The Naïve Bayes algorithm, decision trees and a combination of algorithms were used to forecast heart disease. The results showed that Naïve Bayes performed better when the dataset was small, while decision trees gave accurate results on a large dataset.
8. Krishnan and Geetha (2019): Decision trees and the Naïve Bayes algorithm were used to predict heart disease. The decision tree model achieved 91% accuracy and the Naïve Bayes model achieved 87% accuracy.
9. Yang et al. (2020): Several methods were used to build a cardiovascular disease prediction model, such as a multivariate regression model, classification and regression tree (CART), Naïve Bayes, bagged trees, AdaBoost and random forest. The random forest method achieved the highest accuracy, 78.72%.
10. Chicco and Jurman (2020): Ten machine learning classifiers were used to predict the survival of patients with heart failure. Random forests achieved the highest accuracy (74.0%).
4.0 Methodology
1 Multiple Logistic Regression Model

2 Diagnostic Test
3 Forward Stepwise Regression
4 Model Validation and Performance analysis
4.1 Multiple Logistic Regression Model
Suppose that, in a multiple logistic regression setting, a collection of p explanatory variables is denoted by the vector
$$X' = (X_1, X_2, \ldots, X_p)$$
Let the conditional probability that the outcome is present be denoted by
$$P(Y = 1 \mid X) = \pi$$
The logistic regression model can then be written in terms of the log odds of the event occurring:
$$\ln\left(\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \tag{1}$$
Maximum Likelihood Estimation
Let $(y_1, y_2, \ldots, y_n)$ be n independent observations corresponding to the random variables $(Y_1, Y_2, \ldots, Y_n)$.
Since each $Y_i$ is a Bernoulli random variable, the probability function of $Y_i$ is
$$f_i(Y_i) = \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}; \quad Y_i = 0 \text{ or } 1; \quad i = 1, 2, \ldots, n \tag{2}$$

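As the Y's are assumed to be independent, the likelihood is the product of the probability functions in (2). Taking logarithms and substituting the logit $\ln[\pi_i / (1 - \pi_i)] = X_i'\beta$ gives the log-likelihood function
$$\ln L(\beta) = \sum_{i=1}^{n} Y_i \, X_i'\beta - \sum_{i=1}^{n} \ln\left(1 + \exp(X_i'\beta)\right) \tag{3}$$
where
$$\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_p)'$$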
Thus, the fitted logistic response function can be expressed as follows:
$$\hat{\pi} = \frac{\exp(X'\beta)}{1 + \exp(X'\beta)} = \left[1 + \exp(-X'\beta)\right]^{-1} \tag{4}$$
$$X'\beta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \tag{5}$$
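The estimation step can be illustrated with a minimal sketch that maximises the Bernoulli log-likelihood by Newton-Raphson and then evaluates the fitted response function (4). This is not the procedure used by SPSS or R in this project; the synthetic data, the coefficient values in `true_beta` and all function names are assumptions made for illustration only.

```python
import math
import random

def sigmoid(eta):
    """Fitted logistic response (4), written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def linear_predictor(beta, xi):
    """X'beta as in (5); xi[0] is 1 for the intercept."""
    return sum(b * v for b, v in zip(beta, xi))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood: sum y_i*x_i'beta - sum ln(1 + exp(x_i'beta))."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = linear_predictor(beta, xi)
        total += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def score(beta, X, y):
    """Gradient of the log-likelihood: sum_i x_i * (y_i - pi_i)."""
    g = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        resid = yi - sigmoid(linear_predictor(beta, xi))
        for j, v in enumerate(xi):
            g[j] += v * resid
    return g

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def fit_logistic(X, y, max_iter=25, tol=1e-10):
    """Newton-Raphson iterations starting from beta = 0."""
    q = len(X[0])
    beta = [0.0] * q
    for _ in range(max_iter):
        info = [[0.0] * q for _ in range(q)]  # information matrix X'WX
        for xi in X:
            pi = sigmoid(linear_predictor(beta, xi))
            w = pi * (1.0 - pi)
            for j in range(q):
                for k in range(q):
                    info[j][k] += w * xi[j] * xi[k]
        step = solve(info, score(beta, X, y))
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical synthetic data (not the study data): intercept plus two predictors.
rng = random.Random(0)
true_beta = [-0.5, 1.0, -1.0]
X = [[1.0, rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [1 if rng.random() < sigmoid(linear_predictor(true_beta, xi)) else 0 for xi in X]

beta_hat = fit_logistic(X, y)
print("estimated coefficients:", beta_hat)
print("fitted probability for first case:", sigmoid(linear_predictor(beta_hat, X[0])))
```

At the maximum likelihood estimate the score vector is numerically zero, which is a quick check that the iterations have converged.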
4.2 Diagnostic Test
1. Test linearity in the logit for continuous independent variables.
Box-Tidwell Transformation
- An interaction term between each continuous independent variable and its natural logarithm is added to the logistic regression model.
- If at least one interaction term is significant, the linearity assumption is violated.
2. Test independence of errors
Plot residuals against order of observation
- If the plot shows no trend or pattern and the residuals fluctuate around the baseline 0, the error terms are independent.
3. Test multicollinearity among independent variables
Pearson's correlation coefficient
$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$
where cov is the covariance, $\sigma_x$ is the standard deviation of x and $\sigma_y$ is the standard deviation of y.
*If the Pearson correlation coefficient between two regressors is greater than 0.8, multicollinearity is a serious problem.
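As a small illustration of this screening rule, the sketch below computes the sample Pearson correlation for each pair of regressors and flags pairs above 0.8. The data and variable names are hypothetical, and comparing the absolute value of the correlation with the threshold (so that strong negative correlations are flagged too) is an assumption rather than something stated in the slides.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def flag_collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) > threshold:
                flagged.append((names[i], names[j], r))
    return flagged

# Hypothetical regressors: x2 is close to 2 * x1, x3 is unrelated.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = flag_collinear_pairs(columns)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.3f}")
```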
Variance Inflation Factor (VIF)
$$\mathrm{VIF} = \frac{1}{1 - R^2} \tag{6}$$
where $R^2$ is the coefficient of determination for the regression of that explanatory variable on all remaining independent variables.
*If VIF > 10, a multicollinearity problem is present.
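The VIF rule can be made concrete with a short sketch that applies formula (6) to a set of R^2 values. The R^2 values and variable names below are hypothetical; in practice each R^2 would come from regressing one explanatory variable on the others.

```python
def vif_from_r2(r2):
    """VIF (6) from the R^2 of regressing one explanatory variable on the rest."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2)

# Hypothetical R^2 values for three explanatory variables.
r2_values = {"x1": 0.20, "x2": 0.50, "x3": 0.95}
vifs = {name: vif_from_r2(r2) for name, r2 in r2_values.items()}
problem = [name for name, v in vifs.items() if v > 10]

for name, v in vifs.items():
    print(f"{name}: VIF = {v:.2f}")
print("variables with VIF > 10:", problem)
```

Since VIF > 10 is equivalent to R^2 > 0.9, the rule flags a variable when more than 90% of its variation is explained by the other explanatory variables.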

4. Finding influential outliers
Studentized Pearson residuals
$$r_{sp_i} = \frac{r_{p_i}}{\sqrt{1 - h_{ii}}} \tag{7}$$
$r_{p_i}$: Pearson residuals
$h_{ii}$: Leverage values
*If a studentized residual is less than -3 or greater than +3, the observation is a possible outlier and deserves closer attention.
Leverage Value
The average leverage is
$$\bar{h} = \frac{k + 1}{n} \tag{8}$$
k: Number of independent variables
n: Sample size
*An observation whose leverage exceeds two times the average leverage is declared influential.

Cook's distance
$$D_i = \frac{r_{p_i}^2 \, h_{ii}}{p \, (1 - h_{ii})^2} \tag{9}$$
$r_{p_i}$: Pearson residuals
$h_{ii}$: Leverage value
p: Number of parameters estimated
*If Cook's distance > 1 for an individual case, that case has a noticeable effect on the estimated coefficients.
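The three outlier checks can be sketched together for a handful of cases, using the studentized Pearson residual (7), the average leverage (k + 1)/n from (8) and Cook's distance (9). The Pearson residuals, leverage values and the sample size n below are hypothetical, and taking p = k + 1 (slopes plus intercept) is an assumption.

```python
import math

def studentized_pearson(r_p, h):
    """Studentized Pearson residual (7)."""
    return r_p / math.sqrt(1.0 - h)

def average_leverage(k, n):
    """Average leverage (8): (k + 1) / n."""
    return (k + 1) / n

def cooks_distance(r_p, h, p):
    """Cook's distance (9)."""
    return (r_p ** 2) * h / (p * (1.0 - h) ** 2)

k, n = 9, 308          # hypothetical number of predictors and sample size
p = k + 1              # assumed: k slopes plus one intercept
cutoff = 2 * average_leverage(k, n)

# Hypothetical (Pearson residual, leverage) pairs for three cases.
cases = [(0.5, 0.02), (2.0, 0.19), (-3.2, 0.03)]

report = []
for r_p, h in cases:
    r_sp = studentized_pearson(r_p, h)
    d = cooks_distance(r_p, h, p)
    report.append({
        "r_sp": r_sp,
        "cooks_d": d,
        "outlier": abs(r_sp) > 3,      # rule for (7)
        "high_leverage": h > cutoff,   # rule for (8)
        "influential": d > 1,          # rule for (9)
    })

for row in report:
    print(row)
```

In this toy input the second case has high leverage without being an outlier and the third is an outlier without high leverage, which is why the three diagnostics are checked separately.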
4.3 Forward Stepwise Regression
Forward stepwise regression fits a regression model by successively adding the predictor variables that are significant to the response.
Akaike Information Criterion (AIC)
$$AIC_p = -2 \ln L(b) + 2p \tag{10}$$
where p is the number of parameters and $\ln L(b)$ is the log-likelihood.
*The model with the minimum AIC value is chosen.
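A minimal sketch of forward selection driven by AIC (10) is given below. It is not the routine used to produce the results in this project. The function `toy_log_lik` is a hypothetical stand-in for refitting the logistic model on each subset, and the log-likelihood gains and the names `x6` and `x7` are invented for illustration.

```python
def aic(log_lik, n_params):
    """AIC (10): -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_lik + 2.0 * n_params

def forward_stepwise(candidates, log_lik_fn):
    """Greedy forward selection: add the variable that lowers AIC the most,
    stop when no addition lowers AIC."""
    selected = []
    best = aic(log_lik_fn(selected), 1)  # intercept-only model
    while True:
        trials = []
        for var in candidates:
            if var not in selected:
                subset = selected + [var]
                # parameters = chosen variables + intercept
                trials.append((aic(log_lik_fn(subset), len(subset) + 1), var))
        if not trials:
            break
        trial_aic, var = min(trials)
        if trial_aic >= best:
            break
        selected.append(var)
        best = trial_aic
    return selected, best

# Hypothetical log-likelihood gains; a real run would refit the model each time.
GAIN = {"age": 20.0, "famhist": 8.0, "tobacco": 5.0, "ldl": 4.0,
        "typea": 3.0, "x6": 0.5, "x7": 0.2}

def toy_log_lik(subset):
    return -300.0 + sum(GAIN[v] for v in subset)

selected, best_aic = forward_stepwise(list(GAIN), toy_log_lik)
print("selected variables:", selected)
print("AIC of chosen model:", best_aic)
```

Because each extra parameter adds 2 to the AIC, a variable is only worth adding when it raises the log-likelihood by more than 1, which is why the two weak toy variables are left out.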

4.4 Model Validation and Performance analysis
Hosmer-Lemeshow goodness-of-fit
The Hosmer-Lemeshow test statistic:
$$C_v = \sum_{j=1}^{g} \frac{(O_j - E_j)^2}{n_j \bar{\pi}_j (1 - \bar{\pi}_j)} \tag{11}$$
$O_j = \sum y_j$ is the number of positive responses among the covariate patterns falling in the jth decile.
$E_j$ is the estimate of the expected value of $O_j$ under the assumption that the fitted model is correct.
$n_j$ is the number of observations in the jth decile and $\bar{\pi}_j$ is their average estimated probability.
*If the p-value is smaller than the significance level, conclude that the model is not a good fit.
Confusion Matrix
The confusion matrix is a contingency table that shows the
number of instances assigned to each class.
Table 1 : Layout of 2x2 Confusion Matrix

Table 2: Classification measures derived from the confusion matrix.
Term: Formula
Accuracy of the model: (TP + TN) / (TP + TN + FP + FN)
Misclassification rate: (FP + FN) / (TP + TN + FP + FN)
Sensitivity or true positive rate: TP / (TP + FN)
Specificity or true negative rate: TN / (TN + FP)
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
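The measures in Table 2 follow directly from the four cell counts, as the short sketch below shows. The counts used here are hypothetical: they were chosen only because they are consistent with the percentages reported in Section 5.6, and the project's actual confusion matrix is not reproduced in this text.

```python
def classification_measures(tp, tn, fp, fn):
    """Accuracy, misclassification rate, sensitivity and specificity (Table 2)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, consistent with the reported percentages.
m = classification_measures(tp=29, tn=67, fp=23, fn=19)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

Accuracy and the misclassification rate always sum to 1, so reporting one of them determines the other.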
Receiver operating characteristic (ROC) curve
The ROC curve is a graph that visualises the performance of a classification model at all classification thresholds.
The curve plots the true positive rate (TPR) against the false positive rate (FPR).
Area Under the ROC curve (AUC)
- The maximum value of AUC is 1.
- The practical minimum value of AUC is 0.5 rather than 0, because an AUC of 0.5 corresponds to random classification.
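To show how the curve and its area are obtained, the sketch below sweeps the classification threshold over the predicted scores, records (FPR, TPR) at each threshold and integrates with the trapezoidal rule. The four labels and scores are hypothetical, and the function names are illustrative.

```python
def roc_points(y, score):
    """(FPR, TPR) pairs obtained by lowering the threshold through every score."""
    pos = sum(y)
    neg = len(y) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(score), reverse=True):
        tp = sum(1 for yi, s in zip(y, score) if s >= t and yi == 1)
        fp = sum(1 for yi, s in zip(y, score) if s >= t and yi == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels (1 = disease) and predicted probabilities.
y = [0, 0, 1, 1]
score = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y, score)
area = auc(pts)
print("ROC points:", pts)
print("AUC:", area)
```

Here three of the four (positive, negative) pairs are ranked correctly, so the AUC is 0.75; a classifier that ranks every pair correctly reaches 1.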
5.0 Results and Discussion
1 Descriptive Statistics
2 Diagnostic Test
3 Multiple Logistic Regression
4 Forward Stepwise Regression
5 Model Validation and Performance Analysis
6 Conclusion
There are 13 results in total in this research.
The results were obtained separately from:
IBM SPSS software
R software
5.1 Descriptive Statistics
Figure 1: Summary of the Descriptive Statistic.

5.2 Diagnostic tests

Assumption 1: Linearity of independent variables and log odds
Figure 2: The Box-Tidwell Test result from the IBM-SPSS.
Assumption 2: Independence of errors
Figure 3: Residual plot (residual versus order of observation).
Assumption 3: Multicollinearity among independent variables
Figure 4: Collinearity Matrix between the independent variables.
Figure 5: Variance Inflation Factor (VIF) of 9 independent variables.

Assumption 4: Influential outliers
Figure 6: Actual number of observations with y=1, Number of observation,

predicted number of observation with y=1, Pearson residual, Deviance residual,
Leverage, Studentized Pearson residual, Standardized Deviance Residual, Change in
Pearson chi-square, Change in deviance statistic, Change in Bhat observation.
Figure 7: Studentized residual against observation number, leverage against observation number and Cook's distance against observation number. Labelled points are observations with |studentized residual| > 3, leverage > 0.065 or Cook's distance > 1. The leverage panel labels observations 17, 27, 82, 106, 115, 155, 231, 334, 346, 372, 375, 398, 414 and 462.
Figure 8: Studentized residual against observation number.

5.3 Logistic Regression (Full Model)
Figure 9: Result when Training Set is fitted with Logistic Model

5.4 Forward Stepwise Logistic Regression

Figure 10: Step 1 and 2 of Forward Stepwise Regression


Figure 13: Summary of Forward Stepwise Regression

$$\ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -6.52744 + 0.06435 \cdot \text{age} + 0.98935 \cdot \text{famhist} + 0.08795 \cdot \text{tobacco} + 0.03135 \cdot \text{typea} + 0.10832 \cdot \text{ldl}$$
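As a worked illustration of how the fitted equation is used, the sketch below converts the logit into a predicted probability for one patient and computes the odds ratio for family history. The coefficients are those reported above; the patient values, their units and the coding of `famhist` as 1 for a present family history are hypothetical assumptions.

```python
import math

# Coefficients of the final forward stepwise model reported above.
COEF = {
    "intercept": -6.52744,
    "age": 0.06435,
    "famhist": 0.98935,
    "tobacco": 0.08795,
    "typea": 0.03135,
    "ldl": 0.10832,
}

def predicted_probability(patient):
    """Return (logit, probability) for a dict of predictor values."""
    logit = COEF["intercept"] + sum(COEF[name] * value for name, value in patient.items())
    return logit, 1.0 / (1.0 + math.exp(-logit))

# Hypothetical patient; famhist = 1 is assumed to mean a family history is present.
patient = {"age": 50, "famhist": 1, "tobacco": 5.0, "typea": 55, "ldl": 5.0}
logit, prob = predicted_probability(patient)
odds_ratio_famhist = math.exp(COEF["famhist"])

print(f"logit = {logit:.5f}")
print(f"predicted probability = {prob:.4f}")
print(f"odds ratio for famhist = {odds_ratio_famhist:.3f}")
```

Exponentiating a coefficient gives the multiplicative change in the odds of heart disease for a one-unit increase in that predictor, holding the others fixed.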
5.5 Model Validation and Performance Analysis
A. Hosmer-Lemeshow goodness-of-fit test:
B. Confusion matrix:
C. Receiver operating characteristic (ROC) curve (sensitivity against specificity)
Area Under Curve = 0.6743
5.6 Conclusion
Age, family history of heart disease, cumulative tobacco use, the type A characteristic and low-density lipoprotein are the essential factors contributing to heart disease.
The model achieved 60.42% sensitivity (the percentage of people with heart disease who were correctly identified).
The model achieved 74.44% specificity (the percentage of people without heart disease who were correctly identified).
The model achieved 69.57% accuracy.
The AUC value of 0.6743 indicates that the model is satisfactory.
6.0 Future Work
Involve other machine learning techniques.
- Decision tree, Naïve Bayes algorithm and random forest.
- Compare different machine learning techniques.
Involve sampling techniques.
- Synthetic minority oversampling and the adaptive synthetic sampling approach.
7.0 References
