
MAT 303 Project Two Summary Report

[Full Name]

[SNHU Email]

Southern New Hampshire University


1. Introduction

This study examines a cardiovascular disease dataset. The results of the statistical analysis will be used to develop a model that predicts cardiovascular disease in real-world patients. Such a model can be used to identify individuals at high risk for cardiovascular disease so that they can be monitored or treated. Logistic regression and random forest analyses are used in this study.

2. Data Preparation

The important variables in this dataset are heart disease (target), age (years), resting blood pressure (trestbps), exercise-induced angina (exang), and maximum heart rate achieved (thalach). The heart_disease dataset has 303 rows and 14 columns.

3. Model #1 - First Logistic Regression Model

Reporting Results

The general form of this regression model is:

$$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}$$

where $y$ is 1 for having heart disease and 0 for not having heart disease, $x_1$ is age, $x_2$ is the person's resting blood pressure, $x_3$ is exercise-induced angina, and $x_4$ is maximum heart rate achieved.
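
In logit (natural logarithm of the odds) form, the model can be rewritten as

$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$

where $\pi$ represents the probability that an individual has heart disease and $1-\pi$ represents the probability that an individual does not have heart disease.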

The estimated logit model can be written as:

log(odds) = -1.021 - 0.0175 * age - 0.0149 * trestbps - 1.625 * exang1 + 0.0312 * thalach

The estimated coefficient of the maximum heart rate achieved variable (thalach) is 0.0312, which means that for every 1-unit increase in thalach, holding the other variables constant, the odds of heart disease are multiplied by e^0.0312 ≈ 1.032, an increase of about 3.2%.
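
As a quick check on this interpretation, the sketch below exponentiates the thalach coefficient to obtain the odds ratio for a one-unit increase. For a coefficient this small, the percent change in odds is close to 100 times the coefficient itself. The variable names are illustrative and are not taken from the original analysis.

```python
import math

# Estimated coefficient of thalach from the fitted logit model above.
beta_thalach = 0.0312

# Exponentiating a logit coefficient gives the odds ratio for a 1-unit increase.
odds_ratio = math.exp(beta_thalach)
percent_change = (odds_ratio - 1) * 100

print(f"odds ratio = {odds_ratio:.4f}")
print(f"percent change in odds = {percent_change:.2f}%")
```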

Evaluating Model Significance

A Hosmer-Lemeshow goodness-of-fit test was conducted. The null hypothesis is that the model fits the data well; the alternative hypothesis is that the model does not fit the data well. The test statistic is 44.622 and the p-value is 0.612. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. Therefore, there is insufficient evidence to conclude that the model does not fit the data well. Based on Wald's test, only age and thalach are significant because their p-values are less than the level of significance.

                   Prediction default:0   Prediction default:1
Actual default:0            89                     49
Actual default:1            31                    134


Treating heart disease (1) as the positive class, true positives (TP) is 134, true negatives (TN) is 89, false positives (FP) is 49, and false negatives (FN) is 31.

Accuracy = (TP + TN) / total = (134 + 89) / (89 + 49 + 31 + 134) = 0.74. The accuracy is 74%, which means that the model correctly predicted 74% of the cases.

Precision = TP / (TP + FP) = 134 / (134 + 49) = 0.73. The precision is 73%, which means that 73% of the cases that the model predicted as heart disease actually had heart disease.

Recall = TP / (TP + FN) = 134 / (134 + 31) = 0.81. Recall is 81%, which means that 81% of the actual heart disease cases were correctly predicted by the model.

The ROC Curve


The curve lies close to the top-left corner, which indicates good model performance. The AUC is 0.8007, indicating that the model is a good classifier.

Making Predictions Using Model

The probability of heart disease for an individual who is 50 years old, has a resting blood pressure of 122, has exercise-induced angina, and has a maximum heart rate of 140 is 0.2716. The corresponding odds are 0.2716 / (1 - 0.2716) ≈ 0.37, or roughly 1 to 2.7. On the other hand, the probability of heart disease for an individual who is 50 years old, has a resting blood pressure of 130, does not have exercise-induced angina, and has a maximum heart rate of 165 is 0.7853. The corresponding odds are 0.7853 / (1 - 0.7853) ≈ 3.66, or roughly 3.7 to 1. From these probabilities, the second individual is much more likely to have heart disease than the first individual.
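
The predictions above can be approximated directly from the estimated logit model. The following is a minimal sketch, assuming the rounded coefficients printed in this report; because of that rounding, the computed probabilities differ slightly from the reported values. The function and variable names are hypothetical.

```python
import math

# Rounded coefficients of the estimated logit model reported above.
COEFFS = {
    "intercept": -1.021,
    "age": -0.0175,
    "trestbps": -0.0149,
    "exang1": -1.625,
    "thalach": 0.0312,
}

def predict_probability(age, trestbps, exang, thalach):
    """Return P(heart disease) from the logit model (exang is 0 or 1)."""
    log_odds = (
        COEFFS["intercept"]
        + COEFFS["age"] * age
        + COEFFS["trestbps"] * trestbps
        + COEFFS["exang1"] * exang
        + COEFFS["thalach"] * thalach
    )
    # Inverse logit turns log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(probability):
    """Convert a probability into odds in favor of the event."""
    return probability / (1.0 - probability)

p_first = predict_probability(age=50, trestbps=122, exang=1, thalach=140)
p_second = predict_probability(age=50, trestbps=130, exang=0, thalach=165)

print(f"first individual:  p = {p_first:.4f}, odds = {odds(p_first):.2f}")
print(f"second individual: p = {p_second:.4f}, odds = {odds(p_second):.2f}")
```

Odds are computed as p / (1 - p), so a value below 1 means the event is less likely than not.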

4. Model #2 - Second Logistic Regression Model

Reporting Results

The general form of this regression model is:

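$$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_1 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_1 x_6}}$$

where $y$ is 1 for heart disease and 0 for the absence of heart disease, $x_1$ is age, $x_2$ is resting blood pressure, $x_3$, $x_4$, and $x_5$ are indicator variables for chest pain types 1, 2, and 3 (cp1, cp2, cp3), $x_6$ is maximum heart rate achieved, and $x_1 x_6$ is the interaction between age and maximum heart rate.

In logit (natural logarithm of the odds) form, the model can be rewritten as

$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_1 x_6$$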

The prediction model is:

log(odds) = -19.2410 + 0.2885 * age - 0.0196 * trestbps + 1.9144 * cp1 + 2.0393 * cp2 + 1.7812 * cp3 + 0.1443 * thalach - 0.0020 * age * thalach

Evaluating Model Significance

The Hosmer-Lemeshow goodness-of-fit test was performed for the logistic regression model. The null hypothesis is that the model fits the data well, and the alternative hypothesis is that the model does not fit the data well. The test statistic is 37.74 with 48 degrees of freedom, and the p-value is 0.8562. Since the p-value is greater than 0.05, we cannot reject the null hypothesis. Thus, there is insufficient evidence to conclude that the model does not fit the data well. Age, chest pain type 1 (cp1), chest pain type 2 (cp2), maximum heart rate achieved (thalach), and the interaction term between age and maximum heart rate (age:thalach) are significant because their 95% confidence intervals do not include zero.

                   Prediction default:0   Prediction default:1
Actual default:0           102                     36
Actual default:1            34                    131

True positives: 131 (people predicted to have heart disease who actually have it). True negatives: 102 (people predicted not to have heart disease who actually do not have it). False positives: 36 (people predicted to have heart disease who actually do not have it). False negatives: 34 (people predicted not to have heart disease who actually have it). The accuracy of the model is 0.7689, indicating that 76.89% of the predictions were correct. The precision of the model is 0.7844, indicating that 78.44% of the people predicted to have heart disease do have heart disease. The recall of the model is 0.7939, indicating that 79.39% of the people with heart disease were correctly predicted to have it.
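
The accuracy, precision, and recall figures for this model follow directly from the four confusion matrix counts. The sketch below shows the arithmetic, treating heart disease (1) as the positive class; the function name is hypothetical and is not part of the original analysis.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all cases predicted correctly
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "recall": tp / (tp + fn),        # share of real positives that were found
    }

# Counts from the second logistic regression model's confusion matrix.
metrics = classification_metrics(tp=131, tn=102, fp=36, fn=34)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to the other confusion matrices in this report by substituting their counts.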

Receiver Operating Characteristic (ROC) curve.

The curve lies close to the top-left corner, which indicates good model performance. The AUC is 0.8477, indicating that the model is a good classifier.


Making Predictions Using Model

The probability of heart disease for a person who is 50 years old, has a resting blood pressure of 115, does not have chest pain, and has a maximum heart rate of 133 is 0.2224. The corresponding odds are 0.2224 / (1 - 0.2224) ≈ 0.29, or roughly 1 to 3.5. The probability of heart disease for a person who is 50 years old, has a resting blood pressure of 125, has typical angina, and has a maximum heart rate of 155 is 0.8068. The corresponding odds are 0.8068 / (1 - 0.8068) ≈ 4.18, or roughly 4.2 to 1. The probability for the second person is much higher than for the first, which means that the second person is more likely to have cardiovascular disease than the first person. The two profiles differ in resting blood pressure, chest pain, and maximum heart rate, and together these differences lead the model to assign the second person a much higher risk.

5. Random Forest Classification Model

Reporting Results

The original, training, and validation datasets had 303, 257, and 46 rows, respectively. Below is a plot of the training and testing error versus the number of trees for a random forest classification model that predicts the presence of heart disease:


The optimal number of trees for the random forest model is four, since it gives the lowest relative error.

Evaluating the Utility of the model

                   Prediction default:0   Prediction default:1
Actual default:0            12                      6
Actual default:1             5                     23

Accuracy = (True Positive + True Negative) / Total = (23 + 12) / (12 + 6 + 5 + 23) = 0.76. The accuracy is 76%, which means that the model correctly predicted 76% of the validation cases.

Precision = True Positive / (True Positive + False Positive) = 23 / (23 + 6) = 0.79. The precision is 79%, which means that 79% of the cases predicted by the model to have heart disease actually have heart disease.

Recall = True Positive / (True Positive + False Negative) = 23 / (23 + 5) = 0.82. The recall is 82%, which means that 82% of the actual heart disease cases were correctly predicted by the model.

6. Random Forest Regression Model

Reporting Results

The original, training, and validation datasets contain 303, 242, and 61 rows, respectively. A plot of the mean squared error versus the number of trees in the random forest regression model for maximum heart rate achieved (thalach) is shown below:

From the analysis, the optimal number of trees for the random forest model is 4 because it gives the smallest relative error.


Evaluating the Utility of the Random Forest Regression Model

Variables actually used in tree construction:
[1] age      chol     cp       exang    trestbps

Root node error: 126906/242 = 524.4

n= 242

         CP nsplit rel error  xerror     xstd
1  0.151987      0   1.00000 1.01234 0.082873
2  0.086013      1   0.84801 0.94065 0.083911
3  0.048658      2   0.76200 0.84510 0.075162
4  0.031129      3   0.71334 0.85248 0.073349
5  0.024395      4   0.68221 0.84870 0.071240
6  0.016907      5   0.65782 0.93527 0.079425
7  0.016817      7   0.62400 0.94811 0.077117
8  0.013184      8   0.60719 0.90457 0.076639
9  0.012221      9   0.59400 0.92222 0.078053
10 0.010000     10   0.58178 0.94182 0.079300

The root mean squared error (RMSE) of the training set is 18.5732, while the RMSE of the test set is 21.7213. The root node error is 524.4. This value is the sum of squared differences between the actual `thalach` values and their mean at the root node (126906) divided by the 242 training observations. The table provided reports the complexity parameter (CP) along with the number of splits, relative error, cross-validation error (xerror), and standard deviation of the cross-validation error (xstd). The relative error decreases as the number of splits increases, while the cross-validation error is lowest (0.84510) at two splits. The cross-validation error serves as an indicator of performance on unseen data, and a lower value indicates better generalizability.
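
To make the table easier to read, the sketch below recomputes the root node error and picks out the row with the smallest cross-validation error. It assumes the usual rpart convention that the relative and cross-validation errors are scaled so that the root node has error 1; the variable names are hypothetical.

```python
# Rows of the CP table reported above: (CP, nsplit, rel_error, xerror, xstd).
CP_TABLE = [
    (0.151987, 0, 1.00000, 1.01234, 0.082873),
    (0.086013, 1, 0.84801, 0.94065, 0.083911),
    (0.048658, 2, 0.76200, 0.84510, 0.075162),
    (0.031129, 3, 0.71334, 0.85248, 0.073349),
    (0.024395, 4, 0.68221, 0.84870, 0.071240),
    (0.016907, 5, 0.65782, 0.93527, 0.079425),
    (0.016817, 7, 0.62400, 0.94811, 0.077117),
    (0.013184, 8, 0.60719, 0.90457, 0.076639),
    (0.012221, 9, 0.59400, 0.92222, 0.078053),
    (0.010000, 10, 0.58178, 0.94182, 0.079300),
]

# Root node error: total squared deviation from the mean divided by n.
root_node_error = 126906 / 242

# Row with the smallest cross-validation error (column index 3).
best_cp, best_nsplit, _, best_xerror, _ = min(CP_TABLE, key=lambda row: row[3])

# Assumption: errors are relative to the root node, so multiplying by the
# root node error gives an absolute cross-validated mean squared error.
cv_mse = best_xerror * root_node_error

print(f"root node error = {root_node_error:.1f}")
print(f"lowest xerror = {best_xerror} at nsplit = {best_nsplit} (CP = {best_cp})")
print(f"approximate cross-validated MSE = {cv_mse:.1f}")
```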

Root mean squared error

Two root mean squared error (RMSE) values were computed:

[1] "Root Mean Squared Error"

21.7213

[1] "Root Mean Squared Error"

18.5732

The RMSE values are 21.7213 (test set) and 18.5732 (training set). The RMSE summarizes the typical size of the residuals (the differences between the predicted values and the actual values) in a dataset. A low RMSE indicates that the model has a small prediction error and fits the data well. The training RMSE of 18.5732 is lower than the test RMSE of 21.7213, which is expected because the model was fit to the training data.

The RMSE depends on the scale of the target variable, so it is difficult to judge these values without additional context or baseline models for comparison.
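
For reference, RMSE is the square root of the average squared residual. The sketch below shows the calculation on three made-up heart rate values; these numbers are hypothetical and are not taken from the heart_disease dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical maximum heart rate values, for illustration only.
actual_values = [150, 160, 170]
predicted_values = [148, 165, 166]

example_rmse = rmse(actual_values, predicted_values)
print(f"RMSE = {example_rmse:.3f}")
```

Because the result is in the same units as the target, an RMSE for thalach is read as beats per minute.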

7. Conclusion

Among the logistic regression models, I would choose the second model to predict cardiovascular disease. The second model has higher accuracy (76.89%) and precision (78.44%) than the first model, with a recall of 79.39%. Its ROC curve also lies closer to the top-left corner, with a higher AUC (0.8477 versus 0.8007), indicating its effectiveness as a classifier. Based on the findings of this study, I would advise the use of random forest classification for predicting cardiovascular disease. In addition to being less prone to overfitting the training data, the random forest model has accuracy comparable to the logistic regression models (76% on the validation set).

A practical benefit of this work is that a model that predicts cardiovascular disease in real-world patients can be used to identify high-risk individuals, support accurate diagnosis, and guide monitoring or treatment.
