Project

MAT 303 Project Two Summary Report
[Full Name]
[SNHU Email]
Southern New Hampshire University

1. Introduction
This study analyzes a cardiovascular disease data set. The results of the statistical analysis are
used to develop a model that can predict cardiovascular disease for real-world patients. Such a
model can help identify individuals at high risk for cardiovascular disease so that they can be
monitored or treated. Logistic regression and random forest classification analysis are used in
this study.
2. Data Preparation
The important variables in this data set are heart disease (the target), age (years), resting blood
pressure (trestbps), exercise-induced angina (exang), and maximum heart rate achieved
(thalach). The heart_disease data set has 303 rows and 14 columns.
3. Model #1 - First Logistic Regression Model
Reporting Results
The general form of this regression model is:
$$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}$$
where y is 1 for having heart disease and 0 for not having heart disease, x1 is age, x2 is
the person's resting blood pressure, x3 is exercise-induced angina, and x4 is the maximum
heart rate achieved.
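The model can be rewritten in natural logarithm (logit) form as
$$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$
where π is the probability that an individual has heart disease and 1 − π is the probability
that the individual does not have heart disease.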
The estimated logit model can be written as:
log(odds) = -1.021 - 0.0175 * age - 0.0149 * trestbps - 1.625 * exang1 + 0.0312 * thalach
The estimated coefficient of the maximum heart rate achieved variable (thalach) is 0.0312,
which means that for every 1-unit increase in thalach, the odds of heart disease are multiplied
by e^0.0312 ≈ 1.032, an increase of about 3.2%, holding the other variables constant.
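The interpretation of a logit coefficient can be checked with a few lines of Python. This is only a minimal sketch: the coefficient is copied from the estimated model, while the function name `odds_ratio` is hypothetical and not part of the original analysis. Because exp(0.0312) is about 1.0317, the exact percentage change in the odds is close to, but not identical with, the coefficient multiplied by 100.

```python
import math

def odds_ratio(coefficient: float, delta: float = 1.0) -> float:
    """Multiplicative change in the odds for a `delta`-unit increase in a predictor."""
    return math.exp(coefficient * delta)

# Coefficient of thalach, copied from the estimated logit model in this report.
thalach_or = odds_ratio(0.0312)
percent_change = (thalach_or - 1.0) * 100.0
print(f"odds ratio = {thalach_or:.4f}, change in odds = {percent_change:.2f}%")
```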
Evaluating Model Significance
The Hosmer-Lemeshow goodness of fit test was used to evaluate the model. The null hypothesis
is that the model fits the data well, and the alternative hypothesis is that the model does not fit
the data well. The test statistic is 44.622 and the p-value is 0.612. Since the p-value is greater
than 0.05, we fail to reject the null hypothesis. Therefore, there is insufficient evidence to
conclude that the model does not fit the data well. Based on Wald's test, only age and thalach
are significant because their p-values are less than the level of significance.
                    Prediction default:0    Prediction default:1
Actual default:0            89                      49
Actual default:1            31                     134

With heart disease (default:1) as the positive class, true positives (TP) is 134, true negatives
(TN) is 89, false positives (FP) is 49 and false negatives (FN) is 31.
Accuracy = (TP + TN) / total = (134 + 89) / (89 + 49 + 31 + 134) = 0.74. The accuracy is 74%,
which means that the model correctly predicted 74% of the cases.
Precision = TP / (TP + FP) = 134 / (134 + 49) = 0.73. The precision is 73%, which means that
73% of the cases that the model predicted as heart disease actually had heart disease.
Recall = TP / (TP + FN) = 134 / (134 + 31) = 0.81. Recall is 81%, which means that 81% of the
actual heart disease cases were correctly predicted by the model.
The ROC Curve

The curve lies close to the top-left corner, which indicates that the model performs well. The
AUC is 0.8007, which indicates that the model is a good classifier.
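To make the meaning of the AUC concrete, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive case receives the higher predicted probability, counting ties as one half. The labels and scores are a small hypothetical example, not the heart disease data, and the function name `auc_from_scores` is an assumption made for illustration.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = heart disease) and predicted probabilities.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why values near 0.8 are usually read as good discrimination.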
Making Predictions Using Model
An individual who is 50 years old, has a resting blood pressure of 122, has exercise-induced
angina, and has a maximum heart rate of 140 has a predicted probability of heart disease of
0.2716. The corresponding odds are 0.2716 / (1 − 0.2716) ≈ 0.37, or about 1:2.7. On the other
hand, an individual who is 50 years old, has a resting blood pressure of 130, does not have
exercise-induced angina, and has a maximum heart rate of 165 has a predicted probability of
heart disease of 0.7853. The corresponding odds are 0.7853 / (1 − 0.7853) ≈ 3.66, or about 1:0.3.
From the probabilities, the second individual is much more likely to have heart disease than the
first individual.
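The two predictions can be approximately reproduced from the estimated logit model. The sketch below is illustrative only: the coefficients are the rounded values printed in this report, so the results differ slightly from the reported probabilities (0.2716 and 0.7853), which were presumably computed from unrounded coefficients. The function names are hypothetical.

```python
import math

# Rounded coefficients copied from the estimated logit model (Model #1).
COEF = {"intercept": -1.021, "age": -0.0175, "trestbps": -0.0149,
        "exang1": -1.625, "thalach": 0.0312}

def predict_probability(age, trestbps, exang, thalach):
    """Return P(heart disease) from the logit model; exang is 1 (yes) or 0 (no)."""
    logit = (COEF["intercept"] + COEF["age"] * age + COEF["trestbps"] * trestbps
             + COEF["exang1"] * exang + COEF["thalach"] * thalach)
    return 1.0 / (1.0 + math.exp(-logit))

def odds(p):
    """Convert a probability into odds in favor of the event."""
    return p / (1.0 - p)

p1 = predict_probability(age=50, trestbps=122, exang=1, thalach=140)
p2 = predict_probability(age=50, trestbps=130, exang=0, thalach=165)
print(f"individual 1: p = {p1:.4f}, odds = {odds(p1):.3f}")
print(f"individual 2: p = {p2:.4f}, odds = {odds(p2):.3f}")
```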
4. Model #2 - Second Logistic Regression Model
Reporting Results
The general form of this regression model is:
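$$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_1 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_1 x_6}}$$
where y is 1 for heart disease and 0 for the absence of heart disease, x1 is age, x2 is
resting blood pressure, x3, x4 and x5 are indicator variables for chest pain types 1, 2 and 3
(cp1, cp2, cp3), x6 is the maximum heart rate achieved, and x1x6 is the interaction between
age and maximum heart rate.
The model can be rewritten in natural logarithm (logit) form as
$$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_1 x_6$$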
The estimated logit model can be written as:
log(odds) = -19.2410 + 0.2885 * age - 0.0196 * trestbps + 1.9144 * cp1 + 2.0393 * cp2 +
1.7812 * cp3 + 0.1443 * thalach - 0.0020 * age * thalach
Evaluating Model Significance
The Hosmer-Lemeshow goodness of fit test was performed for the logistic regression model. The
null hypothesis is that the model fits the data well, and the alternative hypothesis is that the
model does not fit the data well. The test statistic is 37.74 with 48 degrees of freedom, and the
p-value is 0.8562. Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
Thus, there is insufficient evidence to conclude that the model does not fit the data well. Age,
chest pain type 1 (cp1), chest pain type 2 (cp2), maximum heart rate achieved (thalach), and the
interaction term between age and maximum heart rate (age:thalach) are significant because their
95% confidence intervals do not include zero.
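The p-value of the Hosmer-Lemeshow test is the upper-tail probability of a chi-square distribution evaluated at the test statistic. As a rough check, the sketch below evaluates that tail for the reported statistic (37.74) and degrees of freedom (48). It uses the closed-form sum that holds for even degrees of freedom, so it needs only the standard library; the function name `chi2_sf_even_df` is hypothetical, and a statistics package would normally be used instead.

```python
import math

def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive, even df")
    half = x / 2.0
    term = 1.0   # the k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k   # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Test statistic and degrees of freedom as reported for Model #2.
p_value = chi2_sf_even_df(37.74, 48)
print(f"p-value = {p_value:.4f}")
print("fail to reject H0" if p_value > 0.05 else "reject H0")
```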
                    Prediction default:0    Prediction default:1
Actual default:0           102                      36
Actual default:1            34                     131
True positives: 131 (people predicted to have heart disease who actually have heart disease).
True negatives: 102 (people predicted not to have heart disease who actually do not have it).
False positives: 36 (people predicted to have heart disease who actually do not have it).
False negatives: 34 (people predicted not to have heart disease who actually have it). The
accuracy of the model is 0.7689, indicating that 76.89% of the predictions were correct. The
precision of the model is 0.7844, indicating that 78.44% of the people predicted to have heart
disease do have heart disease. The recall of the model is 0.7939, indicating that 79.39% of the
people who actually have heart disease were correctly predicted to have heart disease.
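The three metrics follow directly from the four confusion-matrix counts. The sketch below recomputes them from the counts stated above (TP = 131, TN = 102, FP = 36, FN = 34); the function name `classification_metrics` is hypothetical and is used only for illustration.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "recall": tp / (tp + fn),        # share of real positives that are found
    }

# Counts reported for Model #2, with heart disease as the positive class.
model2 = classification_metrics(tp=131, tn=102, fp=36, fn=34)
for name, value in model2.items():
    print(f"{name}: {value:.4f}")
```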
Receiver Operating Characteristic (ROC) curve.
The curve lies close to the top-left corner, which indicates that the model performs well. The
AUC is 0.8477, which indicates that the model is a good classifier.

Making Predictions Using Model
For a person who is 50 years old, has a resting blood pressure of 115, does not have chest pain,
and has a maximum heart rate of 133, the predicted probability of heart disease is 0.2224. The
corresponding odds are 0.2224 / (1 − 0.2224) ≈ 0.29, or about 1 to 3.5. For a person who is 50
years old, has a resting blood pressure of 125, has frequent angina, and has a maximum heart
rate of 155, the predicted probability of heart disease is 0.8068. The corresponding odds are
about 4.17 to 1. The probability for the second person is much higher than the probability for
the first person. This means that the second person is more likely to have cardiovascular disease
than the first person. The probabilities and odds suggest that the second person's higher risk
reflects several factors, including a higher resting blood pressure, the presence of chest pain,
and a higher maximum heart rate.
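The same kind of check can be sketched for the second model, which adds chest pain indicators and an age by thalach interaction. Two assumptions are made here and should be read as such: "no chest pain" is treated as the baseline chest pain level (no indicator switched on), and "frequent angina" is treated as chest pain type 1. The coefficients are the rounded values printed in this report, and the interaction coefficient (-0.0020) is multiplied by a large product, so the results are only approximately equal to the reported 0.2224 and 0.8068.

```python
import math

# Rounded coefficients copied from the estimated logit model (Model #2).
COEF = {"intercept": -19.2410, "age": 0.2885, "trestbps": -0.0196,
        "cp1": 1.9144, "cp2": 2.0393, "cp3": 1.7812,
        "thalach": 0.1443, "age:thalach": -0.0020}

def predict_probability(age, trestbps, cp, thalach):
    """Return P(heart disease); cp is a chest pain level 0-3, with 0 as the baseline."""
    logit = (COEF["intercept"] + COEF["age"] * age + COEF["trestbps"] * trestbps
             + COEF["thalach"] * thalach + COEF["age:thalach"] * age * thalach)
    if cp in (1, 2, 3):
        logit += COEF[f"cp{cp}"]   # switch on the matching indicator variable
    return 1.0 / (1.0 + math.exp(-logit))

# Assumption: cp=0 encodes "no chest pain", cp=1 encodes "frequent angina".
p_first = predict_probability(age=50, trestbps=115, cp=0, thalach=133)
p_second = predict_probability(age=50, trestbps=125, cp=1, thalach=155)
print(f"first person:  p = {p_first:.4f}")
print(f"second person: p = {p_second:.4f}")
```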
5. Random Forest Classification Model
Reporting Results
The original dataset, training dataset, and validation dataset had 303, 257, and 46 rows,
respectively. Below is a plot of the training and testing error versus the number of trees using a
classification random forest model to predict the presence of heart disease:

The optimal number of trees for the random forest model is four since it gives the lowest relative
error.
Evaluating the Utility of the model
Prediction default:0 Prediction default:1
Accuracy = (true positive + true negative) / total = (12 + 23) / (12 + 5 + 6 + 23) = 0.76. The
accuracy is 76%, which means that the model correctly predicted 76% of the validation cases.
Precision = true positive / (true positive + false positive) = 12 / (12 + 6) = 0.67. The precision
is 67%, which means that 67% of the cases predicted by the model to have heart disease actually
have heart disease.
Recall = true positive / (true positive + false negative) = 23 / (23 + 5) = 0.82. The recall is
82%, which means that 82% of the actual heart disease cases were correctly predicted by the model.
6. Random Forest Regression Model
Reporting Results
The original, training, and validation datasets contain 303, 242, and 61 rows, respectively. A
plot of the mean squared error versus the number of trees in the random forest regression model
for maximum heart rate achieved (thalach) is shown below:
From the analysis, the optimal number of trees for the random forest model is 4 because it
gives the smallest relative error.

Evaluating the Utility of the Random Forest Regression Model
Variables actually used in tree construction:
[1] age chol cp exang trestbps
Root node error: 126906/242 = 524.4
n= 242
         CP nsplit rel error  xerror     xstd
1  0.151987      0   1.00000 1.01234 0.082873
2  0.086013      1   0.84801 0.94065 0.083911
3  0.048658      2   0.76200 0.84510 0.075162
4  0.031129      3   0.71334 0.85248 0.073349
5  0.024395      4   0.68221 0.84870 0.071240
6  0.016907      5   0.65782 0.93527 0.079425
7  0.016817      7   0.62400 0.94811 0.077117
8  0.013184      8   0.60719 0.90457 0.076639
9  0.012221      9   0.59400 0.92222 0.078053
10 0.010000     10   0.58178 0.94182 0.079300
The root mean squared error (RMSE) of the training set is 18.5732, while the RMSE of the test
set is 21.7213. The root node error is 524.4. This value is the mean squared difference between
the actual `thalach` values and the prediction at the root node (126906/242), before any splits
are made. The table reports the complexity parameter (CP) together with the number of splits,
the relative error, the cross-validation error (xerror), and the standard deviation of the
cross-validation error (xstd). The relative error decreases as the number of splits increases,
while the cross-validation error is lowest (0.84510) at two splits. The cross-validation error
serves as an indicator of performance on unseen data, and a lower value indicates better
generalizability.
Root mean squared error
Two root mean squared error (RMSE) values are reported:
[1] "Root Mean Squared Error"
21.7213
[1] "Root Mean Squared Error"
18.5732
The RMSE values are 21.7213 for the test set and 18.5732 for the training set. The RMSE
measures the typical size of the residuals (the differences between the predicted values and the
actual values) in a data set.
A low RMSE indicates a small prediction error and therefore a good fit. The training RMSE of
18.5732 is lower than the test RMSE of 21.7213, which is expected because the model was fit to
the training data.
The RMSE depends on the scale of the target variable, so it is difficult to judge these values
without a baseline or another model for comparison.
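One possible reference point is the root node error from the tree output, since its square root is the RMSE of always predicting the training-set mean. The sketch below shows how an RMSE is computed and derives that baseline from the printed root node error (126906/242). The `actual` and `predicted` lists are hypothetical values used only to exercise the function; they are not taken from the heart disease data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical thalach values, for illustration only.
actual = [150.0, 172.0, 131.0, 160.0]
predicted = [145.0, 165.0, 140.0, 158.0]
toy_rmse = rmse(actual, predicted)

# Baseline taken from the printed output: root node error = 126906/242.
baseline_rmse = math.sqrt(126906 / 242)
print(f"toy RMSE = {toy_rmse:.4f}")
print(f"root-node baseline RMSE = {baseline_rmse:.2f}")
```

Under this reading, the baseline of about 22.9 sits only slightly above the reported test RMSE of 21.7213, which suggests a modest improvement over predicting the mean, although the baseline comes from the training set and the comparison is therefore approximate.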
7. Conclusion
To predict cardiovascular disease with logistic regression, I would use the second logistic
regression model. It has higher accuracy (76.89%) and precision (78.44%) than the first model,
with a recall of 79.39%, and its larger AUC (0.8477 versus 0.8007) indicates that it is the more
effective classifier. Based on the findings of this study, I would also advise considering
random forest classification for predicting cardiovascular disease. In addition to being less
prone to overfitting the training data, the random forest model reached an accuracy of about
76% on its validation set, which is comparable to the logistic regression models.
A practical benefit of this work is a model that can be used to predict cardiovascular disease in
real-world patients. Such a model can help identify high-risk individuals so that they can be
diagnosed accurately, monitored, and treated.
