[Full Name]
[SNHU Email]
This study examines a cardiovascular disease data set. The results of the statistical analysis will
be used to develop a model that can predict cardiovascular disease in real-world
patients. Such a model can be used to identify and treat patients at high
risk for cardiovascular disease. Logistic regression and random forest classification analysis will
2. Data Preparation
The important variables in this data set are heart disease (the target), age (years), resting blood
pressure (trestbps), exercise-induced angina (exang), and maximum heart rate achieved
(thalach). The heart_disease dataset has 303 rows and 14 columns.
Reporting Results
$$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}$$
where y is 1 for having heart disease and 0 for not having heart disease, x1 is age, x2 is
resting blood pressure, x3 is exercise-induced angina, and x4 is the maximum heart
rate achieved.
The natural logarithmic format can be rewritten as
$$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$
Here π represents the probability that an individual has heart disease, while 1 − π represents
the probability that an individual does not have heart disease.
log(odds) = -1.021 - 0.0175 * age - 0.0149 * trestbps - 1.625 * exang1 + 0.0312 * thalach
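The estimated coefficient of the maximum heart rate achieved variable (thalach) is 0.0312,
which means that for every 1 unit increase in thalach, holding the other variables constant,
the log odds of heart disease increase by 0.0312. Equivalently, the odds of heart disease are
multiplied by e^0.0312 ≈ 1.032, an increase of about 3.2%.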
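As a quick check of this interpretation, the sketch below converts a logistic regression coefficient into an odds ratio by exponentiating it. The function name `odds_ratio` is illustrative and not part of the original analysis; only the coefficient 0.0312 is taken from the reported equation.

```python
import math

def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit increase
    in a predictor whose logistic regression coefficient is `beta`."""
    return math.exp(beta * delta)

# Coefficient of thalach as reported in the fitted log(odds) equation.
BETA_THALACH = 0.0312

or_one_unit = odds_ratio(BETA_THALACH)          # 1-unit increase in thalach
or_ten_units = odds_ratio(BETA_THALACH, 10.0)   # 10-unit increase in thalach

print(f"Odds ratio per 1 unit:   {or_one_unit:.4f}")
print(f"Percent change per unit: {(or_one_unit - 1) * 100:.2f}%")
print(f"Odds ratio per 10 units: {or_ten_units:.4f}")
```

For small coefficients the percent change in the odds is close to 100 times the coefficient itself, which is why the exponentiated value and the raw coefficient give similar percentages here.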
A Hosmer-Lemeshow goodness of fit test was conducted. The null hypothesis is that the model
fits the data well; the alternative hypothesis is that the model does not fit
the data well. The test statistic is 44.622 and the p-value is 0.612. Since the p-value is greater than 0.05,
we fail to reject the null hypothesis. Therefore, there is insufficient evidence to conclude that the model
does not fit the data well. Based on Wald’s test, only age and thalach are significant because their p-values are
Actual default:0 89 49
Accuracy = (TP + TN) / total = (89 + 134) / (89 + 49 + 31 + 134) = 223 / 303 = 0.74. The accuracy is 74%,
which means that the model correctly predicted 74% of the cases.
Precision = TP / (TP + FP) = 89 / (89 + 49) = 0.64. The precision is 64%, which means that 64%
of the cases that the model predicted as default were actually default.
Recall = TP / (TP + FN) = 89 / (89 + 31) = 0.74. Recall is 74%, which means that 74% of the
The probability of heart disease for an individual who is 50 years old, has a resting blood
pressure of 122, has exercise-induced angina, and has a maximum heart rate of 140 is 0.2716. The
corresponding odds are about 0.37, or roughly 1:2.7. On the other hand, the probability of heart
disease for an individual who is 50 years old, has a resting blood pressure of 130, does not have
exercise-induced angina, and has a maximum heart rate of 165 is 0.7853. The corresponding odds are
about 3.66, or roughly 1:0.3. Based on these probabilities, the second individual is much more likely
to have heart disease than the first individual.
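The two predictions above can be reproduced approximately from the reported log(odds) equation. The following is a minimal sketch, assuming the coefficients are exactly the rounded values printed in that equation and that exang1 is coded 1 when exercise-induced angina is present and 0 otherwise; the names `predict_probability` and `odds` are illustrative only.

```python
import math

# Coefficients copied from the reported log(odds) equation (rounded values).
COEF = {
    "intercept": -1.021,
    "age": -0.0175,
    "trestbps": -0.0149,
    "exang1": -1.625,
    "thalach": 0.0312,
}

def predict_probability(age, trestbps, exang, thalach):
    """Probability of heart disease under the first logistic model.
    `exang` is assumed to be 1 (angina present) or 0 (absent)."""
    log_odds = (
        COEF["intercept"]
        + COEF["age"] * age
        + COEF["trestbps"] * trestbps
        + COEF["exang1"] * exang
        + COEF["thalach"] * thalach
    )
    # Inverse logit: e^z / (1 + e^z), written in the equivalent 1 / (1 + e^-z) form.
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(probability):
    """Odds in favor of the event, p / (1 - p)."""
    return probability / (1.0 - probability)

p_first = predict_probability(age=50, trestbps=122, exang=1, thalach=140)
p_second = predict_probability(age=50, trestbps=130, exang=0, thalach=165)

print(f"First individual:  p = {p_first:.4f}, odds = {odds(p_first):.2f}")
print(f"Second individual: p = {p_second:.4f}, odds = {odds(p_second):.2f}")
```

With the rounded coefficients this gives probabilities of roughly 0.27 and 0.79. The small differences from the reported 0.2716 and 0.7853 are presumably due to rounding of the coefficients in the printed equation.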
Reporting Results
e ( β 0+ β 1 x 1+ β 2 x 2+ β 3 x 3+ β 4 x 4+ β 5 x 5+ β 6 x 6 )
E(y)=
1+ e ( β 0+ β 1 x 1+ β 2 x 2+ β 3 x 3+ β 4 x 4+ β 5 x 5+ β 6 x 6 )
where y is 1 for heart disease and 0 for the absence of heart disease, x1 is age, resting
x5 being the interaction between age group and x6 achieved between age and heart rate.
$$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$$
The prediction model is:
Log(odds) = -19.2410 + 0.2885 * age - 0.0196 * trestbps + 1.9144 * cp1 + 2.0393 * cp2 +
The Hosmer-Lemeshow goodness of fit test was performed for the logistic regression model. The
null hypothesis is that the model fits the data well, and the alternative hypothesis is that the model does
not fit the data well. The test statistic is 37.74 with 48 degrees of freedom, and the p-value is
0.8562. Since the p-value is greater than 0.05, we cannot reject the null hypothesis. Thus, there is
insufficient evidence to conclude that the model does not fit the data well. Age, chest pain type 1
(cp1), chest pain type 2 (cp2), maximum heart rate achieved (thalach), and the interaction term
between age and maximum heart rate achieved (age:thalach) are significant because they
True positives: 131 (people predicted to have heart disease who actually have heart
disease). True negatives: 102 (people predicted not to have heart disease
who actually do not). False positives: 36 (people predicted to
have heart disease who actually do not). False negatives: 34 (people predicted
not to have heart disease who actually do). The accuracy of the model
is 0.7689, indicating that 76.89% of the predictions were correct. The precision of the model is
0.7844, indicating that 78.44% of the people predicted to have heart disease do have heart
disease. The recall of the model is 0.7939, indicating that 79.39% of the people with true heart
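The accuracy, precision, and recall quoted above follow directly from the four confusion matrix counts. The sketch below recomputes them; the function name `classification_metrics` is illustrative and not taken from the original analysis, while the counts are the ones reported in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "recall": tp / (tp + fn),        # share of real positives that are found
    }

# Counts reported for the second logistic regression model.
metrics = classification_metrics(tp=131, tn=102, fp=36, fn=34)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to any of the other confusion matrices in this report, provided the positive class is chosen consistently for all four counts.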
Since the ROC curve is close to the top-left corner, the model's performance is good. The
For an individual who is 50 years old, has a resting blood pressure of 115, does not
have chest pain, and has a maximum heart rate of 133, the probability of heart disease is 0.2224. The
odds are about 1 to 3.5. For an individual who is 50 years old, has a resting blood pressure
of 125, has frequent angina, and has a maximum heart rate of 155, the probability of heart
disease is 0.8068. The odds are about 4.17 to 1. The probability for
the second individual is much higher than the probability for the first. This means that the
second person is more likely to have cardiovascular disease than the first person. The
probabilities and odds suggest that several risk factors contribute to the second person's
risk of heart disease, including elevated resting blood pressure, a history of chest pain, and
Reporting Results
The original dataset, training dataset, and validation dataset had 303, 257, and 46 rows,
respectively. Below is a plot of the training and testing error versus the number of trees using a
error.
Actual default:0 12 6
Actual default:1 5 23
Accuracy = (True Positive + True Negative) / Total = (12 + 23) / (12 + 5 + 6 + 23) = 0.76. The
accuracy was 76%, which means that the model correctly predicted 76% of the test cases.
Precision = true positive / (true positive + false positive) = 12 / (12 + 6) = 0.67. The precision is
67%, which means that 67% of the cases predicted by the model to be defaults are defaults.
Recall = true positive / (true positive + false negative) = 23 / (23 + 5) = 0.82. The recall was
82%, which means that 82% of the actual default cases were correctly predicted by the model.
Reporting Results
The original dataset, training, and validation datasets contain 303, 242, and 61 rows,
respectively. A plot of the mean squared error versus the number of trees in the random forest
From the analysis, the optimal number of trees for the random forest model is 4 because it
n= 242
The root mean squared error (RMSE) on the training set is 18.5732, while the RMSE on the test set is 21.7213.
The root node error of the tree is 524.4. This value is the
mean squared difference between the actual `thalach` values and the value predicted at the root node.
The table provided includes information about the complexity parameter (CP) and its
relationship with the number of splits, relative error, cross-validation error, and standard
indicator of performance on unseen data, and a lower value indicates improved generalizability.
RMSE values: 21.7213 and 18.5732. The RMSE summarizes the typical size of the residuals (the
differences between the predicted values and the actual values) in the data set on which it is computed.
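To make the definition concrete, the following is a minimal sketch of how an RMSE is computed from actual and predicted values. The function name `rmse` and the four sample values are hypothetical and chosen only for illustration; they are not taken from the heart_disease data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical thalach values, for illustration only.
actual_thalach = [150, 160, 140, 170]
predicted_thalach = [155, 150, 145, 160]

print(f"RMSE: {rmse(actual_thalach, predicted_thalach):.4f}")
```

Because the errors are squared before averaging, the result is in the same units as the target variable (here, beats per minute) and large individual errors weigh more heavily than small ones.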
A low RMSE value indicates that the model has a small prediction error, indicating a good
performance in fitting the test data. Thus, the RMSE value of 18.5732 is better than 21.7213, as
it shows a smaller difference between the predicted value and the actual value.
The RMSE value depends on the scale of the target variable. Without additional explanations or
7. Conclusion
Among the logistic regression models, I would use the second model to predict cardiovascular disease. The second
model performs better than the first model in terms of accuracy (76.89%), precision
(78.44%) and recall (79.39%). The ROC curve of the second model also lies close to
the top-left corner, indicating its effectiveness as a classifier. Based on the findings of the
analysis, I would advise the use of random forest classification for predicting cardiovascular
disease. In addition to being less prone to overfitting the training data, the random forest model is slightly
A practical benefit of this analysis is that the resulting model can be used to
predict cardiovascular disease in real-world patients. The model can be used to
identify, monitor, or treat high-risk patients, supporting accurate diagnosis of cardiovascular disease and