Jose-A Tavares
Tecnológico de Monterrey
All content following this page was uploaded by Jose-A Tavares on 09 June 2021.
The violin plot in Fig. 3 showed that there might be a strong correlation between age and stroke; that is, the dataset suggests that older people are more likely to suffer a stroke.

D. Feature Selection

Feature selection is an important task before preparing the data for the training process. If all the features are considered when training the model, some of them might introduce noise, limiting the model's accuracy. The Pearson correlation technique was used to calculate the correlation between features, represented as a heatmap in Fig. 6. From this, a high correlation was identified between the features ever_married and age. A strong correlation between features is undesirable because it can introduce noise during training and decrease model accuracy. Therefore, the ever_married feature was dropped, because of the two it had the lower correlation with the target variable.

The remaining features did not show a clear relationship, but they still had to be analyzed because a relationship might not be obvious from simple observation, so automatic feature selection methods were used.

Different ML algorithms were used for the training process; their accuracies are shown in Table III.

TABLE III. ML MODELS ACCURACY

Model                     Accuracy
Decision Tree             0.8994
Logistic Regression       0.8049
Naïve Bayes               0.7954
K Neighbors Classifier    0.8724
Random Forest             0.9189
Neural Network            0.8294
Support Vector Machine    0.8144
XGBoost Classifier        0.9059
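
The correlation-based pruning described under Feature Selection can be sketched in a few lines. This is only a minimal illustration, not the code used in this research: the data values, the 0.8 threshold, and the helper names `pearson` and `prune_correlated` are all assumptions made for the example. For every pair of features whose absolute Pearson correlation exceeds the threshold, it drops the one that correlates less with the target.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def prune_correlated(features, target, threshold=0.8):
    """Return the feature names kept after dropping one of each highly correlated pair.

    The threshold is an assumed value; the paper does not state one.
    """
    dropped = set()
    for a, b in combinations(features, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > threshold:
            # Keep the feature that is more strongly related to the target.
            r_a = abs(pearson(features[a], target))
            r_b = abs(pearson(features[b], target))
            dropped.add(a if r_a < r_b else b)
    return [name for name in features if name not in dropped]


# Hypothetical toy data (not taken from the real dataset).
features = {
    "age": [20, 25, 30, 45, 50, 60, 70, 80],
    "ever_married": [0, 0, 0, 1, 1, 1, 1, 1],
    "avg_glucose_level": [90, 110, 85, 120, 95, 105, 100, 115],
}
stroke = [0, 0, 0, 0, 0, 1, 1, 1]

kept = prune_correlated(features, stroke)
print(kept)
```

With this toy data, age and ever_married are strongly correlated with each other, and ever_married is the less correlated with the target, so it is the one removed, mirroring the decision described above.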
Model                     Accuracy
Decision Tree             0.9011
Logistic Regression       0.8059
Naïve Bayes               0.7907
K Neighbors Classifier    0.8790
Random Forest             0.9232
Neural Network            0.8241
Support Vector Machine    0.8220
XGBoost Classifier        0.9152
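
To make the two accuracy tables easier to compare, the short sketch below ranks the models and computes the per-model difference between them. The accuracies are copied from the tables; the names `table_iii`, `second_table`, and `best_model` are only illustrative labels, and this section does not state what distinguishes the second table from Table III, so no interpretation of the differences is implied.

```python
# Accuracies copied from Table III and from the second accuracy table.
table_iii = {
    "Decision Tree": 0.8994,
    "Logistic Regression": 0.8049,
    "Naïve Bayes": 0.7954,
    "K Neighbors Classifier": 0.8724,
    "Random Forest": 0.9189,
    "Neural Network": 0.8294,
    "Support Vector Machine": 0.8144,
    "XGBoost Classifier": 0.9059,
}
second_table = {
    "Decision Tree": 0.9011,
    "Logistic Regression": 0.8059,
    "Naïve Bayes": 0.7907,
    "K Neighbors Classifier": 0.8790,
    "Random Forest": 0.9232,
    "Neural Network": 0.8241,
    "Support Vector Machine": 0.8220,
    "XGBoost Classifier": 0.9152,
}


def best_model(scores):
    """Name of the model with the highest accuracy."""
    return max(scores, key=scores.get)


# Difference (second table minus Table III), rounded to the tables' precision.
deltas = {m: round(second_table[m] - table_iii[m], 4) for m in table_iii}

for model, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<24} {table_iii[model]:.4f} {second_table[model]:.4f} {delta:+.4f}")

print("Best in Table III:   ", best_model(table_iii))
print("Best in second table:", best_model(second_table))
```

Random Forest has the highest accuracy in both tables, and the differences between the tables are below 0.01 for every model.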
Fig. 8. Random Forest model confusion matrix

The model produced a total of 70 false negatives and 96 false positives, with an accuracy of 0.917 and an F-score of 0.920. A Receiver Operating Characteristic (ROC) curve was also built and the area under the ROC curve (AUC) was calculated. The higher the AUC, the better the model is at correctly classifying instances. The AUC was 0.975, as presented in the ROC curve shown in Fig. 9.

Fig. 9. ROC curve and AUC score

IV. DISCUSSION

Random Forest seems to have performed very well at predicting stroke outcome with the information provided by the dataset. As with many disease prediction models, the cost of a false negative is much higher than that of a false positive. This is because, even if the person is not prone to suffering a stroke, taking precautions and following the recommended treatment has a low negative impact on the person's health. On the other hand, if we tell a person that they are not at risk of such a disease when they actually are, we can put their life in danger. This is why it is very important to evaluate the model according to the business problem to be solved. The model resulting from this research showed a very low rate of false negative predictions, so it seems to perform well.

Among the features considered for the model construction, hypertension and heart disease are clinical indicators of stroke risk. Therefore, treating these diseases can help reduce it. The average glucose level is another clinical parameter that can be controlled and treated by medical staff. Smoking is another factor, so avoiding it is

V. CONCLUSIONS

This research showed that, through Data Science and the use of Machine Learning algorithms, it was possible to predict stroke outcome from known information about individuals. The CRISP-DM methodology also served as a guide throughout the analysis of the data, which made the process much simpler and more efficient without losing track of the business problem itself, and helped in taking the right decisions based on it.