

Stroke prediction through Data Science and
Machine Learning Algorithms
José Alberto Tavares Rodríguez
School of Engineering and Sciences
Tecnológico de Monterrey
Monterrey, México
A01232678@itesm.mx

Abstract— The number of applications of Data Science has been increasing rapidly over recent years due to its accelerated development and the growing body of research about it. Medicine is one of the areas that benefits most, and data-based decisions are increasingly trusted because of their efficiency and accuracy in a field where the decisions made by medical staff are vital for patients. Stroke is the second leading cause of death worldwide, and it is possible to create a model that predicts its outcome from past information and known occurrences. Many of Stroke's risk indicators can be controlled, which makes Stroke prediction very promising for reducing the chance of suffering from it by taking the required actions and treating people early enough.

This research article aims to apply Data Analytics and Machine Learning to create a model capable of predicting Stroke outcome based on an unbalanced dataset containing information about 5110 individuals for whom the Stroke outcome is known. The CRISP-DM methodology, widely used in many Data Science applications, served as guidance throughout this research. The result was the development and evaluation of several models based on Machine Learning techniques, among which the Random Forest classifier performed best.

Keywords—Dataset, Data Science, disease prediction, Machine Learning, Stroke
I. INTRODUCTION
Stroke, also known as a brain attack, happens when blood flow to the brain is blocked, preventing it from getting oxygen and nutrients and causing the death of brain cells within minutes [1]. According to the World Health Organization (WHO), it is the second leading cause of death worldwide after ischemic heart disease [2]. Stroke victims can experience paralysis, impaired speech, or loss of vision. While some Stroke risk factors cannot be modified, such as family history of cerebrovascular diseases, age, gender and race, others can, and these are estimated to account for 60%-80% of stroke risk in the general population [3]. Therefore, predicting stroke outcome for new cases can be decisive for treating patients early enough and avoiding disabling and fatal consequences.

The purpose of this research is to create an accurate model for predicting Stroke outcome with Data Science and Machine Learning (ML) based on previous data and individual characteristics, providing useful information for medical staff to deploy the needed treatment and decrease risks and consequences.

A. Related work
Data Science has been studied previously as a tool for making predictions about Stroke. One example is the development of a hybrid machine learning approach for cerebral stroke prediction [4]. Despite its similarity with the research described in this paper, the method used was different, and looking for a better-performing model is another objective of this research. Not only Stroke prediction but also its consequences have been widely studied. For example, Kim et al. [5] developed a Deep Neural Network (DNN) for predicting the motor outcome at six months after a Stroke by analyzing data from 1,056 consecutive stroke patients, with the objective of providing important information for clinicians to establish appropriate rehabilitation strategies. Also, in [6] Hbid et al. evaluated a mixed-effect linear model to predict the risk of cognitive decline post-stroke.

B. ML algorithms
Sometimes it can be difficult to interpret the patterns in data or extract useful information from it. This can be because of the amount of information or because the relationships between its components are not easily perceptible. This is where ML techniques are very useful: they teach, as the name implies, machines to handle data more efficiently [7].

Although there are many ML algorithms and variants, their applicability depends on the data composition and the research question to be answered. Supervised classification is one of the tasks most frequently carried out by intelligent systems, whose goal is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to testing instances where the values of the predictors are known but the value of the class label is not [8].

II. METHODS

A. Dataset
For the creation of the required model for Stroke prediction, it is important to use the right data as well as to include the important features for doing so. Datasets are collections of information arranged in a way that makes them easy to manipulate and modify, and they can be obtained from many databases and sources on the Internet. For this research, a dataset retrieved from Kaggle was used. Kaggle is a well-known Machine Learning and Data Science community, which provides access to a huge repository of community data and code [9].
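As a minimal sketch of this first step, the snippet below loads the Kaggle stroke dataset with pandas. The file name and column names are assumptions based on the public Kaggle dataset, since the paper does not state them explicitly.

```python
import pandas as pd

# Load the stroke dataset downloaded from Kaggle.
# The file name below is assumed; adjust it to the actual download.
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Quick sanity checks: 5110 rows, feature columns plus the 'stroke' target.
print(df.shape)
print(df.dtypes)
print(df["stroke"].value_counts())
```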
B. Software requirements
For managing the dataset and the model training process, the Python programming language was used on the Jupyter Notebook platform, which enhances data manipulation and visualization. Python is a high-level interpreted programming language that is very easy to learn while still being able to harness the power of system-level programming languages when necessary. Aside from these benefits, the community around the available tools and libraries makes it particularly attractive for workloads in data science, machine learning, and scientific computing [10].

C. CRISP-DM Methodology
Cross Industry Standard Process for Data Mining (CRISP-DM) is a well-known and widely used methodology in many Data Science projects [11]. It is composed of six stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment. These stages serve as guidance throughout the data analysis process and are very helpful for giving structure to the research while not losing track of the research question to be answered.

a) Business Understanding
Business Understanding is one of the most important steps, but it is also often ignored. In order to look for the required solutions to the problem, it is necessary to identify what stakeholders are searching for and what they are interested in. In this case, we are trying to identify whether a person is susceptible to suffering a stroke from some known data about him or her, in order to provide enough information and lower the risk of suffering from it. The cost of wrong predictions must also be considered, as the decisions made from the obtained information are crucial for patients' lives.

b) Data Understanding
After having a wider picture of the problem to be solved, we need to take a look at the data to be used and understand what kind of information it provides. The dataset contains information about 5110 people, 10 different features and the Stroke outcome occurrences. This means that we are dealing with a supervised learning approach, which needs to be considered when analyzing the data and selecting the proper model algorithms. Often, some manipulation of the dataset is needed, especially when the data is incomplete or not represented in a way that is easy to understand. Exploratory Data Analysis (EDA) then helps to visualize the data with the help of some plotting, providing an overview of how the features relate to each other and sometimes making it possible to formulate first hypotheses.

c) Data Preparation
The third step of the CRISP-DM methodology is data preparation, whose objective is to modify and manage the dataset so that it can be handled by the ML algorithms for the training process. For this, the data must be split into training and testing data, so that we have enough information for both learning and, afterwards, evaluating the obtained model.
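The snippet below sketches this split with scikit-learn. The 75%/25% proportion matches the split reported later in the paper; the target column name and the dataframe df from the earlier loading sketch are assumptions.

```python
from sklearn.model_selection import train_test_split

# Separate the predictors from the target variable 'stroke'
# (column name assumed from the Kaggle dataset).
X = df.drop(columns=["stroke"])
y = df["stroke"]

# 75% of the rows for training, 25% held out for evaluation,
# stratified so both splits keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```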
d) Modeling
Several modeling techniques exist and are available within the data analysis world. Therefore, it is vital to determine the most suitable one depending on the expected outcome, as well as on the business question to be answered or the problem to be solved. If more than one model is applicable, an evaluation must be performed to identify the one that performs best and has the better accuracy for the problem at hand. For this stage, eight different ML algorithms were used and compared in order to identify the best performing among them.

e) Evaluation
After modeling, it is important to assess the performance of each of the models obtained from the dataset. Not only accuracy but also recall and precision were used as parameters to evaluate the resulting models.

f) Deployment
This last step aims to make use of the results and the information obtained from the study. This means that they have to be available so that interested parties can take advantage of them, and so that other researchers can use them for further analysis or confirmation. To achieve this, this article draft was uploaded to ResearchGate, while the Jupyter Notebook containing the Python code was uploaded to Kaggle.

III. RESULTS
The dataset used for this research includes information about 5110 individuals characterized by 10 features, and it includes the target variable Stroke for all of them, which allows a supervised learning approach. As listed in Table I, the features comprise both demographic and clinical data, and their values can be either numerical or categorical.

TABLE I. DATA DEMOGRAPHY

Features                 Values
Demographic data
  Number of individuals  5110
  Age, years             1 – 82
  Gender                 Male/Female
  Ever married           1/0
  Work type              Private/Self-Employed/Govt_job/children/never worked
  Residence type         Urban/Rural
Clinical data
  Hypertension           1/0
  Heart disease          1/0
  Avg Glucose Level      55.12 – 271.74
  BMI a                  10.3 – 97.6
  Stroke                 1/0
Other
  Smoking status         Formerly smoked/never smoked/smokes/Unknown

a. BMI stands for Body Mass Index.
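A summary like the one in Table I can be reproduced directly from the dataframe. The snippet below is a small illustrative sketch; the column names are assumptions based on the Kaggle dataset.

```python
# Numerical ranges such as age, average glucose level and BMI
# (column names assumed from the Kaggle dataset).
print(df[["age", "avg_glucose_level", "bmi"]].describe())

# Value counts for the categorical features and the target variable.
for col in ["gender", "work_type", "Residence_type", "smoking_status", "stroke"]:
    print(df[col].value_counts(dropna=False))
```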

A. Data imputation and re-classification
The dataset originally had 201 missing values for the Body Mass Index (BMI) feature. These values were filled with the mean BMI calculated over the whole dataset. It was also observed that more than 30% of the individuals have an Unknown smoking status, which can likewise be considered missing data, or as not having enough information about this feature. In order to avoid omitting this data because of its amount, it was decided to re-categorize those individuals by making some assumptions. As people younger than 18 years old are less likely to smoke or to have smoked, the Unknown values for these individuals were changed to never. This reduced the number of unknowns from 1544 to 909, which were then deleted from the dataset. Another re-classification was to change every work type value from "children" to "never worked". This is because children should not have been considered a work type in the first place, and the value implies "never worked".
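A minimal sketch of these cleaning steps with pandas is shown below. The column names and category labels (bmi, age, smoking_status, work_type, "Never_worked") are assumed from the Kaggle dataset; this is an illustration of the described operations, not the author's exact code.

```python
# Fill the 201 missing BMI values with the dataset-wide mean.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# Individuals under 18 with an Unknown smoking status are assumed
# to have never smoked.
under_18 = (df["age"] < 18) & (df["smoking_status"] == "Unknown")
df.loc[under_18, "smoking_status"] = "never smoked"

# Drop the remaining rows whose smoking status is still Unknown.
df = df[df["smoking_status"] != "Unknown"]

# Re-classify the 'children' work type as 'never worked'
# (exact label spelling assumed).
df["work_type"] = df["work_type"].replace({"children": "Never_worked"})
```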

B. SMOTE for unbalanced dataset
During the first visualizations of the data, it was observed that the dataset was highly unbalanced, as shown in Fig. 1. This means that the positive and negative Stroke outcomes are far from being equally represented, which is very common for disease datasets. In many cases this should be addressed before the model learning process, which is why many techniques for resampling data exist. In this case, the Synthetic Minority Over-Sampling Technique (SMOTE) was used. This method over-samples the minority class by taking each minority class sample and introducing "synthetic" examples along the line segments joining any/all of the k minority class nearest neighbors [12]. Fig. 2 shows the result of this operation.

Fig. 1. Stroke value count before SMOTE

Fig. 2. Stroke value count after SMOTE
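As a sketch of this resampling step, the snippet below applies SMOTE from the imbalanced-learn library to the training split only, which is the usual practice; the paper does not detail at which point the resampling was applied, so this placement is an assumption, and the categorical features are assumed to have been encoded numerically beforehand.

```python
from imblearn.over_sampling import SMOTE

# Over-sample the minority (stroke = 1) class of the training data.
# Applying it only to the training split avoids leaking synthetic
# samples into the evaluation data (assumed setup).
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(y_train.value_counts())      # imbalanced counts before SMOTE
print(y_train_res.value_counts())  # balanced counts after SMOTE
```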

C. Exploratory Data Analysis
The violin plot in Fig. 3 showed that there might be a strong correlation between age and Stroke. This means that, according to the dataset, older people are more likely to suffer from Stroke.

Fig. 3. Age vs Stroke (violin plot)

Hypertension and Heart Disease showed very similar behavior in Fig. 4 and Fig. 5, respectively. When these diseases are absent, there is no clear relationship to Stroke, but when they are present in an individual, that person is more likely to suffer from stroke.

Fig. 4. Hypertension vs Stroke

Fig. 5. Heart disease vs Stroke

The remaining features did not show a clear relationship, but they must be analyzed as well, because such a relationship might not be obvious by simple observation; automatic feature selection methods were therefore used.
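The kind of plots referenced in Figs. 3-5 can be produced with seaborn. The snippet below is an illustrative sketch under assumed column names, not the author's exact plotting code.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Violin plot of age against the stroke outcome (as in Fig. 3).
sns.violinplot(data=df, x="stroke", y="age")
plt.show()

# Count plots relating hypertension and heart disease to stroke
# (as in Figs. 4 and 5).
sns.countplot(data=df, x="hypertension", hue="stroke")
plt.show()
sns.countplot(data=df, x="heart_disease", hue="stroke")
plt.show()
```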
D. Feature Selection
Feature selection is an important task prior to preparing the data for the training process. If all the features are considered when training the model, some of them might introduce noise, limiting the model's accuracy. The Pearson correlation technique was used to calculate the correlation between features, represented with the heatmap in Fig. 6. From this, a high correlation was identified between the features ever_married and age. Correlation between features is not desirable because it can introduce noise during the training process, decreasing model accuracy. Therefore, it was decided to drop the ever_married feature because of its lower correlation with the target variable.

Fig. 6. Correlation Matrix between features.

Although this method is widely used for feature selection by filtering features based on their correlation with the target variable, it was not used for that task here because it implied considering only age as a feature. This is not useful considering that the objective of this research is to provide information for Stroke risk reduction, which is not possible with a non-controllable feature such as age. Thus, the Backward Elimination technique was used to select the important features based on the calculated p-values, which are listed in Table II. Based on it, BMI was not considered for model training, in addition to ever_married, which had already been discarded.

TABLE II. BACKWARD ELIMINATION

Feature            P-value       p < 0.05
Gender             3.417056e-52  X
Age                0.000000e+00  X
Hypertension       1.309137e-19  X
Heart_disease      7.470603e-17  X
Work_type          3.154660e-22  X
Residence type     2.461691e-67  X
Avg_glucose_level  1.426606e-46  X
BMI                8.737794e-01
Smoking status     9.476409e-58  X
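A sketch of both steps is shown below, using seaborn for the correlation heatmap and statsmodels for the p-value-based elimination. It assumes the categorical features are already numerically encoded and, for simplicity, fits an ordinary least squares model to obtain the p-values; it illustrates the technique rather than reproducing the author's code.

```python
import seaborn as sns
import statsmodels.api as sm

# Pearson correlation heatmap between features (as in Fig. 6).
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")

# Backward elimination: repeatedly drop the feature with the largest
# p-value while any p-value exceeds the 0.05 threshold.
X_be = sm.add_constant(X)  # add an intercept term
while True:
    model = sm.OLS(y, X_be).fit()
    pvalues = model.pvalues.drop("const")
    worst = pvalues.idxmax()
    if pvalues[worst] < 0.05:
        break
    X_be = X_be.drop(columns=[worst])

print(X_be.columns)  # features retained after elimination
```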

E. Training and ML algorithm comparison
For the modeling process, the dataset had to be split into training data and testing data. The main point of doing so is to have enough data for the training process (75%), while still keeping some data for testing and evaluating the model (25%).

Different ML algorithms were used for the training process, whose accuracies are shown in Table III.

TABLE III. ML MODELS ACCURACY

Model                   Accuracy
Decision Tree           0.8994
Logistic Regression     0.8049
Naïve Bayes             0.7954
K Neighbors Classifier  0.8724
Random Forest           0.9189
Neural Network          0.8294
Support Vector Machine  0.8144
XGBoost Classifier      0.9059

It was possible to increase the accuracy of most of the models by performing the training process based on K-folds instead of a simple split. These new accuracy values are listed in Table IV and compared with the help of boxplots in Fig. 7.

TABLE IV. ML MODELS ACCURACY (K-FOLDS)

Model                   Accuracy
Decision Tree           0.9011
Logistic Regression     0.8059
Naïve Bayes             0.7907
K Neighbors Classifier  0.8790
Random Forest           0.9232
Neural Network          0.8241
Support Vector Machine  0.8220
XGBoost Classifier      0.9152

Fig. 7. Model accuracy comparison with K-folds (boxplots)
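The comparison summarized in Tables III and IV can be sketched with scikit-learn's cross-validation utilities as below. The hyperparameters and the number of folds are not reported in the paper, so defaults and a 10-fold setup are assumed; XGBoost comes from the separate xgboost package, and the resampled training data from the SMOTE sketch is reused.

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "K Neighbors Classifier": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Neural Network": MLPClassifier(max_iter=1000),
    "Support Vector Machine": SVC(),
    "XGBoost Classifier": XGBClassifier(),
}

# Stratified K-fold cross-validation on the resampled training data
# (the exact number of folds used in the paper is not stated).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_train_res, y_train_res, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```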
Random Forest proved to be the best performing ML algorithm for this specific problem, followed by XGBoost and Decision Tree in terms of model accuracy. However, because of the data imbalance previously discussed, accuracy might not be the best metric for evaluating the model; it is convenient to also consider the recall and precision of the predictions. The confusion matrix in Fig. 8 shows an overview of the predictions made by the Random Forest model.

Fig. 8. Random Forest model confusion matrix

It can be noted that the model predicted a total of 70 false negatives and 96 false positives, with an accuracy of 0.917 and an F-score of 0.920. A Receiver Operating Characteristic (ROC) curve was also built, and the area under the ROC curve (AUC) was calculated. The higher the AUC, the better the model is at correctly classifying instances. The model obtained an AUC of 0.975, which is presented in the ROC curve shown in Fig. 9.

Fig. 9. ROC curve and AUC score
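These evaluation figures can be reproduced with scikit-learn's metrics module as sketched below; rf stands for a fitted Random Forest model and is an assumed variable name.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             confusion_matrix, roc_auc_score)

# Predictions of a fitted Random Forest model on the held-out test set.
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]  # probability of the positive class

print(confusion_matrix(y_test, y_pred))  # counts of TN, FP, FN, TP
print(accuracy_score(y_test, y_pred))    # overall accuracy
print(f1_score(y_test, y_pred))          # F-score
print(roc_auc_score(y_test, y_prob))     # area under the ROC curve
```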
IV. DISCUSSION
Random Forest seems to have performed very well for predicting Stroke outcome with the information provided by the dataset. As with many disease prediction models, the cost of a false negative is much higher than that of a false positive. This is because, even if a person is not prone to suffering a stroke, taking precautions and the recommended treatment has a low negative impact on that person's health. On the other hand, if we state that a person has no risk of the disease when they do, we may even put their life in danger. This is why it is very important for the model to be evaluated in light of the business problem to be solved. The model resulting from this research showed a very low rate of false negative predictions, so it appears to perform well.

Of the features considered for the model construction, Hypertension and Heart disease are clinical indicators of Stroke risk; therefore, treating these diseases can help reduce it. The average glucose level is another clinical parameter that can be controlled and treated by medical staff. Smoking is another factor, so avoiding it is another necessary measure. Age, although not controllable, might be considered an amplifier of Stroke risk given its strong correlation with it, which should also be considered by medical staff when identifying at-risk populations.

The dataset used, although highly unbalanced and with much missing information, was useful for creating the model. Nevertheless, conditions in which a large number of values need to be inferred or calculated synthetically may not represent the best scenario, especially when dealing with real-world problems that can directly affect people's lives.

V. CONCLUSIONS
This research showed that, through Data Science and the use of Machine Learning algorithms, it was possible to predict Stroke outcome from known information about individuals. Also, the CRISP-DM methodology served as a guide through the analysis of the data, which made the process much simpler and more efficient, without losing track of the business problem itself and while taking the right decisions based on it.

REFERENCES
[1] NHLBI, "What Is a Stroke?," 2014. https://web.archive.org/web/20150218230259/http://www.nhlbi.nih.gov/health/health-topics/topics/stroke/# (accessed Jun. 06, 2021).
[2] WHO, "The top 10 causes of death." https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death (accessed Jun. 09, 2021).
[3] M. A. Moskowitz, E. H. Lo, and C. Iadecola, "The science of stroke: Mechanisms in search of treatments," Neuron, vol. 67, no. 2, pp. 181–198, 2010, doi: 10.1016/j.neuron.2010.07.002.
[4] T. Liu, W. Fan, and C. Wu, "A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset," Artif. Intell. Med., vol. 101, p. 101723, 2019, doi: 10.1016/j.artmed.2019.101723.
[5] J. K. Kim, Y. J. Choo, and M. C. Chang, "Prediction of Motor Function in Stroke Patients Using Machine Learning Algorithm: Development of Practical Models," J. Stroke Cerebrovasc. Dis., vol. 30, no. 8, p. 105856, 2021, doi: 10.1016/j.jstrokecerebrovasdis.2021.105856.
[6] Y. Hbid, M. Fahey, C. D. A. Wolfe, M. Obaid, and A. Douiri, "Risk Prediction of Cognitive Decline after Stroke," J. Stroke Cerebrovasc. Dis., vol. 30, no. 8, p. 105849, 2021, doi: 10.1016/j.jstrokecerebrovasdis.2021.105849.
[7] A. Dey, "Machine Learning Algorithms: A Review," Int. J. Comput. Sci. Inf. Technol., vol. 7, no. 3, pp. 1174–1179, 2016. [Online]. Available: www.ijcsit.com.
[8] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, "Machine learning: a review of classification and combining techniques," Artif. Intell. Rev., vol. 26, no. 3, pp. 159–190, 2006, doi: 10.1007/s10462-007-9052-3.
[9] Z. Usmani, "What is Kaggle, Why I Participate, What is the Impact? | Data Science and Machine Learning," 2017. Accessed: Jun. 06, 2021. [Online]. Available: https://www.kaggle.com/getting-started/44916.
[10] S. Raschka, J. Patterson, and C. Nolet, "Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence," Inf., vol. 11, no. 4, 2020, doi: 10.3390/info11040193.
[11] H. G. Ceballos, R. Morales-Menendez, and R. A. Ramírez-, "A Research-based Learning Approach to Teach Data Science using Covid-19 and Related Domains," pp. 1–28.
[12] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-Sampling Technique," J. Artif. Intell. Res., 2002, doi: 10.1613/jair.953.

