
Stroke Prediction using Distributed Machine Learning Based on Apache Spark

Research · January 2019


DOI: 10.13140/RG.2.2.13478.68162

International Journal of Advanced Science and Technology
Vol. 28, No. 15, (2019), pp. 89-97

Stroke Prediction using Distributed Machine Learning Based on Apache Spark

Hager Ahmed1, Sara F. Abd-el ghany2, Eman M. G. Younis3, Nahla F. Omran4, Abdelmgeid A. Ali5
1,2,5 Faculty of Computers and Information, Minia University, Egypt
2,4 Department of Computer Science, Faculty of Science, South Valley University, Egypt

Abstract
Stroke is one of the leading causes of death and one of the primary causes of severe
long-term disability in the world. In this paper, we compare different distributed machine
learning algorithms for stroke prediction on the Healthcare Dataset Stroke. The work is
implemented on Apache Spark, one of the most popular big data platforms, which includes
the MLlib library. MLlib is an API integrated with Spark that provides machine learning
algorithms. Four machine learning classification algorithms were used to build the stroke
prediction model: Decision Tree, Support Vector Machine, Random Forest Classifier, and
Logistic Regression. Hyperparameter tuning and cross-validation were applied to the
machine learning algorithms to improve the results. Accuracy, Precision, Recall, and
F1-measure were used to measure the performance of the models. The results showed that
the Random Forest Classifier achieved the best accuracy at 90%.

Keywords: Stroke; Stroke Prediction; Machine Learning; Big Data; Apache Spark

1. Introduction
Stroke has become one of the most significant threats to public health worldwide [1].
Stroke ranks second after heart disease in terms of lost life years [2, 3].
Stroke is a sudden onset of focal neurological deficits lasting more than 24 hours, and
it is caused by cerebral artery occlusion or atherosclerosis. Signs of stroke usually appear
abruptly, but they can also develop gradually. In 2016, in the United States, a dramatic
rise in the number of stroke patients placed a heavy load on the health care system [4].
Long-term stroke disabilities impose a physical, mental, and financial burden on patients,
their families, and the community, while early detection is believed to improve recovery
and reduce disability [5].
Early prediction of stroke is useful for prevention and for early treatment
intervention. Machine learning and data mining play essential roles in predicting
stroke; examples include the support vector machine [6], logistic regression [7], and the
random forest classifier and neural network [8]. Machine learning is a type of artificial
intelligence that aims to give computers a human-like capability to learn. Its goal is to
allow computers to perform a particular task by relying on patterns and inference rather
than explicit instructions [9].
Big data is a large and complex amount of data that cannot be handled using traditional
analysis methods. This data can be in structured, semi-structured, and unstructured forms.
The massive flow of data has led to the need for better analytical methods, as traditional
methods have become inefficient for processing big data [10, 11]. Therefore, frameworks
such as Apache Hadoop [12] and Apache Spark [13] have been developed to analyze, store,
and process large amounts of data.
Apache Spark [13, 14] is an open-source framework for data analytics that provides
fault tolerance and processes data in real time. Spark can work with structured data such
as CSV files and semi-structured data such as JSON. It offers high-level APIs such as
Spark Streaming and MLlib. MLlib is Apache Spark's scalable machine learning library.
It offers different types of machine learning: classification, regression, and clustering
[15].
The contribution of this research is the design of a distributed machine learning system
based on Apache Spark to predict stroke disease. Four machine learning models are used
to predict stroke: logistic regression, support vector machine, decision tree, and
random forest. Furthermore, performance measures such as accuracy, precision, and recall
have been computed. Moreover, data preprocessing techniques were applied to the
stroke dataset.
The rest of this paper is organized as follows. Section 2 presents related works. The
proposed system for predicting stroke disease is described in Section 3. Section 4 presents
the experimental results. Finally, Section 5 concludes the paper.

2. Related Works
Several researchers used machine learning algorithms for stroke prediction. The
contributions of some research studies are described in this section.
Shanthi et al. [16] used Artificial Neural Networks (ANN) for the prediction of
thromboembolic stroke disease. The healthcare dataset stroke data, with eight important
patient attributes, was used. This work demonstrates ANN-based prediction of stroke
disease, improving the accuracy to 89% with a higher consistency rate. The ANN exhibited
good performance for the prediction of stroke disease.
Kansadub et al. [17] applied decision trees (DTs), naive Bayes, and
ANN to predict stroke on the healthcare dataset stroke data. The researchers found that
DT was the best classifier among the methods used.
In the same context, Sung et al. [18] compared the performance of k-nearest neighbor
(KNN), multiple linear regression (MLR), and a regression tree model for predicting stroke
severity; the results showed that KNN had better accuracy than the other models.
Arslan et al. [19] used Support Vector Machine (SVM), Stochastic Gradient
Boosting (SGB), and penalized logistic regression (PLR) to predict stroke on a dataset
collected from Turgut Ozal Medical Centre, Inonu University, Malatya, Turkey.
The findings showed that SVM achieved the highest accuracy of 98%.
Linder et al. [20] compared logistic regression (LR) and artificial
neural networks (ANNs) for classifying acute ischemic stroke from the German Stroke
Database. The results of this study showed that LR performed better than ANNs for the
classification of acute ischemic stroke.
Khosla et al. [21] compared the Cox proportional hazards model with machine
learning methods for the prediction of stroke on the Cardiovascular Health Study
dataset. The results showed that the support vector machine (SVM) achieved a higher
area under the ROC curve than the Cox proportional hazards model.
Adam et al. [22] compared two algorithms, decision tree and k-nearest
neighbor (KNN), for the classification of stroke on a dataset from Sugam Multispecialty
Hospital, Kumbakonam, Tamil Nadu, India. The researchers concluded that the
decision tree classifier performed better than the KNN algorithm.
Cheng et al. [23] worked on predicting ischemic stroke using two ANN
models on a dataset from Sugam Multispecialty Hospital, Kumbakonam, Tamil Nadu,
India. The two models achieved accuracy rates of 79.2% and 95.1%.

Previous studies of stroke prediction have used only traditional, non-distributed
machine learning methods. In our research, we use distributed machine
learning on the Spark platform to predict stroke.

3. Materials and Methods


3.1. Database
The Healthcare Dataset Stroke [24] was used to train and test models for predicting stroke
disease. This dataset consists of 10 independent variables as features and one dependent
variable as the class label, which indicates stroke disease. The feature names are
gender, age, hypertension, heart_disease, ever_married, work_type, residence_type,
avg_glucose_level, bmi, and smoking_status. The class label has two values: 0
represents the absence of stroke disease, while 1 represents the presence of
stroke disease. Table 1 lists the features and their descriptions.

Table 1. Feature names and descriptions of the stroke dataset

No.  Feature            Description
1    Age                Age
2    Gender             Male and Female
3    Hypertension       Hypertension
4    Heart_disease      1: has heart disease; 0: does not have heart disease
5    Ever_married       1: married; 0: not married
6    Work_type          Children, Private, Never worked, Govt job, Self employed
7    Residence_type     Rural, Urban
8    Avg_glucose_level  Average glucose level
9    bmi                Body mass index
10   smoking_status     Never smoked, Formerly smoked

3.2. The proposed system of predicting the stroke disease


Figure 1 below illustrates the architecture of the stroke disease prediction system. The
proposed system includes five stages: 1) loading the stroke dataset, 2) data
pre-processing, 3) cross-validation and hyperparameter tuning, 4) classifiers, and 5)
evaluating classifiers.


Figure 1. The architecture of the stroke disease prediction system.

A) Data pre-processing
Data pre-processing is a primary step for adequately describing the data for the
machine learning algorithm. It plays an essential role in improving the performance
of machine learning models. In this stage, several steps are applied:
1. The smoking_status and bmi features have many missing values. The mean is used
to fill the missing values.
2. Categorical features are converted into numerical data using LabelEncoder.
3. The dataset is imbalanced, meaning that the class labels are not represented
in equal proportions. We handle the imbalanced data using random resampling
techniques.
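
The three pre-processing steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the helper names, the toy values, and the choice of random oversampling of the minority class (one possible form of random resampling) are assumptions made for illustration only.

```python
import random
from collections import Counter

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced.

    Assumption: oversampling is only one way to resample randomly;
    undersampling the majority class would be another.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy records (hypothetical values, not taken from the dataset)
bmi = fill_missing_with_mean([22.0, None, 30.0, 26.0])
gender, gender_map = label_encode(["Male", "Female", "Female", "Male"])
stroke = [0, 0, 0, 1]
rows = list(zip(bmi, gender))
balanced_rows, balanced_labels = random_oversample(rows, stroke)
```

After resampling, both class labels appear the same number of times, so a classifier is no longer rewarded for always predicting the majority class.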

B) Machine learning algorithms

In this stage, four machine learning algorithms are used: Logistic regression
(LR), Random forest classifier (RF), Decision tree (DT), and Support vector machine
(SVM).

• Logistic regression is widely used in many domains, such as the biological
sciences. The logistic regression algorithm is used to find the relationship
between the target and predictive variables. The target variable is binary, 0 or 1.
The purpose of our logistic regression algorithm is to find the best fit that is
diagnostically reasonable to describe the relationship between our target variable
and the predictive variables [25].

• The decision tree is a type of supervised classifier consisting of a set of rules. A
decision tree has two main parts: the internal nodes, which make decisions, and the leaf
nodes, which have no child nodes and are each associated with a label. Decision trees
support various data types in classifying instances [26].

• Random forest is a popular machine learning classifier for developing prediction
models in many research settings. Random forests are a collection of trees which
are constructed using randomly selected training datasets and random subsets of
predictor variables for modeling outcomes. Random forest often gives higher
accuracy compared to a single decision tree model [27].

• A support vector machine is used for both classification and regression problems.
The goal of SVM is to obtain the most suitable hyperplane that can divide the
dataset into two classes, which are 0 and 1 [28].
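
As a rough illustration of how these classifiers turn a feature vector into a 0/1 label, the following sketch shows only the prediction step of three of them. The weights, thresholds, and the two features used are hypothetical, and training is omitted entirely; this is not the MLlib code used in the experiments.

```python
import math
from collections import Counter

def logistic_predict(weights, bias, x, threshold=0.5):
    """Logistic regression: sigmoid of a linear score, thresholded to 0 or 1."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    prob = 1.0 / (1.0 + math.exp(-score))
    return (1 if prob >= threshold else 0), prob

def linear_svm_predict(weights, bias, x):
    """Linear SVM: the side of the hyperplane w.x + b = 0 decides the class."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else 0

def forest_predict(trees, x):
    """Random forest: majority vote over the predictions of individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical parameters over two features: (age, avg_glucose_level)
w, b = [0.05, 0.01], -4.0
patient = [70, 150]

label_lr, prob_lr = logistic_predict(w, b, patient)
label_svm = linear_svm_predict(w, b, patient)

# Three one-rule "trees" (decision stumps) stand in for full decision trees
stumps = [
    lambda x: int(x[0] > 60),
    lambda x: int(x[1] > 200),
    lambda x: int(x[0] > 50),
]
label_rf = forest_predict(stumps, patient)
```

Each stump here is itself a very small decision tree: one internal node that tests a feature and two leaves that carry the labels.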

C) Cross-validation and Hyperparameter Tuning
• Hyperparameter tuning: the hyperparameters of the machine learning
algorithms are tuned [29]. We define a set of candidate values for each hyperparameter
of each classifier. Then, the grid search method is applied to test each value and select
the values that achieve the best performance.
• K-fold cross-validation: the dataset is divided into k folds of equal size.
k-1 folds are used for training, and the remaining fold is used to
evaluate the models. In our work, we applied k = 10. In each round of the 10-fold CV
process, 10% of the data is used to test the models, and 90% is used to train the
models.
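
The interaction between grid search and 10-fold cross-validation can be sketched as follows. A one-dimensional k-nearest-neighbour rule is used purely as a stand-in model because it fits in a few lines; it is not one of the four classifiers tuned in this work, and the data and the grid values are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def knn_predict(train_x, train_y, x, n_neighbors):
    """Stand-in model: majority label among the nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in order[:n_neighbors])
    return 1 if 2 * votes > n_neighbors else 0

def cross_val_accuracy(xs, ys, n_neighbors, k=10):
    """Mean accuracy over k rounds; each round holds out one fold for testing."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
            for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

def grid_search(xs, ys, grid, k=10):
    """Evaluate every candidate value and return the best one with all scores."""
    results = {value: cross_val_accuracy(xs, ys, value, k) for value in grid}
    best = max(results, key=results.get)
    return best, results

# Hypothetical data: 20 samples, label 1 when the feature is 10 or more
xs = list(range(20))
ys = [1 if x >= 10 else 0 for x in xs]
folds = k_fold_indices(len(xs), 10)
best_k, results = grid_search(xs, ys, grid=[1, 3, 5])
```

With 20 samples and k = 10, each fold holds two samples, so every round tests on 10% of the data and trains on the remaining 90%.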
D) Evaluating Classifiers

To evaluate the performance of the models, we used the confusion matrix to
calculate accuracy, precision, recall, and f-measure.

The confusion matrix describes the performance of a model on a set of test data. It gives
two types of correct predictions and two types of incorrect predictions for the classifier
[30]. Table 2 shows the confusion matrix. TP is the number of true positive predictions, TN
is the number of true negatives, FP is the number of false positives, and
FN is the number of false negatives. The accuracy, precision, recall, and f-measure
(f-score) are defined as follows:

Table 2. Confusion matrix

                  Predicted Class 0   Predicted Class 1
Actual Class 0    TP                  FN
Actual Class 1    FP                  TN

• Accuracy shows the overall performance of the classification system:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

• Precision is the number of correctly classified positive examples divided by the
total number of predicted positive examples [31]:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

• Recall is the number of correctly classified positive examples divided by the
total number of actual positive examples:

\[ \text{Recall} = \frac{TP}{TP + FN} \]

• F-measure represents the relationship between Precision and Recall (it is their
harmonic mean). F-measure will always be nearer to the smaller value
of Precision or Recall [32]:

\[ \text{F-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \]
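
The four measures follow directly from the confusion-matrix counts, as the short sketch below shows. The function name and the counts in the example are hypothetical and are not results from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and f-measure from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical counts for a test set of 200 samples
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

In this example precision (about 0.89) is higher than recall (0.80), and the f-measure (about 0.84) lies between them, closer to the smaller value.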

4. Results of applying machine learning algorithms and Discussion
Four supervised machine learning algorithms, SVM, RF, LR, and DT, were applied to develop
the predictive models. We applied 10-fold cross-validation
and hyperparameter tuning with the machine learning algorithms to improve the results. In
each round of the 10-fold cross-validation, 10% of the data is used for testing and 90% is
used for training. Four performance measures were used to evaluate the
classification models: accuracy, recall, precision, and f1-score.

4.1. Experimental setup


The predictive models were developed on Apache Spark and written in PySpark.
In addition, we used various API libraries that are integrated with Spark. Spark's MLlib was
used to run the classification algorithms. We also used Python libraries to
handle the imbalanced dataset.
The predictive models were executed on a Spark cluster consisting of one master
node and two worker nodes. The cluster was built from Ubuntu 14.04 virtual machines running
the Java VM, with 16 GB of RAM, seven cores, and a 100 GB disk.

4.2. Results of applying logistic regression

Table 3 shows the precision, recall, and f1-score obtained by logistic
regression for each class. For class 0, precision was the highest measure at 81%,
while recall was the lowest at 73%. For class 1, recall was the highest measure at 82%,
followed by f1-score at 79% and precision at 75%.

Table 3. Results of applying logistic regression (%)

Class   Precision   Recall   F1-score
0       81          73       76
1       75          82       79

4.3. Results of applying random forest classifier

Table 4 shows the precision, recall, and f1-score obtained by the random forest
classifier for each class. For class 0, precision was the highest measure at 96%,
while recall was the lowest at 85%. For class 1, recall was the highest measure
at 96%, while precision was the lowest at 87%.

Table 4. Results of applying random forest classifier (%)

Class   Precision   Recall   F1-score
0       96          85       90
1       87          96       91
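
As a quick consistency check, the f1-scores in Table 4 can be recomputed from the tabulated precision and recall using the f-measure formula of Section 3. The sketch below is only an arithmetic check on the rounded values printed in the table; the helper name is hypothetical.

```python
def f1_from_percent(precision_pct, recall_pct):
    """F-measure (in %) from precision and recall given in %."""
    p, r = precision_pct / 100, recall_pct / 100
    return 100 * 2 * p * r / (p + r)

# (precision, recall) per class, as printed in Table 4
table4 = {0: (96, 85), 1: (87, 96)}
f1_scores = {cls: round(f1_from_percent(p, r)) for cls, (p, r) in table4.items()}
```

The recomputed values, 90 for class 0 and 91 for class 1, agree with the f1-score column of Table 4. Because the tabulated inputs are already rounded, a difference of about one percentage point would not be surprising when the same check is applied to other tables.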

4.4. Results of applying decision tree

Table 5 shows the precision, recall, and f1-score obtained by the decision tree for
each class. For class 0, precision was the highest measure at 82%, while recall
was the lowest at 75%. For class 1, recall was the highest measure at 84%.

Table 5. Results of applying decision tree (%)

Class   Precision   Recall   F1-score
0       82          75       79
1       77          84       81


4.5. Results of applying a linear support vector machine

Table 6 presents the precision, recall, and f1-score obtained by the linear
support vector machine for each class. For class 0, precision was the highest
measure at 81%, while recall was the lowest at 72%. For class 1, recall was the
highest measure at 83%, followed by f1-score at 79% and precision at 75%.

Table 6. Results of applying a linear support vector machine (%)

Class   Precision   Recall   F1-score
0       81          72       76
1       75          83       79

4.6. Discussion
Figure 2 shows the accuracy of applying LR, RF, DT and SVM. The random forest
recorded the highest accuracy of 90%. The decision tree registered the second-highest
accuracy at 79%. The support vector machine and logistic regression techniques recorded
the same accuracy at 77%.

Figure 2. Accuracy of applying machine learning algorithms
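
The reported accuracies can be related to the per-class recalls in Tables 3 to 6. If the two classes are equally represented in the evaluation data, which is an assumption here (plausible after resampling, but not stated explicitly), accuracy equals the mean of the two recalls. The sketch below applies this to the tabulated values.

```python
# Per-class recall (%) from Tables 3 to 6 and accuracy (%) from the discussion
recall = {"LR": (73, 82), "RF": (85, 96), "DT": (75, 84), "SVM": (72, 83)}
reported_accuracy = {"LR": 77, "RF": 90, "DT": 79, "SVM": 77}

def balanced_accuracy(recall_class0, recall_class1):
    """Accuracy when both classes contribute the same number of test samples."""
    return (recall_class0 + recall_class1) / 2

estimated = {name: balanced_accuracy(*r) for name, r in recall.items()}
gaps = {name: estimated[name] - reported_accuracy[name] for name in recall}

# Models ordered from highest to lowest reported accuracy
ranking = sorted(reported_accuracy, key=reported_accuracy.get, reverse=True)
```

Under this assumption the estimates are 77.5, 90.5, 79.5, and 77.5, each within one percentage point of the reported accuracy, and the ordering of the models is unchanged, with random forest first and the decision tree second.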

5. Conclusion
Stroke ranks second after heart disease in terms of lost life years. Machine learning
plays an essential role in predicting stroke. The proposed stroke prediction system was
developed on Apache Spark and uses distributed machine learning to train and test the
models. It consists of five stages: loading the stroke dataset, data pre-processing,
cross-validation and hyperparameter tuning, classifiers, and evaluating classifiers. The
results showed that the random forest classifier achieved the best accuracy at 90%.

References
1. Katan, M. and A. Luft. Global burden of stroke. in Seminars in neurology. 2018. Thieme
Medical Publishers.
2. Feigin, V.L., B. Norrving, and G.A. Mensah, Global burden of stroke. Circulation
research, 2017. 120(3): p. 439-448.

3. Naghavi, M., et al., Global, regional, and national age-sex specific mortality for 264
causes of death, 1980–2016: a systematic analysis for the Global Burden of Disease
Study 2016. The Lancet, 2017. 390(10100): p. 1151-1210.
4. Mozaffarian, D., et al., Heart disease and stroke statistics-2016 update a report from the
American Heart Association. Circulation, 2016. 133(4): p. e38-e48.
5. Veerbeek, J.M., et al., Early prediction of outcome of activities of daily living after stroke:
a systematic review. Stroke, 2011. 42(5): p. 1482-1488.
6. Swethalakshmi, H., et al. Online handwritten character recognition of Devanagari and
Telugu Characters using support vector machines. 2006.
7. Al-Talqani, H.M., Dyslipidemia and Cataract in Adult Iraqi Patients. EC Ophthalmology,
2017. 5: p. 162-171.
8. McKinley, R., et al., Fully automated stroke tissue estimation using random forest
classifiers (FASTER). Journal of Cerebral Blood Flow & Metabolism, 2017. 37(8): p.
2728-2741.
9. Jos Timanta Tarigan, C.L.G., Elviawaty Muisa Zamzami, A review on applying
machine learning in game industry. International Journal of Advanced Science
and Technology, 2019. 28(2).
10. Saiteja Myla, S.T.M., K Karthikeya, Preetham B, SK Hasane Ahammad, The Rise of
“Big Data” in the field of Cloud Analytics. International Journal of Advanced Science
and Technology, 2019. 28(8).
11. Ara, A. and A. Ara, Beyond Hadoop: The Paradigm Shift of Data From Stationary to
Streaming Data for Data Analytics.
12. Hadoop, A. Apache Hadoop. [cited 2019]; Available from: https://hadoop.apache.org/.
13. Spark, A. Apache Spark. [cited 2019]; Available from: https://spark.apache.org/.
14. Ahmed, H., et al., Heart disease identification from patients’ social posts, machine
learning solution on Spark. Future Generation Computer Systems, 2019.
15. Meng, X., et al., Mllib: Machine learning in apache spark. The Journal of Machine
Learning Research, 2016. 17(1): p. 1235-1241.
16. Shanthi, D., G. Sahoo, and N. Saravanan, Designing an artificial neural network model
for the prediction of thrombo-embolic stroke. International Journals of Biometric and
Bioinformatics (IJBB), 2009. 3(1): p. 10-18.
17. Kansadub, T., et al. Stroke risk prediction model based on demographic data. in 2015 8th
Biomedical Engineering International Conference (BMEiCON). 2015. IEEE.
18. Sung, S.-F., et al., Developing a stroke severity index based on administrative data was
feasible using data mining techniques. Journal of clinical epidemiology, 2015. 68(11): p.
1292-1300.
19. Arslan, A.K., C. Colak, and M.E. Sarihan, Different medical data mining approaches
based prediction of ischemic stroke. Computer methods and programs in biomedicine,
2016. 130: p. 87-92.
20. Linder, R., et al., Two models for outcome prediction. Methods of information in
medicine, 2006. 45(05): p. 536-540.
21. Khosla, A., et al. An integrated machine learning approach to stroke prediction. in
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery
and data mining. 2010. ACM.
22. Adam, S.Y., A. Yousif, and M.B. Bashir, Classification of ischemic stroke using machine
learning algorithms. Int J Comput Appl, 2016. 149(10): p. 26-31.
23. Cheng, C.-A., Y.-C. Lin, and H.-W. Chiu. Prediction of the prognosis of ischemic stroke
patients after intravenous thrombolysis using artificial neural networks. in ICIMTH.
2014.
24. Healthcare dataset stroke data. [cited 2019]; Available from:
https://www.kaggle.com/asaumya/healthcare-dataset-stroke-data.
25. Zhu, C., C.U. Idemudia, and W. Feng, Improved logistic regression model for diabetes
prediction by integrating PCA and K-means techniques. Informatics in Medicine
Unlocked, 2019: p. 100179.
26. Witten, I.H., et al., Data Mining: Practical machine learning tools and techniques. 2016:
Morgan Kaufmann.
27. Breiman, L., Random forests. Machine learning, 2001. 45(1): p. 5-32.
28. Han, J., J. Pei, and M. Kamber, Data mining: concepts and techniques. 2011: Elsevier.

29. Claesen, M., et al. Hyperparameter tuning in python using optunity. in Proceedings of the
International Workshop on Technical Computing for Machine Learning and
Mathematical Engineering. 2014.
30. Haq, A.U., et al., A hybrid intelligent system framework for the prediction of heart disease
using machine learning algorithms. Mobile Information Systems, 2018. 2018.
31. Davis, J. and M. Goadrich. The relationship between Precision-Recall and ROC curves.
in Proceedings of the 23rd international conference on Machine learning. 2006. ACM.
32. Chai, K.M.A. Expectation of F-measures: Tractable exact computation and some
empirical observations of its properties. in SIGIR. 2005.

Authors
Hager Ahmed obtained a Master’s degree in Computer Science in 2017 and a
Bachelor’s degree in Information Systems from the Faculty of Computers
and Information, Assiut University, Egypt. Hager Ahmed is a research member of the Big
Data Team in Egypt, with research interests centered on Big Data Analytics,
Data Mining, Sentiment Analysis, Natural Language Processing, Machine
Learning, and Streaming Data.
Sara F. Abd-el ghany works at South Valley University. Sara obtained a
Bachelor’s degree in Computer Science from the Faculty of Science, South Valley
University, Egypt, and a Master’s degree in Computer Science in 2016, with
research interests centered on Big Data Analytics, Data Mining, and Machine
Learning.
Eman Younis is currently working as an Associate Professor at Minia
University, Faculty of Computers and Information, Information Systems
Department. She received her B.Sc. degree from Zagazig University, Egypt, in 2002. She
obtained her M.Sc. degree from Menoufia University, Egypt, in 2007. She received
her Ph.D. degree from Cardiff University, UK, in 2014. She spent some time as
a post-doc at Nottingham Trent University, UK. Her research interests are machine
learning, data mining, geo-spatial data processing, semantic web, sentiment
analysis, and emotion recognition.

Dr. Nahla F. Omran is currently working as a lecturer of Computer Science at the
Faculty of Science, South Valley University. She has published over 15 research
papers in prestigious international journals and conference proceedings. She has
supervised over 15 Ph.D. and M.Sc. students. Dr. Nahla’s interests are Big Data,
Machine Learning, Algorithms, Cloud Computing, IoT, Data Science, image
processing, and data mining.

Abdelmgeid A. Ali is a Professor in the Computer Science Department, Minia
University, El Minia, Egypt. He has published over 80 research papers in
prestigious international journals and conference proceedings. He has supervised
over 60 Ph.D. and M.Sc. students. Prof. Ali is a member of the International Journal
of Information Theories and Applications (ITA). Prof. Ali’s interests are Information
Retrieval, Software Engineering, Image Processing, Data Security, Metaheuristics,
IoT, Digital Image Steganography, and Data Warehousing.

ISSN: 2005-4238 IJAST
Copyright ⓒ 2019 SERSC
