Stroke Prediction Using Distributed Machine Learning Based On Apache Spark
All content following this page was uploaded by Nahla Omran on 08 January 2020.
Abstract
Stroke is a major cause of death and one of the primary causes of severe long-term
disability in the world. In this paper, we compare different distributed machine learning
algorithms for stroke prediction on the Healthcare Dataset Stroke Data. The work is
implemented on Apache Spark, one of the most popular big data platforms, which includes
the MLlib library. MLlib is an API integrated with Spark that provides machine learning
algorithms. Four machine learning classification algorithms were used to build the stroke
prediction model: Decision Tree, Support Vector Machine, Random Forest Classifier, and
Logistic Regression. Hyperparameter tuning and cross-validation were applied to the
machine learning algorithms to improve the results. Accuracy, precision, recall, and
F1-measure were used to evaluate the performance of the machine learning models. The
results showed that the Random Forest Classifier achieved the best accuracy, at 90%.
Keywords: Stroke; Stroke Prediction; Machine Learning; Big Data; Apache Spark
1. Introduction
Stroke has become one of the most significant threats to public health worldwide [1].
Stroke ranks second after heart disease in terms of life years lost [2, 3].
Stroke is a sudden onset of focal neurological deficits lasting more than 24 hours,
caused by cerebral artery occlusion or atherosclerosis. Signs of stroke typically appear
abruptly, although in some cases they develop gradually. In 2016, in the United States, a
dramatic rise in the number of stroke patients placed a large load on the health care
system [4]. Long-term stroke disabilities impose a physical, mental, and financial burden
on patients, their families, and the community, while early detection is believed to
improve recovery and reduce disabilities [5].
Early prediction of stroke is useful for prevention and for early treatment
intervention. Machine learning and data mining play essential roles in predicting
stroke. Examples include support vector machines [6], logistic regression [7], and random
forest classifiers and neural networks [8]. Machine learning is a type of artificial
intelligence that aims to give computers human-like thinking capability. Its goal is to
allow computers to perform a particular task by relying on patterns and inference, without
explicit instructions [9].
Big data is a large and complex amount of data that cannot be handled using traditional
analysis methods. This data can be in structured, semi-structured, and unstructured forms.
The massive flow of data has led to the need for better analytical methods, as traditional
methods have become inefficient for processing big data [10, 11]. Therefore, frameworks
have been developed to analyze, store, and process large amounts of data, such as
Apache Hadoop [12] and Apache Spark [13].
Apache Spark [13, 14] is an open-source framework for data analytics that provides
fault tolerance and processes data in real time. Spark can work with structured data such
as CSV files and semi-structured data such as JSON. It offers high-level APIs such as
Spark Streaming and MLlib. MLlib is Apache Spark's scalable machine learning library.
It offers different types of machine learning: classification, regression, and clustering
[15].
ISSN: 2005-4238 IJAST
Copyright ⓒ 2019 SERSC
International Journal of Advanced Science and Technology
Vol. 28, No. 15, (2019), pp. 89-97
The contribution of the proposed research is the design of a distributed machine learning
system based on Apache Spark to predict stroke disease. Four machine learning models are
used to predict stroke: logistic regression, support vector machine, decision tree, and
random forest. Furthermore, several performance measures, such as accuracy, precision,
and recall, are computed. Moreover, data preprocessing techniques are applied to the
stroke dataset.
The rest of this paper is organized as follows. Section 2 presents related works. The
proposed system for predicting stroke disease is described in Section 3. Section 4 presents
the experimental results. Finally, Section 5 presents the conclusion of the paper.
2. Related Works
Several researchers used machine learning algorithms for stroke prediction. The
contributions of some research studies are described in this section.
Shanthi et al. [16] used Artificial Neural Networks (ANN) for the prediction of
thromboembolic stroke disease. The healthcare dataset stroke data, with eight important
patient attributes, was used. This research work demonstrates ANN-based
prediction of stroke disease, improving the accuracy to 89% with a higher consistent
rate. The ANN exhibits good performance levels for the prediction of stroke disease.
Kansadub et al. [17] applied decision trees (DTs), naive Bayes, and
ANN to predict stroke on the healthcare dataset stroke data. The researchers found that
DT was the best classifier among the methods used.
In the same context, Sung et al. [18] compared the performance of k-nearest neighbor
(KNN), multiple linear regression (MLR), and a regression tree model for predicting stroke
severity; the results showed that KNN has better accuracy than the other models.
Arslan et al. [19] used Support Vector Machine (SVM), Stochastic Gradient
Boosting (SGB), and penalized logistic regression (PLR) to predict stroke on a
dataset collected from Turgut Ozal Medical Centre, Inonu University, Malatya, Turkey.
The findings of the research showed that SVM achieved the highest accuracy, at 98%.
Linder et al. [20] compared logistic regression (LR) and artificial
neural networks (ANNs) for classifying acute ischemic stroke using the German Stroke
Database. The results of this study showed that LR was better than ANNs for the
classification of acute ischemic stroke.
Khosla et al. [21] compared the Cox proportional hazards model with machine
learning methods for the prediction of stroke on the dataset of the Cardiovascular
Health Study. The results showed that the support vector machine (SVM) achieved a higher
area under the ROC curve than the Cox proportional hazards model.
Adam et al. [22] compared two algorithms, decision tree and k-nearest
neighbor (KNN), for the classification of stroke on a dataset from Sugam Multispecialty
Hospital, Kumbakonam, Tamil Nadu, India. The researchers concluded that the
decision tree classifier performed better than the KNN algorithm.
Cheng et al. [23] worked on predicting ischemic stroke using two ANN
models on the dataset from Sugam Multispecialty Hospital, Kumbakonam, Tamil Nadu,
India. The researchers reported accuracy rates of 79.2% and 95.1%.
Previous studies of stroke prediction have used only traditional, non-distributed
machine learning methods. In our research, we use distributed machine
learning on the Spark platform to predict stroke.
A) Data pre-processing
Data pre-processing is a primary step for adequately describing the data for the
machine learning algorithm. It plays an essential role in improving the performance
of machine learning models. In this stage, several steps are applied.
1. The smoking-status and BMI features have many missing values. The mean is
used to fill missing values.
2. Categorical features are converted into numerical data using LabelEncoder.
3. The dataset is imbalanced. Imbalanced data means that the class labels
are not represented in equal proportions. We handle the imbalance
using random resampling techniques.
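The three steps above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the helper names and the toy records are hypothetical, mean imputation is shown only for a numeric feature (BMI), and random oversampling of the minority class is assumed as one possible form of random resampling.

```python
import random
from collections import Counter


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy records (hypothetical values, not taken from the stroke dataset).
bmi = fill_missing_with_mean([22.0, None, 30.0, 26.0])
gender_codes, gender_map = label_encode(["Male", "Female", "Female", "Male"])
rows = list(zip(bmi, gender_codes))
balanced_rows, balanced_labels = random_oversample(rows, [0, 0, 0, 1])
print(bmi, gender_map, Counter(balanced_labels))
```

Oversampling keeps every original record; undersampling the majority class would be an alternative that discards data instead.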
A support vector machine (SVM) can be used for both classification and regression
problems. The goal of SVM is to find the hyperplane that best separates the
dataset into two classes, 0 and 1 [28].
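As a rough illustration of the separating hyperplane, the sketch below classifies a point by the sign of w·x + b and computes the hinge loss commonly used as the training objective for linear SVMs. The function names, weights, and points are hypothetical and are not taken from the trained model.

```python
def linear_svm_predict(weights, bias, features):
    """Return class 1 if the point lies on the non-negative side of the hyperplane, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0


def hinge_loss(weights, bias, features, label):
    """Hinge loss for one example; the 0/1 label is mapped to -1/+1."""
    y = 1 if label == 1 else -1
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return max(0.0, 1.0 - y * score)


# Hypothetical hyperplane x1 + x2 - 1 = 0 in two dimensions.
w, b = [1.0, 1.0], -1.0
print(linear_svm_predict(w, b, [2.0, 2.0]))  # point on the positive side
print(linear_svm_predict(w, b, [0.0, 0.0]))  # point on the negative side
```

A correctly classified point far from the hyperplane has zero hinge loss, while a misclassified point is penalized in proportion to its distance on the wrong side.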
C) Cross-validation and Hyperparameter Tuning
Hyperparameter tuning is applied to the machine learning
algorithms [29]. We define a set of candidate values for each hyperparameter of
each classifier. Then, the grid search method is applied to test each value and
select the values that achieve the best performance.
K-Fold Cross-Validation: the dataset is divided into k folds of equal size.
k-1 folds are used for training, and the remaining fold is used to
evaluate the models. In our work, we applied k = 10. In the 10-fold CV
process, 10% of the data is used to test the models, and 90% is used to train
the models.
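The combination of grid search and k-fold cross-validation described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the hyperparameter names (max_depth, num_trees), the candidate values, and the scoring function toy_score are hypothetical stand-ins for training and evaluating a real classifier, and the folds are taken in order without shuffling.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def grid_search(param_grid, score_fn, n_samples, k=10):
    """Return the parameter combination with the best mean score over k folds."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [score_fn(params, tr, te) for tr, te in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, test_idx):
    """Hypothetical score standing in for 'train on train_idx, evaluate on test_idx'."""
    return 1.0 - abs(params["max_depth"] - 5) * 0.1 - abs(params["num_trees"] - 20) * 0.01


folds = list(k_fold_indices(100, 10))
best, score = grid_search({"max_depth": [3, 5, 7], "num_trees": [10, 20]}, toy_score, 100)
print(len(folds), best, score)
```

With 100 samples and k = 10, each fold holds 10 test samples and 90 training samples, matching the 10%/90% split stated above.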
D) Evaluating Classifiers
To evaluate the performance of the models, we used the confusion matrix to
calculate accuracy, precision, recall, and f-measure.
The confusion matrix describes the performance of a model on a set of test data. It gives
two types of correct predictions and two types of incorrect predictions for the classifier
[30]. Table 2 shows the confusion matrix. TP is the number of true positives, TN
is the number of true negatives, FP is the number of false positives, and
FN is the number of false negatives. The accuracy, precision, recall, and f-measure
(f-score) are defined as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × Precision × Recall / (Precision + Recall)
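As a minimal sketch, the four measures can be computed from the confusion-matrix counts using their standard definitions. The function name and the example counts below are hypothetical and do not correspond to any result reported in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 100-sample test set.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(m)
```

Because the F-measure is the harmonic mean of precision and recall, it always lies between the two values and is pulled toward the lower one.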
4. Results of applying machine learning algorithms and Discussion
Four supervised machine learning algorithms were applied to develop the
predictive models: SVM, RF, LR, and DT. We applied 10-fold cross-validation
and hyperparameter tuning with the machine learning algorithms to improve results. For the
10-fold cross-validation, 10% is used for the testing data, and 90% is used for the training
data. Four performance measures were used to evaluate the performance of the
classification models: accuracy, recall, precision, and f1-score.
Table 3 shows the results of precision, recall, and f1-score of applying logistic
regression for each class. For class 0, precision registered the highest percentage at 81%,
while recall registered the lowest rate at 73%. For class 1, f1-score achieved the highest
result at 79%.
Table 4 shows the results of precision, recall, and f1-score of applying random forest
classifier for each class. For class 0, precision registered the highest percentage at 96%,
while recall registered the lowest rate at 90%. For class 1, recall achieved the highest
result at 96%, while precision made the lowest result at 79%.
Table 5 shows the results of precision, recall, and f1-score of applying a decision tree for
each class. For class 0, precision recorded the highest percentage at 82%, while recall
registered the lowest rate at 75%. For class 1, recall achieved the highest rate at 84%.
Table 6 presents the results of precision, recall, and f1-score of applying a linear
support vector machine for each class. For class 0, precision registered the highest
percentage at 81%, while recall registered the lowest percentage at 72%. For class 1,
f1-score registered the highest rate at 79%.
4.6. Discussion
Figure 2 shows the accuracy of applying LR, RF, DT and SVM. The random forest
recorded the highest accuracy of 90%. The decision tree registered the second-highest
accuracy at 79%. The support vector machine and logistic regression techniques recorded
the same accuracy at 77%.
5. Conclusion
Stroke ranks second after heart disease in terms of life years lost. Machine learning
plays an essential role in predicting stroke. The proposed stroke prediction system is
developed on Apache Spark. It uses distributed machine learning to train and test the
models. It consists of five stages: loading the stroke dataset, data pre-processing,
cross-validation and hyperparameter tuning, classifiers, and evaluating classifiers. The
results showed that the random forest classifier achieved the best accuracy, at 90%.
References
1. Katan, M. and A. Luft. Global burden of stroke. in Seminars in neurology. 2018. Thieme
Medical Publishers.
2. Feigin, V.L., B. Norrving, and G.A. Mensah, Global burden of stroke. Circulation
research, 2017. 120(3): p. 439-448.
3. Naghavi, M., et al., Global, regional, and national age-sex specific mortality for 264
causes of death, 1980–2016: a systematic analysis for the Global Burden of Disease
Study 2016. The Lancet, 2017. 390(10100): p. 1151-1210.
4. Mozaffarian, D., et al., Heart disease and stroke statistics-2016 update a report from the
American Heart Association. Circulation, 2016. 133(4): p. e38-e48.
5. Veerbeek, J.M., et al., Early prediction of outcome of activities of daily living after stroke:
a systematic review. Stroke, 2011. 42(5): p. 1482-1488.
6. Swethalakshmi, H., et al. Online handwritten character recognition of Devanagari and
Telugu Characters using support vector machines. 2006.
7. Al-Talqani, H.M., Dyslipidemia and Cataract in Adult Iraqi Patients. EC Ophthalmology,
2017. 5: p. 162-171.
8. McKinley, R., et al., Fully automated stroke tissue estimation using random forest
classifiers (FASTER). Journal of Cerebral Blood Flow & Metabolism, 2017. 37(8): p.
2728-2741.
9. Jos Timanta Tarigan, C.L.G., Elviawaty Muisa Zamzami, A review on applying
machine learning in game industry. International Journal of Advanced Science
and Technology, 2019. 28(2).
10. Saiteja Myla, S.T.M., K Karthikeya, Preetham.B, SK Hasane Ahammad, The Rise of
“Big Data” in the field of Cloud Analytics. International Journal of Advanced Science
and Technology, 2019. 28(8).
11. Ara, A. and A. Ara, Beyond Hadoop: The Paradigm Shift of Data From Stationary to
Streaming Data for Data Analytics.
12. Hadoop, A. Apache Hadoop. [cited 2019]; Available from: https://hadoop.apache.org/.
13. Spark, A. Apache Spark. [cited 2019]; Available from: https://spark.apache.org/.
14. Ahmed, H., et al., Heart disease identification from patients’ social posts, machine
learning solution on Spark. Future Generation Computer Systems, 2019.
15. Meng, X., et al., Mllib: Machine learning in apache spark. The Journal of Machine
Learning Research, 2016. 17(1): p. 1235-1241.
16. Shanthi, D., G. Sahoo, and N. Saravanan, Designing an artificial neural network model
for the prediction of thrombo-embolic stroke. International Journals of Biometric and
Bioinformatics (IJBB), 2009. 3(1): p. 10-18.
17. Kansadub, T., et al. Stroke risk prediction model based on demographic data. in 2015 8th
Biomedical Engineering International Conference (BMEiCON). 2015. IEEE.
18. Sung, S.-F., et al., Developing a stroke severity index based on administrative data was
feasible using data mining techniques. Journal of clinical epidemiology, 2015. 68(11): p.
1292-1300.
19. Arslan, A.K., C. Colak, and M.E. Sarihan, Different medical data mining approaches
based prediction of ischemic stroke. Computer methods and programs in biomedicine,
2016. 130: p. 87-92.
20. Linder, R., et al., Two models for outcome prediction. Methods of information in
medicine, 2006. 45(05): p. 536-540.
21. Khosla, A., et al. An integrated machine learning approach to stroke prediction. in
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery
and data mining. 2010. ACM.
22. Adam, S.Y., A. Yousif, and M.B. Bashir, Classification of ischemic stroke using machine
learning algorithms. Int J Comput Appl, 2016. 149(10): p. 26-31.
23. Cheng, C.-A., Y.-C. Lin, and H.-W. Chiu. Prediction of the prognosis of ischemic stroke
patients after intravenous thrombolysis using artificial neural networks. in ICIMTH.
2014.
24. Healthcare dataset stroke data. [cited 2019]; Available from:
https://www.kaggle.com/asaumya/healthcare-dataset-stroke-data.
25. Zhu, C., C.U. Idemudia, and W. Feng, Improved logistic regression model for diabetes
prediction by integrating PCA and K-means techniques. Informatics in Medicine
Unlocked, 2019: p. 100179.
26. Witten, I.H., et al., Data Mining: Practical machine learning tools and techniques. 2016:
Morgan Kaufmann.
27. Breiman, L., Random forests. Machine learning, 2001. 45(1): p. 5-32.
28. Han, J., J. Pei, and M. Kamber, Data mining: concepts and techniques. 2011: Elsevier.
29. Claesen, M., et al. Hyperparameter tuning in python using optunity. in Proceedings of the
International Workshop on Technical Computing for Machine Learning and
Mathematical Engineering. 2014.
30. Haq, A.U., et al., A hybrid intelligent system framework for the prediction of heart disease
using machine learning algorithms. Mobile Information Systems, 2018. 2018.
31. Davis, J. and M. Goadrich. The relationship between Precision-Recall and ROC curves.
in Proceedings of the 23rd international conference on Machine learning. 2006. ACM.
32. Chai, K.M.A. Expectation of F-measures: Tractable exact computation and some
empirical observations of its properties. in SIGIR. 2005.
Authors
Hager Ahmed obtained a Master’s degree in Computer Science in 2017 and a
Bachelor’s degree in Information Systems from the Faculty of Computer
and Information, University Assuit, Egypt. Hager is a research member of the Big
Data Team in Egypt, with research interests centered on Big Data Analytics,
Data Mining, Sentiment Analysis, Natural Language Processing, Machine
Learning, and Streaming Data.
Sara F. Abd-el ghany works at South Valley University. Sara obtained a
Bachelor’s degree in Computer Science from the Faculty of Science, South Valley
University, Egypt, and a Master’s degree in Computer Science in 2016. Sara's
research interests are centered on Big Data Analytics, Data Mining, and Machine
Learning.
Eman Younis is currently working as an Associate Professor at Minia
University, Faculty of Computers and Information, Information Systems
Department. She received her B.Sc. degree from Zagazig University, Egypt, in 2002. She
obtained her M.Sc. degree from Meunofia University, Egypt, in 2007. She received
her Ph.D. degree from Cardiff University, UK, in 2014. She spent some time as
a post-doc at Nottingham Trent University, UK. Her research interests are machine
learning, data mining, Geo-spatial data processing, semantic web, sentiment
analysis and emotion recognition.