Professional Documents
Culture Documents
Stroke Prediction Using Machine Learning in A Distributes Environtment
Stroke Prediction Using Machine Learning in A Distributes Environtment
updates
1 Introduction
FalseAlarm
FalseAlarmRate = S ave L z'f e + F a l se Al arm
To develop a robust system, stroke detection rate should maximize and false
alarm rate should minimize. Hence this trade-off is shown in Fig. 1. As the data is
growing at an alarming rate, the importance of big data is comprehended. With
this, the need for frameworks to process this enormous data is rapidly increasing,
especially in the healthcare field. Apache Hadoop [2, 17] is the emerging big
data technology framework for distributed data storage and parallel processing.
Apache Spark [9, 15] is one of the framework which is designed for fast processing
using in-memory computation.
In this paper, we use Apache spark as a data processing framework. Spark
provides abstraction known as Resilient distributed dataset [18]. Spark executes
operation 10 times faster on disk and 100 times faster in-memory than Hadoop.
240 M. Rajora et al.
Best
.... /
/
/
/
/
Ii
/
z /
0
6 ·tP~/
~
0 I /
/
Q;-'b-j/
/
Eud
,/
0
1/
0 FALSE ALARM 1
The Spark stack currently consists of Spark core engine along with libraries
i.e MLlib for executing machine learning task in spark, Cluster Management
to acquire cluster resource for performing jobs and Spark SQL which combines
relational processing with Spark functional APL
The rest of the paper is organized as follows. In Sect. 2, the related work is
described by comparing it with the proposed system performance. In Sect. 3, an
elaborate description of the proposed solution is presented. It contains details
about the data used and explanation of the system model. Section 4 contains
the experimental analysis and results and also performance analysis of different
models and Sect. 5 concludes the paper.
2 Related Work
In [13], an artificial neural network was used to predict the thrombo-embolic
stroke disease. The dataset consisted of eight important features to be considered
for prediction. Simple ANN model was built which evaluated the accuracy score
of 0.89. It uses a relatively smaller dataset as it should use when working with
deep learning models.
In [14], Support vector machine algorithm with all its four kernels i.e linear,
quadratic, polynomial and RBF were used. The dataset consisted of features
like - age, sex, atrial fibrillation, walking symptoms, face deficit, arm deficit, Leg
deficit, dyphasia, visuospatial disorder, hemianopia, infarct visible on CT and
cerebellar signs. They worked with 300 training samples and 50 testing samples.
Accuracy score of the four kernels are as follows : linear was 91%, quadratic was
81 %, RBF was 59% and polynomial was 87.9%.
In [8], authors used the dataset which consisted the attributes like Sex, Age,
Province, Marital status, Education and occupation. They used three machine
Stroke Prediction Using Machine Learning in a Distributed Environment 241
learning algorithms i.e. naive bayes, decision tree and neural network with six
input layer, one hidden layer and two output layer to predict the stroke. The
accuracy score obtained from decision tree was 75%, by naive bayes was 72%
and by neural network it was 74%.
In [11], the data set consisted of 29074 records, of which 30% was used for
testing and 70% training. The algorithm used was decision tree, random forest
and neural network. The accuracy presented by decision tree was 74.31 %, by
random forest was 74.53% and 75.02% by neural network.
The limitation of the paper [8] and [11] is their accuracy score. The result
depicts the predictions on which person's life depends and risking accuracy in
medical domain costs high.
Among the related works, the accuracy predicted by our model has better
results. We compared six models to choose the best performing model and then
tuned the parameter to reach the state of art. The main contribution of our work
is implanting Gradient Boosting algorithm and then tuning various parameters
to reach an accuracy of 0.9449.
3 Proposed Solution
The objective of the paper is to build the robust system which can accurately
detect whether a person will suffer from stroke or not. For this purpose, machine
learning classification algorithms are applied on processed data in Apache spark
framework. The proposed model works in a pseudo distributed environment. A
Detailed description is given in following sub-sections.
The System architecture explains elaborately about each step as shown in Fig. 2.
It is carried out from importing libraries to predicting whether patients will suffer
from stroke or not. It explains the step by step process of the implementation
242 M. Rajora et al.
Attributes Description
ID Patients ID to avoid duplicity
Gender Gender of patient
Age Age of patient
Hypertension No hypertension-0 Suffering hypertension- I
Heart disease No heart disease-0 Suffering heartdisease-I
Marital status Married or not
Work-type Kind of Occupation person is involved
Residence area Lives in urban or rural area
Avg-Glucose Average glucose level of patient measured after meal
BMI Body Mass Index of patient
Smoking-status Patient smoking status
Stroke status No Stroke-0 Stroke suffered-I
of stroke prediction system architecture. The very first step includes importing
certain spark libraries. As we work with columnar data, we import Spark SQL
libraries. The first point to use spark SQL is an object called Spark Session. It
initializes the spark application and sets up the session where we can process
data. After the session is created we load our data into the session. As the data
is loaded into the session, we perform exploratory data analysis.
Making data
Evaluating Data
Models Ready for
Classifiers Preprocessing
spark Mllib
Exploratory Data Analysis. We analyse the uni-variate plots for six features
such as Age, Avg-Glucose, BMI, heart disease, hypertension and stroke. From
Fig. 3, we can observe that heart disease and hypertension happens to show
correlation with each other. The plot of average glucose is slightly bi-modal in
nature and not normal distribution as it seems to b e. The standard deviation
of average glucose is 43 with an average value of 140. The BMI plot is slightly
Stroke Prediction Using Machine Learning in a Distributed Environment 243
skewed to the right and age is not uniform distribution but comes out that stroke
is common around the age of 42 and a spike for old age and infants.
Next, we check the influence of Gender , BMI and Age on predicting the
stroke percentage. It comes out that age is an important factor to be considered
while evaluating the probability of stroke. In Fig. 4, it shows there is an abrupt
increase in the stroke rate above the age of 40 years and diminishing rate below
20 years. The plots also help us to conclude that BMI is a relevant feature as it
has less number of outliers and also high BMI value contributes to the greater
probability of stroke than people with lower BMI values. Gender is also another
factor, as per the data of 43400 people, 25665 are females and 17724 are males
with females having 2.89% and males having 3.49% chances of getting a stroke.
2000 5000
1000 2500
0 0
0 20 40 fl() 80 so 100 '
I.SO '
200 250 300
bm i heart disease
5000
0
20
hypertension
I
fl()
I
80
~11
100
I
I 0.0 0.2
I I
0.4
stroke
I
0.6
I
0.8
JLO
~ll 0.0
I
0.2
I
0.4 0.6
I I
0.8
J~ll
LO
60 Stroke
] 0
40 • 1
•
0 20 40 60 80 0 20 40 60 80
age age
80
~
60
40
•
. -
. . ,._....,,..~
·:.:
.
....J_.:°::.:ai
•
• •
20
0 20 40 60 80 0 20 40 60 80
age age
stroke
•• 0
•• •r.e• • • 1
• • I\;~••• • •
~..&.
• .• A • e-:r~,.-
• .
0 20 40 60 80 0 20 40 60 80
age age
The effect of Smoking status, BMI and age on stroke is also analysed in
Fig. 5. The smoking status is divided into four parts i.e. unknown, never smoked,
formally smoked and smokes. It reveals that smoking is a factor to be considered
when looked upon smokers versus non-smokers. Present smokers have less BMI
than people who never smoked or older in age. The risk of stroke is highest for
former smokers having age above 40.
Data Cleaning. After the analysis of data in the above section, there is a need
to remove the outliers, duplicate values and also fill the missing values. In the
dataset , it is observed that
Table 2 gives the statistical description of the dataset. It shows total count,
mean, standard deviation, minimum value, maximum value and quartile division.
These quartile are made on the values of features present in the dataset.
Data Preprocessing. As we have cleaned the data, with no outliers and miss-
ing values, now we proceed to eradicate the issue of imbalance. If a class is
imbalanced then the predicted output will always be biased. To address the
issue, SMOTE technique has been used. Instead of directly oversampling the
minority class, it first randomly selects a point from minority class and find the
nearest neighbour. Now, among the neighbours, a point is chosen and a synthetic
instance is created between these two. In this way, enumerable synthetic points
can be generated [3] and the class imbalance can be re-balanced. This is shown
in Figs. 6 and 7.
Preparing Data for PySpark to Process. After the data is processed, PyS-
park does not process the data in this form for implementation of machine learn-
ing algorithms. Machine learning algorithms cannot work solely with categorical
data, it has further requirements which are fulfilled through String Indexer,
One-hot encoder and Vector assembler. These are the steps for machine learning
pipeline [7] as shown in Fig. 8.
Machine Learning Models. For this stroke Prediction Model, we used five
ML models such as Naive Bayes, Logistic Regression, Decision Tree, Random
Forest, Gradient Boosting algorithms.
246 M. Rajora et al.
25000
20000
10000
5000
0
0
Cass
25000
20000
11!
ij 15000
10000
5000
0
dllss
Index
Categorical
Feature --- Encodeto
one hot
vector
Assemble to
a feature
vector
Train Model
In order to evaluate the classifier and find out the most robust and efficient of
all, we aim to maximize the Detection rate as
TP
DetectionRate(Recall) = TP + FN
To maximize Detection Rate term, we need to minimize FN i.e missed life. Also,
we aim to reduce the False alarm rate such as
FP
FalseAlarmRate =TN+ FP
To view the results, we use Receiver Operating Characteristic curve (ROC). Few
parameters for performance criteria are defined as below
TP+TN
Accuracy = T P + FP + FN +TN
. . TP
Precision = F p + T p
TP
Recall= TP + FN
0.975
0.950
0.925
~
0
u 0.900
VI
u
::,
<( 0.875
0.850
0.825
- r ainAUC
0.800 - i!stAUC
r ainAUC
0.975 i!st AUC
0.950
0.925
~
0 0.900
u
VI
u
::, 0.875
<(
0.850
0.825
0.800
0.775
0 200 400 600 800
n estimators
Figure 11 shows the performance of the model with varying the depth of the
tree. Deeper the tree, better it learns. But increasing too much depth increases
the overfitting. Hence the optimal value is 8 for this case. Figure 12 shows the
performance of the model with varying the number of samples required to split
an internal node. The value is varied from 1% to 30% and it is observed the
increasing the value results in underfitting of data. Hence, the optimal value
chosen is 0.04.
Stroke Prediction Using Machine Learning in a Distributed Environment 249
LOOO ra inAUC
estAUC
0.975
0.950
~
0.925
0
u
0.900
u"'
::,
<( 0.875
0.850
0.825
0.800
2 4 6 8 10 12 14
r ee depth
ra in AUC
0.900
0.895
~ 0.890
0
u
u"'
::,
<( 0.885
0.880
0.875
0.8
.;
:.
a::
(I/
-I; 0.6
·a;
fr.
(I/
~
F
~
>
..,
·;::;
-~
(I/
Vl
0.2
- Gaussian Naive Bayes ROC (area = 0.66)
- Decision Tree ROC (area = 0. 79)
- Logistic Regression ROC (area = 0.78)
- Random Forest ROC (area = 0.82)
- Grad ient Boosting ROC (a rea = 0.90)
O.Of - - - ~ - - - ~ - - - ~ - - ~ - - - - - i
0.0 0.2 0.4 0.6 0.8 lO
1-Specificity(False Positive Rate)
/
Receiver Operating Characteristic (ROC))
r--==:::::;;-=:::=:'
--- .:==========~71
lO
0.8
I /
/
/
,/,/
.;
:. /
a:: ,,,,'
(I/
Precision Recall Fl
Class 0 0.89 0.83 0.86
Class 1 0.84 0.90 0.87
Avg/Total 0.87 0.87 0.87
Precision Recall Fl
Class 0 0.93 0.96 0.95
Class 1 0.96 0.93 0.94
Avg/Total 0.95 0.95 0.94
for gradient boosting as described in Fig. 14. Figure 14 gives the insight about
the comparative performance before and after tuning parameters in gradient
boosting algorithm.
5 Conclusion
As the stroke disease is ranked fourth major cause of death in the category
of non-communicable diseases, it is the need of the hour to bring this stroke
prediction system which can predict the chances of whether the person can
suffer from a stroke or not. Based on the results and extensive analysis of the
data, preventive measures can be advised to patients to avoid the chances of
suffering from a stroke. The system is built in a distributed machine learning
environment using the Apache Spark framework. Among various classifiers, the
gradient boosting algorithm performed the best with ROC of 0.88 score before
tuning and 0.9449 score after tuning. The novel work contributed is training
gradient boosting algorithm and improvising the accuracy from 0.867 to 0.9449
and ROC score from 0.90 to 0.95 by optimizing various parameters of gradient
boosting algorithms. The future work will be to attain the state of art results
by training deep learning models for a larger dataset.
References
1. Bates, D.W., Saria, S., Ohno-Machado, L. , Shah, A., Escobar, G.: Big data in
health care: using analytics to identify and manage high-risk and high-cost patients.
Health Aff. 33(7) , 1123-1131 (2014)
2. Borthakur, D.: The Hadoop distributed file system: architecture and design.
Hadoop Proj. Website 11(2007) , 21 (2007)
3. Chawla, N.V., Bowyer, KW., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. lntell. Res. 16, 321- 357 (2002)
252 M. Rajora et al.
4. Chen, M., Hao, Y ., Hwang, K., Wang, L., Wang, L.: Disease prediction by machine
learning over big data from healthcare communities. IEEE Access 5, 8869- 8879
(2017)
5. Donaldson, M.S. , Corrigan, J .M., Kohn, L.T., et al.: To Err is Human: Building a
Safer Health System, vol. 6. National Academies Press, Washington, D.C. (2000)
6. Hafermehl, KT.: High spatial resolution diffusion-weighted imaging (DWI) of
ischemic stroke and transient ischemic attack (TIA) (2016)
7. Haihong, E., Zhou, K., Song, M. : Spark-based machine learning pipeline construc-
tion method. In: 2019 International Conference on Machine Learning and Data
Engineering (iCMLDE), pp. 1- 6. IEEE (2019)
8. Kansadub, T ., Thammaboosadee, S., Kiattisin, S., Jalayondeja, C .: Stroke risk
prediction model based on demographic data. In: 2015 8th Biomedical Engineering
International Conference (BMEiCON), pp. 1- 3. IEEE (2015)
9. Karau, H., Konwinski, A., Wendell, P ., Zaharia, M.: Learning Spark: Lightning-
Fast Big Data Analysis. O 'Reilly Media, Inc. , Sebastopol (2015)
10. Roger, V.L., et al.: Heart disease and stroke statistics- 2012 update: a report
from the American heart association. Circulation 125(1) , e2 (2012) . Writing Group
Members
11. Nwosu, C.S., Dev, S., Bhardwaj , P., Veeravalli, B. , John, D.: Predicting stroke
from electronic health records. In: 2019 41st Annual International Conference of
the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5704- 5707.
IEEE (2019)
12. World Health Organization, et al.: Global status report on noncommunicable dis-
eases 2014. No. WHO/ NMH/NVI/15.1. World Health Organization (2014)
13. Shanthi, D., Sahoo, G ., Saravanan, N.: Designing an artificial neural network model
for the prediction of thrombo-embolic stroke. Int . J. Biometric Bioinform. (IJBB)
3(1), 10- 18 (2009)
14. Singh, M.S., Choudhary, P., Thongam, K .: A comparative analysis for various
stroke prediction techniques. In: Nain, N., Vipparthi, S.K., Raman, B . (eds.) CVIP
2019. CCIS, vol. 1148, pp. 98- 106. Springer, Singapore (2020) . https://doi.org/10.
1007/978-981-15-4018-9_9
15. Apache Spark: Apache spark: lightning-fast cluster computing, pp. 2168-7161
(2016). http://spark.apache.org
16. Subha, P .P ., Geethakumari, S.M.P ., Athira, M., Nujum, Z.T .: Pattern and risk
factors of stroke in the young among stroke patients admitted in medical college
hospital, Thiruvananthapuram. Ann. Indian Acad. Neurol. 18(1) , 20 (2015)
17. White, T .: Hadoop: The Definitive Guide. O 'Reilly Media, Inc., Sebastopol (2012)
18. Zaharia, M ., et al.: Resilient distributed datasets: a fault-tolerant abstraction for
in-memory cluster computing. In: Presented as Part of the 9th {USENIX} Sym-
posium on Networked Systems Design and Implementation ( {NSDI} 2012), pp.
15-28 (2012)
Chock for
updates
1 Introduction
The most common cancer among Indian women is breast cancer. In 2018, more
than 1,60,000 new incidences of breast cancer were reported in India [5]. Ductal
carcinoma is a cancer originating from the cells that line the milk ducts in
breasts. More than 80% of all breast cancer cases are Ductal Carcinomas [4]. The
National Cancer Registry Programme by the Indian Council of Medical Research
estimates the rate of growth per decade of breast cancer in India lies within 15%
and 20% [14]. Additionally, breast cancer prognosis in India is far worse than
© Springer N a ture Switze rla nd AG 2021
D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 253-267, 2021.
https :/ / doi.org/10.1007 /978-3-030-65621-8_16