
Neural Processing Letters

https://doi.org/10.1007/s11063-020-10279-8

DRFS: Detecting Risk Factor of Stroke Disease from Social Media Using Machine Learning Techniques

S. Pradeepa1 · K. R. Manjula1 · S. Vimal2 · Mohammad S. Khan3 · Naveen Chilamkurti4 · Ashish Kr. Luhach5

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
In general, humans are said to be social animals. On the vast and ever-expanding internet,
it is difficult to find useful information about a medical illness. Pending more definitive
studies of a causal association between stroke risk and social networks, it would be useful
to help social media users detect the risk of stroke. In this work, a DRFS methodology is
proposed to identify the symptoms associated with stroke disease and its preventive
measures from social media content. We define an architecture that clusters tweets by
content using spectral clustering in an iterative fashion. Class labels are detected using
the words with the highest TF-IDF values. The clusters produced by spectral clustering are
given as input to the Probabilistic Neural Network (PNN) to obtain suitable class labels and
their probabilities. Frequent word sets are then mined from the groups of clusters, using
the support count measure, to identify the risk factors of stroke. We found that the
proposed approach is able to recognize new symptoms and causes that are not listed by the
World Health Organization (WHO), Mayo Clinic and National Health Survey (NHS). The outcomes
are observed to be precise and to reflect real statistics. Experiments of this type will
enable health organizations, doctors and government agencies to keep track of stroke
disease. The experimental results show the causes, preventive measures, and high and low
risk factors of stroke disease.

Keywords  Stroke disease symptoms identification · Social media content mining · Natural language processing · Spectral clustering · PNN (probabilistic neural network) · Association rule mining

1 Introduction

According to a survey conducted by the World Health Organization (WHO), stroke is the
second deadliest disease, contributing to about 10.8% of all deaths. The American Heart
Association states that stroke is the number 5 cause of death and a leading cause of
disability in the United States. Medical analysis is an interesting and fast-growing field
in healthcare.

* Mohammad S. Khan
adhoc.khan@gmail.com
Extended author information available on the last page of the article

Health informatics and clinical analytics depend heavily on information gathered from
diverse sources. Microblogging websites have evolved into a source of varied kinds of
information. Twitter is one such site; it allows its users to post short messages of up to
140 characters. The site brings together millions of people around the globe to express
their opinions on various domains. Medical information in such blogs discusses the
experiences and real-time incidents that a patient or a surgeon has gone through.
Various attempts have been made to extract useful information about medical care from
Twitter content [1]. A few works gather information from social media to study a disease
[2]. Using a model based on full posterior predictive distributions, one study was able to
identify dengue from social media during the peak dengue season [3].
Social media content can help prevent the spread of mosquito-borne diseases across
different places, and ways must be found to track these diseases in different areas with
limited resources [4]. The epidemic resilience of the Zika virus can be analyzed using
Twitter health communication. One such study investigated the influence of multimedia
tweets on retweetability during public health crises, in order to provide evidence for
facilitating the effective use of Twitter for health communication [5].
The challenges are:

• Nearly 80% of data growth comes from unstructured data. Remote devices, social media
and cell phones store information for later analysis. This explosion of digital data is
exponential, so managing this kind of data is critical.
• Handling noisy text in the Twitter data set. Different people use words differently, and
the methodology may still produce noisy or irrelevant data after the pre-processing
step. This affects the inference results, so the process is highly challenging.
• Extracting correct inferences with respect to real-time medical forums. After the
inferences are extracted by the methodology, they have to be verified against a
real-time medical forum to determine the accuracy of the algorithm.

To address the above-mentioned challenges, this paper proposes a framework for
detecting the risk factors of stroke disease.
The main contributions of this paper are:

• Data are extracted from the real-time Twitter API, so the data are unlikely to be duplicated.
• The NLTK tool is used for pre-processing to handle the unstructured data set. It removes
unwanted words from the text data set.
• Iterative execution of spectral clustering removes noisy and fake content from the
Twitter data set.
• Applying the Apriori algorithm makes it easy to find connections between words, from
which inferences can be extracted.
• The results can be mapped to real-time sources such as WHO, NHS and Google.

The DRFS methodology can be used for detecting the risk factors of stroke disease from
social media resources such as Twitter and Facebook. This architecture provides Twitter
content-based disease inspection and assists in identifying the causes and preventive
measures of stroke-related problems. The DRFS surveillance process model has been used for
detecting the spread of stroke disease in society.


The rest of the paper is organized as follows. Section 1 gives the introduction.
Section 2 reviews related work. Section 3 describes data acquisition and pre-processing of
the tweet data, applies iterative spectral clustering for document categorization,
identifies the labels of the clusters using the PNN classifier, and applies Apriori to
generate frequent word sets. Section 4 presents the experimental results and analysis, and
Sect. 5 presents the conclusion and future work.

2 Related Work

The integration of health issues in social media content is an important parameter for
detecting the risk factors of diseases in the real world [6]. Research on detecting medical
illness from different media sources has grown in volume because of the increasing
availability of real-time data [7]. Biomedical writing forums, web medical news and social
media platforms share information related to diseases [8, 9]. Natural Language Processing
(NLP) techniques provide a rich source of information for identifying diseases around the
world from different media sources. To improve public health inspection and interventions,
three principal systems that process event-based outbreak information have been examined:
EpiSPIDER, HealthMap, and Global Public Health Intelligence [10, 11].
Classification of brain MRI radiology reports into acute ischemic stroke (AIS) and non-AIS
phenotypes can be implemented using natural language processing (NLP) and machine
learning (ML) algorithms [12, 13].
PNN is customarily used in text classification to determine the best possible category of
a text, and it outperforms ML-KNN [14]. To the best of our knowledge, PNN together with
spectral clustering has not yet been applied to acquiring the causes and preventive
measures of a disease from social media data. Algorithms based on k-means clustering and
probabilistic neural networks have been developed for classifying industrial system faults
[15]. The spectral clustering technique is employed in SOM (self-organizing maps), a
powerful tool for knowledge discovery and cluster extraction [16].

3 DRFS Architecture

3.1 Overview

We propose the DRFS model, which identifies the risk factors of stroke disease from
social media. The steps involved are:

1. Collect Twitter data using the Twitter API.
2. Apply spectral clustering in an iterative fashion.
3. Detect the outliers and remove fake and unwanted tweets from the analysis.
4. Extract the essential features using Term Frequency and Inverse Document Frequency.
5. Apply the Probabilistic Neural Network to detect the labels of the clusters.
6. Group the cluster documents based on the label.
7. Generate frequent word sets using the Apriori algorithm.
8. Detect the risk factors of stroke from the frequent word sets.

Figure 1 portrays the proposed DRFS architecture.


Fig. 1  DRFS architecture

3.2 Data Gathering and Pre‑processing

This section describes the procedure for data acquisition and pre-processing. Twitter
data were fetched using the Twitter API [17]. The search keywords are the important
variables in gathering Twitter data. For effective observation, it is imperative to
identify the relevant tweets that reveal the risk factors of stroke disease along with its
causes and prevention. The essential keywords are {#stroke, #heart attack}, and 20,000
tweets were collected [18, 19]. For decision making [20], each and every word in a tweet
is important [21, 22]. Tweet messages contain noisy text, other languages, misspellings and
slang. Pre-processing of tweets is therefore essential for drawing proper inferences. To
deal with the noise in tweets, text processing methods such as tokenization, stop word
removal, stemming, lemmatization and frequency-based methods (TF-IDF calculation) have
been applied [23]. To pre-process the tweets with high quality, the smoothed TF-IDF measure
is calculated for each word.

3.2.1 Term Frequency

A text cannot be processed directly unless it can be expressed in proper numerical val-
ues. This can be done by considering the important words in the document [24]. Term
Frequency is the ratio of the frequency of a word t in a document d to the total number of
words in the document. TF is calculated by (1).
$$\mathrm{TF}(t, d) = \frac{f(t, d)}{\text{number of words in } d} \tag{1}$$
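
As a minimal sketch of Eq. (1), the following Python function computes the term frequency of every word in one tokenized tweet. The function name `term_frequency` and the sample tokens are illustrative assumptions and are not taken from the original implementation.

```python
from collections import Counter


def term_frequency(tokens):
    """Return TF(t, d) = f(t, d) / (number of words in d) for each term t."""
    counts = Counter(tokens)   # f(t, d): raw count of each term in the document
    total = len(tokens)        # number of words in d
    return {term: count / total for term, count in counts.items()}


# Hypothetical pre-processed tweet (illustrative tokens only).
tweet = ["stroke", "risk", "high", "blood", "pressure", "stroke"]
tf = term_frequency(tweet)
print(tf["stroke"])  # "stroke" occurs 2 times among 6 words
```

Because every count is divided by the same document length, the TF values of one document sum to one.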

13
DRFS: Detecting Risk Factor of Stroke Disease from Social Media…

3.2.2 Inverse Document Frequency (IDF)

IDF down-weights the words that are commonly used throughout the entire corpus and
increases the weight of the words that are rarely used in the collection of documents. If
the same word occurs frequently in many documents, it is suppressed by the IDF measure.
IDF thus captures the rareness of a word. The inverse document frequency is calculated
by (2).
$$\mathrm{IDF}(t) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing } t}}\right) \tag{2}$$
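
Equation (2) can be sketched in Python as follows. The function name, the guard for unseen terms and the three sample documents are assumptions made for illustration only.

```python
import math


def inverse_document_frequency(term, documents):
    """Return IDF(t) = ln(n_documents / n_documents containing t)."""
    n_documents = len(documents)
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        # Eq. (2) is undefined for a term that occurs in no document.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_documents / n_containing)


# Hypothetical corpus: each document is the set of words of one tweet.
docs = [{"stroke", "risk"}, {"stroke", "coffee"}, {"stroke", "vegan", "risk"}]
idf_stroke = inverse_document_frequency("stroke", docs)  # in every document
idf_coffee = inverse_document_frequency("coffee", docs)  # in one document only
```

A word that appears in every document receives an IDF of zero, while a word that appears in a single document receives the largest value.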

3.2.3 Term Frequency‑Inverse Document Frequency (TF‑IDF)

TF-IDF is a text mining technique employed to categorize documents. It computes a weight
that represents the importance of a term inside a document [25]. The TF-IDF value
identifies the important words in a document, and it is calculated using (3).
$$\text{TF-IDF} = \mathrm{TF}(\text{word}) \times \mathrm{IDF}(\text{word}) \tag{3}$$
This value gives the list of significant words in a document, excluding stop words.
According to [26], the smoothed TF-IDF value gives better results than the traditional
TF-IDF. We adopt the smoothed TF-IDF given by (4).
$$\text{TF-IDF}(t_k, d_j) = \frac{tf(t_k, d_j) \cdot \log\left(\frac{N}{n_k} + 0.01\right)}{\sqrt{\sum_{t_k \in d_j} \left[ tf(t_k, d_j) \cdot \log\left(\frac{N}{n_k} + 0.01\right) \right]^2}} \tag{4}$$

where $\text{TF-IDF}(t_k, d_j)$ is the smoothed TF-IDF value of a word term $t_k$ in a text $d_j$, $tf(t_k, d_j)$
is the term frequency value of a word $t_k$ in the text $d_j$, $N$ is the total number of tweets in a
text set, and $n_k$ is the number of tweets in which $t_k$ is included in the text set [27].
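
The smoothed, length-normalized weighting of Eq. (4) can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: raw counts are used for tf (the per-document normalization cancels any constant scaling, including the choice of logarithm base), and the function name and sample tweets are hypothetical.

```python
import math
from collections import Counter


def smoothed_tfidf(tokenized_tweets):
    """Return one {term: weight} dict per tweet, following Eq. (4)."""
    n_tweets = len(tokenized_tweets)                      # N
    doc_freq = Counter(term for tweet in tokenized_tweets
                       for term in set(tweet))            # n_k for each term
    vectors = []
    for tweet in tokenized_tweets:
        tf = Counter(tweet)
        # Numerator of Eq. (4): tf * log(N / n_k + 0.01)
        raw = {t: tf[t] * math.log(n_tweets / doc_freq[t] + 0.01) for t in tf}
        # Denominator of Eq. (4): Euclidean norm over the terms of this tweet
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical pre-processed tweets.
tweets = [["stroke", "risk"], ["stroke", "coffee"], ["stroke", "vegan"]]
vectors = smoothed_tfidf(tweets)
```

The 0.01 term keeps the logarithm strictly positive even for a word that occurs in every tweet, so such a word gets a small but non-zero weight.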

3.3 Spectral Clustering

Twitter data analysis is essential for recognizing hidden patterns. Hence, the first
experiment is to cluster the pre-processed tweet documents using spectral clustering in
order to identify the best feature words. The key consideration is that tweets are regarded
as arbitrary blends of words connected to the causes and prevention of stroke disease [28].
Spectral clustering is applied to the pre-processed data set to identify the outliers and
group the documents. The similarity between documents can be established efficiently using
the Jaccard coefficient [29]. The Jaccard index, also known as Intersection over Union or
the Jaccard similarity coefficient, is a statistic used for comparing the similarity and
diversity of sample sets. It is computed using Formula (5).
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{5}$$


J(A, B) is the similarity between any two documents A and B. The Jaccard distance
measures the dissimilarity between sample sets; it is complementary to the Jaccard
coefficient and is obtained by subtracting the Jaccard index from one (6).
$$d_J(A, B) = 1 - J(A, B) \tag{6}$$
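
Equations (5) and (6) translate directly into set operations. The sketch below is illustrative; the function names, the convention for two empty documents and the sample word sets are assumptions.

```python
def jaccard_index(a, b):
    """Return J(A, B) = |A intersect B| / |A union B| for two word collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty documents are identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Return d_J(A, B) = 1 - J(A, B)."""
    return 1.0 - jaccard_index(a, b)


# Hypothetical word sets of two tweets: 2 shared words out of 4 distinct words.
doc_a = {"stroke", "risk", "coffee"}
doc_b = {"stroke", "risk", "vegan"}
similarity = jaccard_index(doc_a, doc_b)
distance = jaccard_distance(doc_a, doc_b)
```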
Since similar documents have to be grouped, we use the Jaccard index to compute the
similarity between documents from the extracted TF-IDF values. The similarity matrix is
constructed from the Jaccard coefficient to group similar documents for further processing
[30]. Its values range from 0 to 1 and show the degree of similarity between any two
documents, where 0 indicates no similarity and 1 indicates maximal similarity.
Unlike algorithms that cluster data points directly in their native data space, graph-based
techniques such as spectral clustering use a similarity matrix whose (i, j)-th entry is a
similarity value defined using the similarity coefficient. Each entry describes the
relationship between only two vertices. For enormous social media content, however,
multiple relationships may exist between every such pair [31].
For spectral clustering, the similarity matrix has to be converted into a similarity graph
with a suitable threshold, following [32]. Each document is a vertex, and two vertices are
connected if the similarity value between the two documents is greater than the given
threshold. The similarity graph gives the overall relationship between the documents. A
dense graph indicates that most documents are similar, whereas a sparse graph indicates
that the documents are less similar [33].
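
The thresholding step described above can be sketched as follows. The similarity values and the threshold of 0.3 are hypothetical, since the threshold actually used is not stated in the text.

```python
def similarity_graph(similarity, threshold):
    """Return the 0/1 adjacency matrix of the thresholded similarity graph.

    Two distinct documents i and j are connected when
    similarity[i][j] is strictly greater than the threshold.
    """
    n = len(similarity)
    return [[1 if i != j and similarity[i][j] > threshold else 0
             for j in range(n)]
            for i in range(n)]


# Hypothetical Jaccard similarity matrix for three tweets.
sim = [[1.0, 0.5, 0.1],
       [0.5, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
adjacency = similarity_graph(sim, threshold=0.3)
```

In this example only the first two tweets are connected; the third tweet becomes an isolated vertex, which is the kind of document the later steps may treat as an outlier.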
Graphs can be described by the adjacency matrix (A) and the degree matrix (D). For an
undirected, unweighted graph without loops, the Laplacian matrix (L), also called the
admittance matrix, is given by $L = D - A$. It measures the extent to which a graph differs
at one vertex from its values at nearby vertices. Eigenvectors of adjacency matrices can be
used to find partitions [34]. The spectral method calculates the cuts in the graph, which
are termed edge separators. Modularity is used to measure the structure of a graph
[35]. A denser graph is said to have higher modularity and a sparser graph is said to
have lower modularity. It is calculated as $Q = X^{T} L X$, where X is a column vector of the
graph Laplacian [36, 37]. Using the normalized adjacency matrix is found to give better
results. The normalized adjacency matrix is computed as $A' = L^{1/2} A L^{1/2}$.
The k-means clustering algorithm is then applied to these features to obtain k classes from
the set of objects. It is applied repeatedly to further segregate the larger clusters at
each level until proper subsets are obtained.
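
To make the role of $L = D - A$ concrete, the sketch below builds the Laplacian of a small hypothetical graph and evaluates the quadratic form $X^{T} L X$ for an indicator vector X. For a 0/1 indicator vector this value equals the number of edges cut by the corresponding partition. The eigenvector and k-means steps are deliberately left out, and the graph and function names are assumptions made for illustration.

```python
def laplacian(adjacency):
    """Return L = D - A for an undirected, unweighted graph without loops."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]          # diagonal of D
    return [[(degree[i] if i == j else 0) - adjacency[i][j]
             for j in range(n)]
            for i in range(n)]


def quadratic_form(matrix, x):
    """Return x^T M x for a square matrix M and a vector x."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

L = laplacian(A)
cut_between_triangles = quadratic_form(L, [1, 1, 1, 0, 0, 0])
cut_single_vertex = quadratic_form(L, [1, 0, 0, 0, 0, 0])
```

Separating the two triangles cuts a single edge, whereas isolating vertex 0 cuts two, so the first partition is the better edge separator for this graph.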
Spectral clustering provides good accuracy for clustering documents [38]. The clusters
that are irrelevant to our analysis are removed as outliers. The smoothed TF-IDF from
Formula (4) is applied to each clustered document to gather a bag of words.
$$\begin{pmatrix} C_1 \rightarrow \{w_{01}, w_{02}, \ldots, w_{0n}\} \\ \vdots \\ C_n \rightarrow \{w_{n1}, w_{n2}, \ldots, w_{nn}\} \end{pmatrix}$$

$C_1, C_2, \ldots, C_n$ are the clusters obtained from spectral clustering, and
$w_{01}, w_{02}, \ldots, w_{nn}$ are the high TF-IDF words generated from the clusters using the smoothed
TF-IDF metric. $\{w_{1L}, w_{2L}, \ldots, w_{nL}\}$ are the important labels in the context of spectral
clustering, and they are compared with different sources {World Health Organization (WHO),
Mayo Clinic and National Health Survey (NHS)}. Effective fine-grained labels of relevant
tweets are coconut oil consumption, LDL, walking style, family history, eating habit, gum
diseases, HIV, vitamin E, inflammation, reproductive factor, etc. [39].


3.4 Apply PNN

The labels of the clusters are found using the Probabilistic Neural Network (PNN). PNN is
a feed-forward neural network developed by Specht [40]. It implements a statistical kernel
function. It consists of four layers (input layer, pattern layer, summation layer and
output layer), and all operations are organized into a multi-layered feed-forward network
[41]. The major advantages of this classifier include an inherently parallel structure and
a fast training process. It is guaranteed to converge to an optimal classifier as the size
of the representative training set increases. Training samples can be added or removed
without extensive retraining [42]. Figure 2 shows the probabilistic neural network
classifier.
Algorithm
Input: clusters generated from the spectral clustering
Output: Label of the clusters with highest probability

1. Tokenize the words and perform stemming for the training set and represent them as an
array of lists
2. Assign synaptic weights using random values generated
3. Compute the normalized value of the weighted sum of inputs using the sigmoid, which is
given by $S(x) = \frac{1}{1 + e^{-x}}$
4. Compute the confidence of the existing weight using $x(1 - x)$
5. Train the neural network by adjusting weights until the error value is decreased
6. Classify the input through the neural network to get the probability
7. Assign the label with the highest probability to the cluster.
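
For orientation, the four-layer structure mentioned earlier (input, pattern, summation and output layers) can be sketched with a Gaussian kernel, which is the usual Parzen-window formulation of a PNN. This is not the weight-adjustment procedure listed in the steps above, and it is not the authors' code: the kernel width `sigma`, the two-dimensional feature vectors and the label names are hypothetical.

```python
import math


def pnn_predict(x, training_set, sigma=0.5):
    """Classify x with a Parzen-window style PNN.

    training_set maps each class label to a list of feature vectors.
    Returns (best_label, probabilities per label).
    """
    scores = {}
    for label, patterns in training_set.items():
        # Pattern layer: one Gaussian kernel per stored training pattern.
        activations = [
            math.exp(-sum((xi - pi) ** 2 for xi, pi in zip(x, p))
                     / (2 * sigma ** 2))
            for p in patterns
        ]
        # Summation layer: average activation of the patterns of this class.
        scores[label] = sum(activations) / len(activations)
    total = sum(scores.values())
    if total == 0:
        # Numerical underflow: fall back to a uniform distribution.
        probabilities = {label: 1 / len(scores) for label in scores}
    else:
        probabilities = {label: s / total for label, s in scores.items()}
    # Output layer: pick the label with the highest probability.
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities


# Hypothetical 2-D feature vectors for two class labels.
training = {
    "coconut": [[1.0, 0.0], [0.9, 0.1]],
    "gum": [[0.0, 1.0], [0.1, 0.9]],
}
label, probs = pnn_predict([0.8, 0.2], training)
```

Adding or removing a training sample only changes the list of stored patterns, which is why this kind of classifier needs no extensive retraining.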

The tweets are grouped based on the labels assigned to the clusters. For the experiment,
we ran ten-fold cross-validation with the Naive Bayes classifier, SVM (Support Vector
Machine) and PNN. We extracted 1500 tweets as the training data set and 500 tweets as the
test data set, and calculated the precision and recall of these classifiers [43, 44]. In
the comparison, the PNN classifier was found to be better than the others [45]. It achieves
89.90% accuracy. Seed words are taken from the bag of words given by the clusters. Class
labels and feature words are generated from the collection of seeds [46].

Fig. 2  PNN classifier


3.5 Apply APRIORI

In this paper, we generate the frequent word sets using the Apriori algorithm. The
generated association rules provide the inferences between the word sets. The algorithm
has been used to derive feature sets from pre-classified text documents.

3.5.1 Association Rule

Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of m distinct attributes, also called literals. Let D be a
database in which each record (tuple) T has a unique identifier and contains a set of items
such that $T \subseteq I$. An association rule is an implication of the form $X \Rightarrow Y$, where
$X, Y \subset I$ are sets of items called itemsets, and $X \cap Y = \emptyset$. Here, X is called the
antecedent and Y the consequent [1].

3.5.2 Support

The support is the percentage of transactions in the database that contain both itemsets X
and Y. The support value of an association rule X → Y is
Support (X → Y) = Support (X ∪ Y) = P(X ∪ Y) (7)

3.5.3 Confidence

The confidence is the percentage of transactions in the database D with itemset X that
also contains the itemset Y. The confidence is calculated using the conditional probability
which is further expressed regarding itemset support. The equation for the confidence is
$$\text{Confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} \tag{8}$$

Here, Support (X ∪ Y) is the number of transactions containing both the itemsets X and
Y, Support (X) is the number of transactions containing the itemset X.

3.5.4 Lift

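Lift (or interest) measures how often X and Y occur together relative to how often they
would be expected to occur together if they were statistically independent of each other.
The lift of a rule X → Y is defined as:
$$\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}}{\text{Expected confidence}} = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)} \tag{9}$$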
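
The three measures in Eqs. (7) to (9) can be computed directly from a list of transactions, as in the sketch below. Support is expressed here as a fraction of all transactions; the function names and the four sample word sets are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(x, y, transactions):
    """Confidence(X -> Y) = Support(X u Y) / Support(X), as in Eq. (8)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Lift(X -> Y) = Confidence(X -> Y) / Support(Y), as in Eq. (9)."""
    return confidence(x, y, transactions) / support(y, transactions)


# Hypothetical transactions: the word set of four pre-processed tweets.
transactions = [
    {"coconut", "oil", "risk"},
    {"coconut", "oil"},
    {"vegan", "risk"},
    {"coconut", "risk"},
]
rule_support = support({"coconut", "risk"}, transactions)          # 2 of 4
rule_confidence = confidence({"coconut"}, {"risk"}, transactions)  # 0.5 / 0.75
rule_lift = lift({"coconut"}, {"risk"}, transactions)
```

A lift below one, as in this toy example, indicates that the two words co-occur less often than independence would suggest; a lift above one indicates a positive association.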

3.5.5 Algorithm

Association Rule Generation


Input: The transactions from the file (obtained after preprocessing) as an iterable
object.
Output: Frequent word set and association rule.

1. Define the following:

a. min_support
b. min_confidence
c. min_lift
d. max_length

2. Generate support records using the items in the transaction objects as per Eq. (7)
3. Generate ordered statistics, calculating the confidence and lift values for the
transactions using the results as per Eqs. (8) and (9)
4. Yield the results that satisfy the thresholds set for min_support, min_confidence,
min_lift and max_length. Generate relation records.
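
The frequent word set part of the steps above can be sketched as a level-wise Apriori search. The sketch covers only the min_support and max_length thresholds; filtering rules by min_confidence and min_lift is omitted. The function name and the sample transactions are assumptions and do not reproduce the authors' implementation.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support, max_length):
    """Return {itemset: support} for all frequent itemsets up to max_length."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current and k <= max_length:
        # Keep only the candidates that reach the minimum support.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k))}
        k += 1
    return frequent


# Hypothetical transactions built from pre-processed tweets.
transactions = [
    {"coconut", "oil", "risk"},
    {"coconut", "oil", "stroke"},
    {"coconut", "oil", "risk"},
    {"gum", "risk"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.5, max_length=3)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded without counting.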

From the groups of tweets, we generate the frequent word sets.

$$\begin{pmatrix} G_1 : \{fs_{11}, fs_{12}, \ldots, fs_{1n}\} \\ \vdots \\ G_n : \{fs_{n1}, fs_{n2}, \ldots, fs_{nn}\} \end{pmatrix}$$

$G_1, G_2, \ldots, G_n$ are the groups of clusters and $fs_{11}, fs_{12}, \ldots, fs_{nn}$ are the frequent
word sets from each group. Finally, the risk factors of stroke are detected from the
frequent word sets.

4 Experimental Analysis

4.1 Data Preprocessing

Around 2000 tweets (273 KB) were collected from Twitter using Twitter API programs in
Python. The tweets were collected using the following keys issued by the Twitter server:
access_token, access_token_secret, consumer_key and consumer_secret. The words of each
tweet were tokenized using the sent_tokenize and word_tokenize functions of the NLTK
(Natural Language ToolKit) library in Python. Stop words were also removed using the
stopword package [47]. We collected the data together with the tweet id and removed
duplicate data where tweets were generated from a common tweet id. Finally, we removed a
tweet document when the tweet contained fewer than three words. This type of pre-processing
improves the quality of the results. We found only 2758 tweets after pre-processing the
documents. We used Python with NLTK to pre-process the tweets and calculated the TF-IDF to
obtain the correct feature text in the documents.
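
The filtering rules described above (duplicate removal by tweet id, stop word removal and the three-word minimum) can be sketched as follows. To keep the example runnable without downloading NLTK data, a regular expression and a tiny stop word list stand in for NLTK's tokenizer and stop word corpus; both substitutes, the function name and the sample tweets are assumptions. The sketch also assumes that the three-word minimum is checked after stop word removal, which the text does not specify.

```python
import re

# Hypothetical, deliberately tiny stop word list (stand-in for NLTK's corpus).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "for", "can"}


def preprocess(tweets):
    """Clean (tweet_id, text) pairs and return (tweet_id, tokens) pairs."""
    seen_ids = set()
    cleaned = []
    for tweet_id, text in tweets:
        if tweet_id in seen_ids:      # drop duplicates sharing a tweet id
            continue
        seen_ids.add(tweet_id)
        # Simple tokenization: lower-case alphabetic words only.
        tokens = [w for w in re.findall(r"[a-z]+", text.lower())
                  if w not in STOPWORDS]
        if len(tokens) < 3:           # drop tweets with fewer than three words
            continue
        cleaned.append((tweet_id, tokens))
    return cleaned


# Hypothetical raw tweets.
raw = [
    (1, "Coconut oil can lower the risk of stroke"),
    (1, "Coconut oil can lower the risk of stroke"),
    (2, "Stroke risk"),
    (3, "High blood pressure is a risk"),
]
cleaned = preprocess(raw)
```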

4.2 Spectral Clustering

The Jaccard coefficient is used for computing the similarity between two documents. It
involves two steps. The first is to find the elements common to both sets, and the next is
to find the total elements in both. The ratio of the two gives the Jaccard index. The
elements are the words of the tweets or documents. These values are arranged in a
similarity matrix, which is shown in Fig. 3. It defines the similarity value between
every pair of documents or tweets.
This similarity matrix is converted into a graph, and the adjacency matrix of that graph
is given as input to spectral clustering. The large clusters at each level are clustered
further, and this process is repeated until refined clusters are obtained. After
pre-processing we have 2000 tweets, on which we perform spectral clustering [48]. The
number of clusters formed, i.e. K, plays an important role in the DRFS model. For the given
number of tweets we took K as 10. First we performed spectral clustering on the 2000
tweets; then each resulting cluster underwent spectral clustering. If a cluster holds more
than 30–40 tweets, it is clustered again in the next step (shown in blue in Fig. 4). This
produces a hierarchical clustering model, whose result can be shown as a tree structure
like that of Fig. 4. A cluster with fewer than 15 tweets is considered an outlier (shown in
red in Fig. 4). After the iterative process is complete, the clusters that hold an optimum
number of 30 to 40 tweets are sent to the PNN classifier. After the hierarchical spectral
clustering we have 41 clusters, for which the Apriori rules were generated (shown in green
in Fig. 4).
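
The size-based decisions of the iterative procedure can be summarized in a short sketch. The cut-offs follow the numbers quoted above (re-cluster above 40 tweets, outlier below 15); treating every size in between as an accepted cluster is an assumption, because the text does not say how clusters of 15 to 29 tweets are handled. The function and parameter names are hypothetical.

```python
def triage_clusters(cluster_sizes, split_above=40, outlier_below=15):
    """Decide what happens to each cluster at one level of the hierarchy."""
    decisions = {}
    for cluster_id, size in cluster_sizes.items():
        if size > split_above:
            decisions[cluster_id] = "recluster"   # too large: cluster again
        elif size < outlier_below:
            decisions[cluster_id] = "outlier"     # too small: discard
        else:
            decisions[cluster_id] = "keep"        # pass on to the PNN step
    return decisions


# Hypothetical cluster sizes after one round of spectral clustering.
decisions = triage_clusters({"c1": 120, "c2": 35, "c3": 9})
```

Applying this rule repeatedly to the clusters marked for re-clustering yields the tree structure described for Fig. 4.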

4.3 PNN Result

PNN works with multiple class labels. The required labels are identified, and a bag of
words for each label is given for training the PNN. The output is the posterior probability
of each cluster for every class label we assumed [49]. A sample of the output for the
41 clusters and the considered class labels is shown in the table below. The label of a
cluster is identified from the highest probability (Table 1).

4.4 Frequent Word Set

The tweets are grouped according to their labels, and the frequent word sets in each group
are found using the Apriori algorithm. The list in Table 2 is generated with 70% support
and 90% confidence.

Fig. 3  Similarity matrix


Fig. 4  Result of spectral clustering

The Apriori algorithm is applied to each group to generate the frequent word sets [20].
The word sets of each group are listed in Table 2.
In these frequent word sets, the labels considered include all the major factors. Further,
they are generalized into two categories, namely causes and prevention, using the Mayo
Clinic and WHO databases. The major causes and preventive measures are identified and
ranked by the percentage of tweets under each. The causes are shown in Fig. 5.
The graph shows that reproductive factors in women, such as miscarriage and early
menopause, cause stroke at a higher rate. The other factors depicted in the graph include
heredity, early puberty, heart disease, inflammation, blood pressure and oral health or
gum disease (Table 3).
The same procedure is repeated to find the preventive measures, which are shown in Fig. 6.
The graph indicates that coconut oil consumption is the best way to prevent stroke. The
other preventive measures include proper sleep, being vegetarian, vitamin E consumption,
early morning coffee, early detection and proper dental care.


Table 1  PNN results for each cluster
Cluster Coconut LDL and HDL Walking style Family history Eating Gum HIV Vitamin E Inflammation Inflammation Reproductive Others

1 0.029 0.002 0.00181 0.0028 0.0018 0.0061 0.021 0.0124 0.0071 0.014 0.0737 0.002
2 0.6173 0.004 0.00096 0.0005 0.0147 0.0006 0.001 0.0079 0.0003 0.0008 0.0004 0.0013
3 0.0026 0.023 0.00038 0.0009 0.0025 0.0013 0.007 0.003 0.0076 0.0024 0.0027 0.0043
4 0.0181 0.001 0.00066 0.0009 0.0018 0.0066 0.0152 0.0077 0.0067 0.0034 0.098 0.0072
5 0.0042 3E−04 0.00114 0.0016 0.002 0.0034 0.0009 0.0021 0.0017 0.0061 0.4467 0.004
6 0.0051 6E−04 0.00025 0.0005 0.0029 0.008 0.0039 0.0056 0.0036 0.0005 0.6058 0.0114
7 0 0.001 0.00055 0.0003 0.0057 0.0131 0.0006 0.0031 0.0009 0.0011 0.0267 0.977
8 0 0.001 0.00088 0.0004 0.0009 0.4185 0.8738 0.0005 0 0.0004 0.0009 0.0016
9 0 0 0.00039 0.0004 0.0002 0.0524 0.8792 0.0003 0.0527 0.0001 0.0001 0.0005
10 0.0015 0.001 0.0014 0.0011 0.0144 0.0035 0.0038 0.0009 0.0001 0.0006 0.0063 0.003
11 0.0035 0.009 0.00537 0.0048 0.0009 0.0011 0.0015 0.0036 0.003 0.0022 0.0005 0.2535
12 0.0035 0.009 0.00537 0.0048 0.0009 0.0011 0.0015 0.0036 0.003 0.0022 0.0005 0.2535
13 0.0063 0.004 0.00469 0.0052 0.0004 0.0981 0.0103 0.0007 0.0005 0.0004 0.0006 0.0773
14 0.0014 0.001 0.00124 0.0007 0.0015 0.0005 0.0017 0.9921 0.0012 0.001 0.0012 0.0062
15 0.0062 0.003 0.00626 0.0053 0.0084 0.0018 0.0106 0.0097 0.0128 0.0507 0.0159 0.0062
16 0.006 0.003 0.00027 0.0009 0.001 0.0012 0.0087 0.0017 0.0075 0.002 0.0073 0.0012
17 0 0 0.00112 0.0014 0.0043 0.0045 0.0006 0.0047 0.009 0.005 0.2134 0.0268
18 0.062 0.012 0.00062 0.0006 0.0024 0.0012 0.0018 0.0168 0.001 0.0014 0.0066 0.0036
19 0.0026 0.023 0.00038 0.0009 0.0025 0.0013 0.007 0.003 0.0076 0.0024 0.0027 0.0043
20 0.0294 0.002 0.00181 0.0028 0.0018 0.0061 0.021 0.0124 0.0071 0.014 0.0737 0.002
21 0.0051 0.005 0.00145 0.0012 0.0022 0.0032 0.0095 0.0029 0.0075 0.0024 0.0045 0.0134
22 0.0058 2E−04 0.00246 0.0037 0.0015 0.0024 0.001 0.0027 0.0023 0.0345 0.4292 0.0013
23 0.0042 3E−04 0.00114 0.0016 0.002 0.0034 0.0009 0.0021 0.0017 0.0061 0.4467 0.004
24 0.0051 0.003 0.00086 0.0002 0.8456 0.0117 0.0398 0.0043 0.0007 0 0.0015 0.006
25 0.0013 0.002 0.00132 0.0032 0.0036 0.0071 0.001 0.0024 0.0115 0.0088 0.0734 0.0216
26 0.0082 0.006 0.00437 0.0043 0.0022 0.0028 0.0141 0.0047 0.0094 0.01 0.0027 0.0037
27 0.0051 0.005 0.00145 0.0012 0.0022 0.0032 0.0095 0.0029 0.0075 0.0024 0.0045 0.0134
28 0.0005 0.002 0.00058 0.0007 0.0029 0.0043 0.0091 0.0028 0.0027 0.0013 0.0141 0.0061
29 0.0015 0.002 0.00051 0.0011 0.0021 0.95 0.002 0.0003 0.0002 0.0002 0.0033 0.0103
30 0.0117 0.016 0.00322 0.0032 0.0051 0.0061 0.0235 0.0205 0.0061 0.0187 0.0264 0.0069
31 0.0082 0.006 0.00437 0.0043 0.0022 0.0028 0.0141 0.0047 0.0094 0.01 0.0027 0.0037
32 0.0007 0.002 0.36791 0.3836 0.0014 0.0021 0.0006 0.0008 0.0011 0.0005 0.0003 0.0033
33 0.0051 0.005 0.00145 0.0012 0.0022 0.0032 0.0095 0.0029 0.0075 0.0024 0.0045 0.0134
34 0.0035 0.009 0.00537 0.0048 0.0009 0.0011 0.0015 0.0036 0.003 0.0022 0.0005 0.2535
35 0.0009 0.001 0.0003 0 0.0175 0.0123 0.0112 0.0009 0.0001 0.0001 0.0649 0.6734
36 0.0002 3E−04 0.0025 0.001 0.0652 0.0009 0.0054 0.0007 0 0.0006 0.0496 0.0594
37 0.0121 0.027 0.00478 0.009 0.0005 0.0003 0.0079 0.003 0.0018 0.0007 0.0003 0.0321
38 0.0061 0.014 0.00214 0.0018 0.0233 0.0007 0.0017 0.0373 0.0102 0.0008 0.0007 0.2284
39 0.0025 0.002 0.0007 0.0012 0.0021 0.2445 0.0015 0.001 0.0006 0.0004 0.0033 0.0119
40 0.0009 0.001 0.00023 0.0002 0.0032 0.2111 0.0086 0.0017 0 0.0001 0.0372 0.5424


Table 2  Frequent word set result given by the APRIORI algorithm


Label Frequent word set

Coconut oil consumption {Four weeks consume coconut oil lower risk}
{coconut oil help reduce stroke}
LDL {bad cholesterol higher risk}
Walking style {bad walking style high risk}
Family history {family history high risk}
Eating {Enjoy morning coffee low risk}
{consuming animal products high risk}
{vegan reduce risk}
{vegetables reduce risk}
{consume sodium increase risk}
{proper exercise reduce risk}
{long term alcohol risk}
{fruits reduce stroke risk}
{bulgur reduce risk}
Gum diseases {gum diseases high risk}
{Regular dental care reduce risk}
HIV {HIV patient high risk}
Vitamin E {vitamin E lower risk}
Inflammation {Inflammation lead stroke}
HDL {Good cholesterol lower risk}
Reproductive factor {Early puberty high risk}
{Reproductive factor women tied high risk}
{period before age 12 high risk}
{early menopause miscarriage high risk}
{obesity lead risk}
Heart disease {kill 2000 people per day}
{every 40 s death}
{high blood pressure high risk}
{kidney disease relate stroke}
{health check up aged 40–74 reduce risk}

[Bar chart, percentage of tweets per cause: reproductive factors 35.2, early puberty 13.8, gum disease 26.9, inflammation 3.2, heart disease 4.6, walking style 1.1, bp 15, heredity 0.4]

Fig. 5  Causes of stroke


Table 3  Categorization of high and low risk factor of stroke


Low risk                                  High risk

{Four weeks consume lower risk}           {Early puberty high risk}
{vegan reduce risk}                       {Reproductive factor women tied high risk}
{proper exercise reduce risk}             {period before age 12 high risk}
{fruits reduce stroke risk}               {early menopause miscarriage high risk}
{Enjoy morning coffee low risk}           {obesity lead risk}
{vegetables reduce risk}                  {high blood pressure high risk}
{bulgur reduce risk}                      {kidney diseases relate stroke}
{health check up aged 40–74 reduce risk}  {bad cholesterol higher risk}
{regular dental care reduce risk}         {bad walking style high risk}
{vitamin E lower risk}                    {family history high risk}
{good cholesterol lower risk}             {consuming animal products high risk}
                                          {consume sodium increase risk}
                                          {long term alcohol risk}
                                          {gum disease high risk}
                                          {HIV patient high risk}
                                          {Inflammation lead stroke}

[Bar chart, value per preventive measure: sleep 0.28, coconut oil 2.42, dental care 1.11, early detection 0.97, morning coffee 0.83, vitamin E 0.55, vegan 1.53, regular exercise 0.1]


Fig. 6  Preventive measures of stroke

5 Conclusion and Future Work

The DRFS methodology can also be useful for identifying the symptoms and preventive measures
of many other deadly diseases. Social media data analysed with the DRFS methodology present
exceptional challenges and opportunities for administrators in different domains. We clustered
the tweets based on their similarity, which produced the required class labels. From the
frequent item sets, we detected the risk factors of stroke. Finally, the method summarizes the
causes and preventive measures of stroke with their corresponding ratios. The methodology
could also support movie, mobile, product and document recommendation systems. The
architecture has its limits: it has been tested, and found suitable, only for short text
responses. The work can be extended to data collected by web scraping, which involves much
longer text. Further, the bag of words could be constructed more efficiently to obtain more
accurate results. We have three chief goals for future development: (1) Data from other
sources, such as articles, private blogs, forums and images containing text, can also be
considered for analysis. (2) We would also like to run the framework on a large-scale platform,
using big data analytics to process large amounts of data for better inferences. (3) We used a
combination of spectral clustering, PNN and the Apriori algorithm to examine the datasets; we
would like to try other combinations of machine learning algorithms to see whether they
perform better. Similar work in this domain will help to improve the standards of surveillance
around the world.
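
The frequent-word-set step summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sample tweets, the `min_support` value and the function name `frequent_word_sets` are hypothetical, and the sketch enumerates word sets by brute force instead of using the Apriori pruning named in the paper.

```python
from collections import Counter
from itertools import combinations


def frequent_word_sets(documents, min_support, max_size=3):
    """Return {word set: support count} for sets found in >= min_support documents."""
    counts = Counter()
    for tokens in documents:
        unique = sorted(set(tokens))  # a document counts each word set at most once
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical, already-preprocessed tweets (not data from the study).
tweets = [
    ["vegetables", "reduce", "risk"],
    ["fruits", "reduce", "stroke", "risk"],
    ["exercise", "reduce", "risk"],
    ["obesity", "lead", "risk"],
]

result = frequent_word_sets(tweets, min_support=3)
```

With these sample tweets and a minimum support count of 3, only {risk}, {reduce} and {reduce, risk} survive. Brute-force enumeration is only practical for short texts such as tweets; for larger vocabularies, Apriori-style candidate pruning avoids counting supersets of infrequent sets.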

Author contributions  SP is responsible for system implementation and algorithm selection. KRM is responsible for data collection and analysis. SV is responsible for writing the first draft of the manuscript and for algorithm verification. MSK is responsible for experimental verification and system design manipulation. NC is responsible for data processing. AKL is responsible for revising the manuscript and algorithm development.

Compliance with Ethical Standards 


Conflict of interest  The authors declare that they have no conflict of interest.

Research Involving Human and Animal Rights  This research does not involve any human or animal participation.

References
1. Yoon S, Elhadad N, Bakken S (2013) A practical approach for content mining of Tweets. Am J Prev
Med 45(1):122–129
2. Hilliard C (2012) Social media for healthcare: a content analysis of M.D. Anderson’s Facebook pres-
ence and its contribution to cancer support systems
3. Lowe R, Bailey TC, Stephenson DB, Graham RJ, Coelho CAS, Sá Carvalho M, Barcellos C (2011)
Spatio-temporal modelling of climate-sensitive disease risk: towards an early warning system for den-
gue in Brazil. Comput Geosci 37(3):371–381
4. Jain VK, Kumar S (2017) Effective surveillance and predictive mapping of mosquito-borne diseases
using social media. J Comput Sci 25:406–415
5. Yoon J, Hagen L, Andrews J, Scharf R, Keller T, Chung E. On the use of multimedia in Twitter health
communication: analysis of tweets regarding the Zika virus.
6. Roberts M, Callahan L, O’Leary C (2017) Social media: a path to health literacy. Inf Serv Use
37(2):177–187
7. Lampos V, Cristianini N (2010) Tracking the flu pandemic by monitoring the social web. In: 2nd IAPR
workshop on cognitive information processing (CIP 2010), IEEE Press, pp 411–416
8. Chunara R, Andrews JR, Brownstein JS (2012) Social and news media enable estimation of epidemio-
logical patterns early in the 2010 Haitian cholera outbreak. Am J Trop Med Hyg 86(1):39–45
9. Aramaki E, Maskawa S, Morita M (2011) Twitter catches the u: detecting influenza epidemics using
Twitter. In: Proceedings of the conference on empirical methods in natural language processing, asso-
ciation for computational linguistics, pp 1568–1576
10. Jain VK, Kumar S (2015) An effective approach to track levels of influenza-A (H1N1) pandemic in
India using twitter. Procedia Comput Sci 70(1):801–807
11. Keller M, Blench M, Tolentino H, Freifeld CC, Mandl KD, Mawudeku A et al (2009) Use of unstruc-
tured event-based reports for global infectious disease surveillance. Emerg Infect Dis 15(5):689
12. Acharya UR, Mookiah MR, Vinitha Sree S, Afonso D, Sanches J, Shafique S, Nicolaides A, Pedro
LM, Fernandes E, Fernandes J, Suri JS (2013) Atherosclerotic plaque tissue characterization in 2D
ultrasound longitudinal carotid scans for automated classification: a paradigm for stroke risk assess-
ment. Med Biol Eng Comput 51(5):513–523. https​://doi.org/10.1007/s1151​7-012-1019-0
13. Kim C, Zhu V, Obeid J, Lenert L (2019) Natural language processing and machine learning algorithm
to identify brain MRI reports with acute ischemic stroke. PLoS ONE 14(2):e0212778

13
DRFS: Detecting Risk Factor of Stroke Disease from Social Media…

14. Oliveira E, Ciarelli PM, Goncalves C (2008) A comparison between a kNN based approach and a PNN
algorithm for a multi-label classification problem
15. Wu D, Yang Q, Tian F, Zhang DX (2010) Fault diagnosis based on K-means clustering and PNN.
In: Third international conference on intelligent networks and intelligent systems
16. Amini L, Azarpazhouh R, Farzadfar MT, Mousavi SA, Jazaieri F, Khorvash F, Norouzi R, Toghian-
far N (2013) Prediction and control of stroke by data mining. Int J Prev Med 4(Suppl 2):S245
17. Twitter Developer Page. https​://dev.twitt​er.com/docs/. Accessed 1 Jan 2015
18. Papadopoulos H, Kyriacou E, Nicolaides A (2016) Unbiased confidence measures for stroke risk
estimation based on ultrasound carotid image analysis. Neural Comput Appl 28(6):1209–1223.
https​://doi.org/10.1007/s0052​1-016-2590-3
19. Zhang Y, Song W, Li S, Fu L, Li S (2018) Risk detection of stroke using a feature selection and
classification method. IEEE Access 6:31899–31907. https​://doi.org/10.1109/ACCES​S.2018.28334​
42
20. Flueckiger P, Longstreth W, Herrington D, Yeboah J (2018) Revised framingham stroke risk score,
nontraditional risk markers, and incident stroke in a multiethnic cohort. Stroke 49(2):363–369.
https​://doi.org/10.1161/STROK​EAHA.117.01892​8
21. Yang X, Li S, Zhao X, Liu L, Jiang Y, Li Z, Wang Y, Wang Y (2017) Atrial fibrillation is not
uncommon among patients with ischemic stroke and transient ischemic stroke in China. BMC Neu-
rol 17(1):207. https​://doi.org/10.1186/s1288​3-017-0987-y
22. Chang CS, Su SL, Kuo CL, Huang CS, Tseng WM, Lin S, Liu CS (2018) Cyclophilin A: a predic-
tive biomarker of carotid stenosis in cerebral ischemic stroke. Curr Neurovasc Res 15(2):111–119.
https​://doi.org/10.2174/15672​02615​66618​05161​20959​
23. Bao Y, Quan C, Wang L, Ren F (2014) The role of pre-processing in Twitter sentiment analysis. In:
ICIC, Taiyuan, China, pp 615–624
24. Zhang X, Attia J, D’Este C, Yu X, Wu X (2005) A risk score predicted coronary heart disease
and stroke in a Chinese cohort. J Clin Epidemiol 58(9):951–958. https​://doi.org/10.1016/j.jclin​
epi.2005.01.013
25. Cho S, Ku J, Cho YK, Kim IY, Kang YJ, Jang DP, Kim SI (2014) Development of virtual reality pro-
prioceptive rehabilitation system for stroke patients. Comput Methods Prog Biomed 113(1):258–265.
https​://doi.org/10.1016/j.cmpb.2013.09.006
26. Haijiao L, Xiaohua K (2015) Study of automated essay scoring based on small dataset extraction
algorithm. In: 2015 4th international conference on computer science and network technology
(ICCSNT)
27. Saba L, Banchhor SK, Londhe N, Araki T, Laird J, Gupta A, Nicolaides A, Suri JS (2017) Web-based
accurate measurements of carotid lumen diameter and stenosis severity: an ultrasound-based clinical
tool for stroke risk assessment during multicenter clinical trials. Comput Biol Med 91:306–317. https​
://doi.org/10.1016/j.compb​iomed​.2017.10.022
28. Park E, Chang HJ, Nam HS (2017) Use of machine learning classifiers and sensor data to detect
neurological deficit in stroke patients. J Med Internet Res 19(4):e120. https​://doi.org/10.2196/
jmir.7092
29. Niwattanakul S, Singthongchai J, Naenudorn E, Wanapu S (2013) Using of Jaccard coefficient for
keyword similarity. In: Proceedings of the international multi conference of engineers and com-
puter scientists (IMECS), vol I, Hong Kong, Mar 13–15, 2013
30. Di Iorio B, Di Micco L, Torraca S, Sirico M, Guastaferro P, Chiuchiolo L, Nigro F, De Blasio A,
Romano P, Pota A, Rubino R, Morrone L, Lopez T, Casino FG (2013) Variability of blood pres-
sure in dialysis patients: a new marker of cardiovascular risk. J Nephrol 26(1):173–182. https​://doi.
org/10.5301/jn.50001​08
31. Ab Malik N, Mohamad Yatim S, Lam OL, Jin L, McGrath CP (2017) Effectiveness of a web-based
health education program to promote oral hygiene care among stroke survivors: randomized con-
trolled trial. J Med Internet Res 19(3):e87. https​://doi.org/10.2196/jmir.7024
32. von Luxburg U (2007) A tutorial on spectral clustering. Springer, Berlin
33. Vimal S, Kalaivani L, Kaliappan M, Suresh A, Gao X-Z, Varatharajan R (2018) Development of
secured data transmission using machine learning based discrete time partial observed markov
model and energy optimization in Cognitive radio networks. Neural Comput Appl. https​://doi.
org/10.1007/s0052​1-018-3788-3
34. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Transn Pattern Anal Mach
Intell 22(8):888
35. Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA
103(23):8577–8582. https​://doi.org/10.1073/pnas.06016​02103​12

13
S. Pradeepa et al.

36. Vimal S, Kalaivani L, Kaliappan M (2017) Collaborative approach on mitigating spectrum sensing
data hijack attack and dynamic spectrum allocation based on CASG modeling in wireless cognitive
radio networks. Clust Comput. https​://doi.org/10.1007/s1058​6-017-1092-0
37. Mariappan E, Kaliappan M, Vimal S (2016) Energy efficient routing protocol using Grover’s
searching algorithm for MANET. Asian J Inf Technol 15:4986–4994
38. Ilango SS, Vimal S, Kaliappan M et al (2018) Optimization using Artificial Bee Colony based cluster-
ing approach for big data. Clust Comput. https​://doi.org/10.1007/s1058​6-017-1571-3
39. Kannan N, Sivasubramanian S, Kaliappan M, Vimal S, Suresh A (2018) Predictive big data analytic
on demonetization data using support vector machine. Clust Comput. https​://doi.org/10.1007/s1058​
6-018-2384-8
40. Specht DF (1990) Probabilistic neural networks. Lockheed Missiles & Space Company, Inc.
41. Geetha R, Sivasubramanian S, Kaliappan M, Vimal S, Annamalai S (2019) Cervical cancer identifica-
tion with synthetic minority oversampling technique and PCA analysis using random forest classifier. J
Med Syst 43(9):286
42. Ibrahiem M, El Emary M, Ramakrishnan S (2008) On the application of various probabilistic neural
networks in solving different pattern classification problems
43. O’Brien MK, Shawen N, Mummidisetty CK, Kaur S, Bo X, Poellabauer C, Kording K, Jayaraman A
(2017) Activity recognition for persons with stroke using mobile phone technology: toward improved
performance in a home setting. J Med Internet Res 19(5):e184. https​://doi.org/10.2196/jmir.7385
44. Breccia M, Molica M, Zacheo I, Serrao A, Alimena G (2015) Application of systematic coronary risk
evaluation chart to identify chronic myeloid leukemia patients at risk of cardiovascular diseases during
nilotinib treatment. Ann Hematol 94(3):393–397. https​://doi.org/10.1007/s0027​7-014-2231-9
45. Annamalai S, Udendhran R, Vimal S (2019) Cloud-based predictive maintenance and machine moni-
toring for intelligent manufacturing for automobile industry. In: Novel practices and trends in grid and
cloud computing. IGI Global, pp 74–89
46. Annamalai S, Udendhran R, Vimal S (2019) An intelligent grid network based on cloud computing
infrastructures. In: Novel practices and trends in grid and cloud computing. IGI Global, pp 59–73
47. Tunstall-Pedoe H (2011) Cardiovascular risk and risk scores: ASSIGN, Framingham, QRISK and oth-
ers: how to choose. Heart 97(6):442–444
48. Hu T, Yao L, Reynolds K, Whelton PK, Niu TH, Li SX, He J, Bazzano L (2015) The effects of a low-
carbohydrate diet vs a low-fat diet on novel cardiovascular risk factors: a randomized controlled trial.
Nutrients 7(9):7978–7994. https​://doi.org/10.3390/nu709​5377
49. Pencina M, D’Agostino RB, Larson MG, Massaro JM, Vasan RS (2009) Predicting the 30-year risk
of cardiovascular disease: the framingham heart study. Circulation 119(24):3078–3084. https​://doi.
org/10.1161/CIRCU​LATIO​NAHA.108.81669​4

Publisher’s Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Affiliations

S. Pradeepa1 · K. R. Manjula1 · S. Vimal2 · Mohammad S. Khan3 · Naveen Chilamkurti4 · Ashish Kr. Luhach5

S. Pradeepa
pradeepa@it.sastra.edu
K. R. Manjula
manjula@cse.sastra.edu
S. Vimal
svimalphd@gmail.com
Naveen Chilamkurti
N.Chilamkurti@latrobe.edu.au
Ashish Kr. Luhach
ashishluhach@acm.org


1 School of Computing, Sastra Deemed University, Thanjavur, India
2 National Engineering College, Kovilpatti, India
3 East Tennessee State University, Johnson City, USA
4 La Trobe University, Melbourne, Australia
5 The PNG University of Technology, Lae, Papua New Guinea
