Real Time Zika Virus Detection System With Unknown Symptoms and Visualization

REAL TIME ZIKA VIRUS DETECTION SYSTEM
WITH UNKNOWN SYMPTOMS AND
VISUALIZATION
By
SRINAGAVALLI NANDIGAM
Bachelor of Technology in Information Technology
Hindustan College of Engineering
CHENNAI, INDIA
2007 - 2011
Submitted to the Faculty of the

Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
December, 2016

ProQuest Number: 10250145

Published by ProQuest LLC (2018). Copyright of the Dissertation is held by the Author.

REAL TIME ZIKA VIRUS DETECTION SYSTEM
WITH UNKNOWN SYMPTOMS AND
VISUALIZATION
Thesis Approved:
Dr. Johnson P Thomas
Thesis Adviser
Dr. David Cline
Dr. Ronak Etemadpour
ACKNOWLEDGEMENTS
I would first like to thank my advisor Dr. Johnson P Thomas of the Computer Science
Department at Oklahoma State University. He has guided me during the research with his
thoughtful insights and his careful supervision. I would like to thank my committee members Dr.
David Cline and Dr. Ronak Etemadpour for their involvement in the research.
I would like to thank my parents, Markandeya Swami Nandigam and Lakshmi Susila Nandigam, and
my brother, Balakrishna Nandigam, for giving me unfailing moral support to complete the
research.
I would also like to thank my friend Durga Amruth Sagar for helping me in my research. I would
like to thank my friends for all their support and encouragement.
Acknowledgements reflect the views of the author and are not endorsed by committee
members or Oklahoma State University.
Name: SRINAGAVALLI NANDIGAM
Date of Degree: DECEMBER, 2016
Title of Study: REAL TIME ZIKA VIRUS DETECTION SYSTEM: UNKNOWN SYMPTOMS AND VISUALIZATION
Major Field: COMPUTER SCIENCE
Abstract:
Zika is an infectious disease, and there is a need to detect it as soon as possible. The
advent of social media provides an opportunity to detect Zika even before a doctor visit.
In this research, we use Twitter tweets to detect Zika. A real-time Zika virus detection
system using neural networks has been developed in this work. We use two different
neural networks, namely CC4 and the Multi-Layer Perceptron (MLP). The CC4 neural network
helps detect Zika cases that involve previously unknown symptoms, and the MLP neural
network helps detect Zika accurately from known symptoms. The outputs from these two
neural networks are used in the classification of Zika. Apache Spark is used for real-time
analysis of Twitter data. Once the virus has been detected, the information is useful
only if it is presented in a form that healthcare providers and others can benefit
from. We developed three different models, namely Geographical, Text, and Temporal, to
visualize the data. Our results show that the Zika virus can be detected with 83%
accuracy using Twitter data.
TABLE OF CONTENTS
I. INTRODUCTION
1.1 Problem Statement
1.2 Problems in Earlier Works
1.3 Proposed Solution
II. REVIEW OF LITERATURE
2.1 Related Work
2.2 Artificial Neural Networks (ANN)
2.3 Neural Networks
2.4 The Artificial Neuron
2.5 CC4 Neural Network
2.6 Apache Spark
2.6.1 Terminology
2.7 Tableau
III. METHODOLOGY
3.1 Approach
3.1.1 Data Collection
3.1.2 Data Preprocessing
3.1.3 Training
3.1.4 Data Modeling
3.2 Dataset
3.2.1 Pre-processing Twitter Data
3.2.2 Validation of Keywords extracted using Twitter by CDC
3.3 Implementation
3.3.1 Pre-processing of keywords as an input for Neural Networks
3.3.2 Conversion into Decimal Format
3.3.3 Conversion into Floating Point
3.3.4 Training and Testing Dataset
3.4 Implementation of CC4 in Apache Spark
3.5 Post-Processing Unit
3.5.1 Known Symptoms
3.5.2 Unknown Symptoms
3.6 Data Modeling
3.6.1 Geographical Extent
3.6.2 Text Model
3.6.3 Temporal Extent
IV. FINDINGS
4.1 Accuracy of Best Percentage Match before Training Unknown Symptoms
4.2 Accuracy of Radius of Generalization
4.3 False Positives and False Negatives
4.4 Visualization
4.4.1 Geographical Extent Analysis
4.4.2 Text Model Analysis
4.4.3 Temporal Extent Analysis
V. CONCLUSION
REFERENCES
LIST OF TABLES
Table
3.1 Symptoms of Zika recorded by CDC
3.2 Few Examples of Valid Data with keywords extracted
3.3 Quantization Mapping Table
3.4 Known Symptoms
LIST OF FIGURES
Figure
2.1 An Artificial Neuron
2.2 General CC4 Network Architecture
2.3 A Simple Spark Topology
3.1 Proposed Architecture
3.2 Spark Architecture
3.3 Best Percentage Graph for CC4
4.1 Best Match Graph of CC4 Neural Network before Training
4.2 Best Match Graph of CC4 Neural Network after Training
4.3 Accuracy of CC4 for different ROG
4.4 False Positives and False Negatives for MLP
4.5 False Positives and False Negatives for CC4
4.6 Geographical Extent of Zika worldwide
4.7 Geographical Extent of Zika in Brazil
4.8 Geographical Extent of Zika in Colombia
4.9 Text Analysis of Zika
4.10 Major Symptom affected – Country
4.11 Zika Timeline Activity
CHAPTER 1
INTRODUCTION
The Zika virus was discovered on the African continent in 1947 and is carried by
monkeys. Since then it has been found that Aedes mosquitoes also carry the virus. The virus
also spreads through sexual contact and blood transfusion, and an infected pregnant woman
can pass it to her fetus, causing microcephaly in newborn babies.
Microcephaly is a birth defect in which a baby's head is smaller than that of babies of the
same sex and age [1]. In many cases, the babies have smaller brains that have not developed
properly. Microcephaly also causes other problems, such as vision problems, hearing loss,
impaired growth, and difficulties with learning and problem solving. The major symptoms of
Zika virus include fever, skin rash, headache, joint pain, conjunctivitis, and muscle pain.
A large number of people were affected in Brazil in 2015, and a few cases were reported in
other countries in 2016. Predicting the potential spread of the disease is therefore very
important for timely health care intervention. In this work, we collect and analyze data
from social media to predict the spread of the Zika virus.
The main goal of this work is to use social media to detect Zika through real-time analysis
of Zika-related tweets. Once Zika has been detected, we visualize different attributes
of the extent of its spread.
1.1 Problem Statement
Infectious diseases like Zika should be detected as early as possible; we cannot wait
until a patient has seen a doctor. We gather information on the spreading virus through
social media (Twitter tweets). Two types of data are considered: first, symptoms that are
known to be related to Zika; second, symptoms that may indicate Zika but are not yet
known. Once Zika has been detected, we visualize multiple dimensions of the spread of the
disease, such as time, location, and text.
1.2 Problems in Earlier Works
Earlier works have several problems. Existing work [2] on detecting the spread of the Zika
virus applies statistical analysis to stored datasets; no real-time analysis has been done
so far. The previous work is not scalable, because using social media data together with
stored health data requires a big data approach. The different dimensions of Zika spread
have also not been considered in previous works.
1.3 Proposed Solution
We use social media and other data to detect Zika, which requires real-time streaming and
analysis of big data. Twitter data is streamed using Apache Flume, and we developed a
real-time Zika detection system using Apache Spark. Apache Spark is a distributed,
fault-tolerant, real-time stream processor for big data.
• A Multi-Layer Perceptron (MLP) neural network is used to detect Zika from known
symptoms. It requires a lot of training and needs reliable symptom data. The MLP
neural network uses existing knowledge to detect Zika.
• The CC4 neural network provides an instantaneous response. It requires minimal
training and detects the Zika virus even if some of the symptoms are not known.
• However, the accuracy of the CC4 neural network is lower than that of the MLP neural
network.
• Outputs from these two neural networks (CC4 and MLP) are used to classify Zika
based on the known and unknown symptoms.
To provide Zika information in multiple dimensions to health care providers, data is shown
visually based on:
• Location, which provides the geographical extent of the spread.
• Text, which provides a textual description of keywords to track the frequency of
words in the tweet texts.
• Date, which provides the temporal extent of the spread.
The rest of the document is divided into four chapters. Chapter 2 includes a review of
literature and a description of neural networks and Apache Spark. Chapter 3 describes the
proposed architecture, the pre-processing of data, and the methodology of the Zika virus
detection system. Chapter 4 presents the results of simulations. Chapter 5 concludes the
thesis with suggestions for future work.
CHAPTER 2
REVIEW OF LITERATURE
2.1 Related Work
With real-time streaming of virus infection data, machine learning can be used to detect a
virus, in our case Zika. Existing techniques for detecting the spread of the Zika virus are
based on statistical analysis. Our goal is to apply machine learning to social media data,
since social media provides the first clues to potential diseases.
In 2016, the New England Journal of Medicine published a report on the Zika virus in
Colombia. The authors used the national population-based surveillance system to assess
patients with symptoms of the Zika virus during August 2015-2016. They also evaluated the
test reports of infected pregnant women for microcephaly [2]. The research focuses primarily
on pregnant women and microcephaly. Data is extracted independently based on study designs,
countries, key findings, symptoms, childbirth, and pregnancy [3].
In 2015, Dan Xiao and Dongli Li published an open-access article on predicting epidemic
trends and evaluating interventions for Ebola virus disease in 2014-2015 [4]. The spread of
the Ebola virus is evaluated from the periodic variation of the disease, using differential
equations in a susceptible-infective-removed (SIR) model [4]. To predict the transmission
patterns of Ebola, they constructed a compartment model. The numbers of reported Ebola
cases and deaths are compared using the data provided by the World Health
Organization. These models suggest that early detection and diagnosis are required to
control major outbreaks of Ebola virus disease [4].
In 2013, Cory W. Morin, Andrew C. Comrie, and Kacey Ernst published a paper on how the
dengue virus has spread widely and affected millions of people [5]. Mosquitoes of the Aedes
genus transmit the dengue virus. Analysis has shown that nearly 400 million cases may occur
per year [18]. The researchers hypothesized relationships among Aedes mosquitoes,
weather, climate, and dengue. They derived these relationships from laboratory results and
statistical analysis [18]. The results were generated by analyzing the relationship between
climate and dengue transmission, together with laboratory results and field studies on the
vector and the dengue virus. They performed predictive analysis based on climate and
weather data [18].
In 2013, Kathy Lee, Ankit Agrawal, and Alok Choudhary published a paper on real-time digital
flu surveillance in the United States using data from Twitter. They built a novel
surveillance system that uses Twitter data to track flu and cancer activities in real time
[6]. They present results visually as US disease surveillance maps, along with
distributions and timelines of disease types, symptoms, and treatments [6].
2.2 Artificial Neural Networks (ANN)
An artificial neural network [8] is a computational model based on the structure of
biological neural networks. A neural network learns from its inputs and outputs, and
functions like the brain. It is composed of a large number of interconnected processing
neurons that work together to solve specific problems. An artificial neural network is
configured for a specific application, such as pattern recognition or data classification,
through a learning process. There are different types of neural networks, but learning is
done in two ways: supervised learning and unsupervised learning.
In supervised learning, the neural network is provided with both input and output datasets
during training so that it learns to produce the desired outputs. In unsupervised learning,
the network learns from the characteristics of the input alone, i.e., the output is not
known.
2.3 Neural Networks
Neural networks are widely used in pattern recognition because of their ability to
generalize and to respond to unexpected inputs or patterns. A neural network usually has
three layers: an input layer, a hidden layer, and an output layer. The input layer is
connected to the hidden layer, and the hidden layer is connected to the output layer. Some
neural networks, such as the perceptron, have no hidden layer.
Neural networks learn over time through training. During training, neurons are taught to
recognize specific patterns and whether or not to fire when such a pattern is received.
2.4 The Artificial Neuron
An artificial neuron [7] is a mathematical function conceived as a model of a biological
neuron. Artificial neurons are the constitutive units of an artificial neural network. The
artificial neuron receives one or more inputs (representing dendrites) and sums them to
produce an output (representing a neuron's axon) [22]. A biological neuron has three
important components: dendrites, soma, and axon.
Dendrites receive signals from other neurons. The signals are electric impulses that are
transmitted across a synaptic gap. The soma sums the incoming signals, i.e., the input
signals multiplied by the weights. When the sum is greater than or equal to a threshold
value, the cell fires and produces the output [23].
Figure 2.1: An Artificial Neuron
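The weighted-sum-and-threshold behavior described above can be illustrated with a minimal Java sketch. The class name `ThresholdNeuron`, the method `fire`, and the example weights and threshold are hypothetical and chosen only for illustration; they do not come from the thesis implementation.

```java
// Illustrative sketch of a single threshold neuron; all names are hypothetical.
class ThresholdNeuron {
    // Returns 1 (the neuron fires) when the weighted sum of the inputs
    // is greater than or equal to the threshold, and 0 otherwise.
    static int fire(double[] inputs, double[] weights, double threshold) {
        if (inputs.length != weights.length) {
            throw new IllegalArgumentException("inputs and weights must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sum >= threshold ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 1.0};
        // With both weights 1 and threshold 2, the neuron fires only
        // when both inputs are 1 (a logical AND).
        System.out.println(fire(new double[] {1, 1}, weights, 2.0));
        System.out.println(fire(new double[] {1, 0}, weights, 2.0));
    }
}
```

The threshold plays the role of the firing condition in the soma: the inputs are scaled by the weights, summed, and compared against it.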
2.5 CC4 Neural Network
The CC4 neural network is an instantaneously trained neural network proposed by Kak [10]
[11]. CC4 is a feed-forward neural network. It is designed for fast learning, because
biological neurons produce instantaneous results. It has three layers:
• Input layer
• Hidden layer
• Output layer
The input layer takes its input in unary format, so all inputs are converted into unary. In
addition to the input data, there is a bias neuron whose input is set to 1. Weights are
assigned from the input layer to the hidden layer, and the neurons in the input layer and
hidden layer are fully connected. Each neuron in the hidden layer corresponds to a single
training sample in the training dataset. The output layer provides the output of the
network. Like the input and hidden layers, the hidden layer and output layer are fully
connected [12]. The general architecture of the CC4 neural network is shown in Figure 2.2
[12] [13]:
Figure 2.2: General CC4 Network Architecture
CC4 uses a concept known as the radius of generalization, which helps classify input
vectors based on the classes of the stored vectors. If the Hamming distance between a new
input vector and any of the stored vectors is less than or equal to the user-specified
radius, the outputs of all such stored vectors are considered when generating the output
for the input vector. For every bit position of the output vector, the 1s and 0s across
these stored vectors are counted and added up. If the result is positive, the corresponding
output neuron outputs 1; otherwise it outputs 0 [11] [12].
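A minimal Java sketch of classification by radius of generalization follows, for a single output bit. It assumes that a stored output of 1 contributes +1 to the sum and a stored output of 0 contributes -1, which is one reading of "counted and added up" above. The class and method names are hypothetical.

```java
// Sketch of radius-of-generalization classification for one output bit.
// All names are hypothetical; the +1/-1 voting rule is an assumption.
class RadiusOfGeneralization {
    // Number of bit positions in which the two vectors differ.
    static int hammingDistance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                distance++;
            }
        }
        return distance;
    }

    // Classifies input using only the stored vectors within radius of it.
    // A stored output of 1 votes +1 and a stored output of 0 votes -1.
    static int classify(int[][] stored, int[] storedOutputs, int[] input, int radius) {
        int vote = 0;
        for (int i = 0; i < stored.length; i++) {
            if (hammingDistance(stored[i], input) <= radius) {
                vote += (storedOutputs[i] == 1) ? 1 : -1;
            }
        }
        // A positive sum gives 1; zero or negative gives 0.
        return vote > 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        int[][] stored = {{1, 1, 0, 0}, {0, 0, 1, 1}};
        int[] outputs = {1, 0};
        System.out.println(classify(stored, outputs, new int[] {1, 1, 1, 0}, 1));
    }
}
```

Note that an input with no stored vector inside the radius yields a sum of zero and is therefore classified as 0 in this sketch.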
2.6 Apache Spark
Apache Spark [14] is a cluster computing technology designed for fast computation. It
builds on the Hadoop MapReduce model and extends it to efficiently support more types of
computation, including interactive queries and stream processing. The main feature of
Spark is its in-memory cluster computing, which increases the processing speed of an
application. Spark is designed to cover a wide range of workloads, such as batch
applications, iterative algorithms, interactive queries, and streaming. By supporting all
these workloads in a single system, it reduces the management burden of maintaining
separate tools.
2.6.1 Terminology
Application
Application is a user program built on Spark that consists of a driver program and executors on
the cluster.
Driver Program
Driver program is the process running the main() function of the application and creating
the SparkContext.
Cluster Manager
Cluster manager is an external service for acquiring resources on the cluster.
Worker Node
Worker node is any node that can run application code in the cluster.
Executor
Executor is a process launched for an application on a worker node, that runs tasks and keeps data
in memory or disk storage across them. Each application has its own executors.
Task
Task is a unit of work that will be sent to one executor.
Job
Job is a parallel computation consisting of multiple tasks that is spawned in response to a
Spark action (e.g., save, collect).
Stage
Each job is divided into smaller sets of tasks, called stages, that depend on each
other (like the map and reduce stages in MapReduce).
Figure 2.3: A Simple Spark Topology
2.7 Tableau
Tableau [20] not only creates visualizations of data but also analyzes the data and applies
various forecasting and churn analysis methods. Tableau helps business users visualize data
efficiently and draw better insights from it. It connects to almost all available data
sources, in both matrix and multi-dimensional formats, via pre-built data connectors, and
helps create instantaneous dashboard visualizations in less time than conventional methods.
Aesthetics add to the functionality: Tableau provides the ability to change layouts,
colors, and alignments, and does so efficiently even for huge amounts of data [21].
CHAPTER 3
METHODOLOGY
3.1 Approach
Figure 3.1: Proposed Architecture
We propose a real-time system in which Apache Spark works as the real-time stream
processor. The proposed system consists of four steps:
• Data Collection
• Data Pre-processing
• Training
• Data Modeling
3.1.1 Data Collection
A tweet is a short text message, limited to 140 characters in length, posted by a user on
Twitter. Data related to the Zika virus is collected from Twitter. Apache Flume is used to
retrieve tweets from Twitter using keywords. The retrieved data is in JSON format and
contains the tweet text, username, timestamp, and location for tweets related to the Zika
virus.
3.1.2 Data Preprocessing
The data preprocessor module converts this data to text format. The data related to the
Zika virus is then converted into decimal format (for the MLP) and into unary format
(for CC4).
3.1.3 Training
The proposed system trains two types of neural network. The CC4 neural network is an
instantaneously trained neural network. The MLP neural network is a two-layered
feed-forward neural network trained with the backpropagation technique.
3.1.4 Data Modeling
Once the Zika virus has been detected, we visualize the data to provide useful Zika
information to health care providers. The visualization models include:
• Geographical Extent, to track the spread of Zika by geographic region by
measuring the volume of Zika tweets generated.
• Text Model, to discover useful information related to symptoms of Zika.
• Temporal Extent, to track changes in the volume of tweets over time.
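The geographical and temporal models both rest on counting detected Zika tweets, per region and per date respectively. The sketch below shows that aggregation step only, in plain Java. The `Tweet` record, its fields, and the sample values are hypothetical, and it makes no claim about how the visualization tool itself computes these counts.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the tweet-volume aggregation behind the geographical and
// temporal models. All names and the data layout are hypothetical.
class TweetVolume {
    // Minimal stand-in for a detected Zika tweet.
    static final class Tweet {
        final String location;
        final String date; // assumed to be an ISO date such as "2016-02-01"

        Tweet(String location, String date) {
            this.location = location;
            this.date = date;
        }
    }

    // Number of tweets per location (geographical extent).
    static Map<String, Integer> countByLocation(List<Tweet> tweets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Tweet t : tweets) {
            counts.merge(t.location, 1, Integer::sum);
        }
        return counts;
    }

    // Number of tweets per date (temporal extent); TreeMap keeps dates sorted.
    static Map<String, Integer> countByDate(List<Tweet> tweets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Tweet t : tweets) {
            counts.merge(t.date, 1, Integer::sum);
        }
        return counts;
    }
}
```

Sorted maps are used here so that a timeline can be read off directly in date order.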
3.2 Dataset
We use Twitter data, streamed using Apache Flume, for the Zika virus detection system. This
is the benchmark dataset, collected from January to July. It is a labeled dataset
consisting of tweets about the Zika virus that carry the #Zika hashtag.
Part of the dataset (the most recent part) is reserved for validating the model and is not
used in the training process. We use 60% of the Zika and non-Zika tweets for training
and the remaining 40% for testing. Our dataset consists of almost 2 million tweets
collected from January 2016 to July 2016.
3.2.1 Pre-processing Twitter Data
Data obtained from the Twitter application is in JSON format. We filter the JSON file into
a plain text file using specific keywords (username, location, tweets, symptoms,
timestamp). The text file contains some irrelevant data, because the specified keywords can
occur in multiple contexts.
3.2.2 Validation of Keywords extracted using Twitter by CDC
We referred to the CDC [18] website to validate the Zika keywords. Using these keywords,
we extracted the keywords from the Twitter dataset. Table 3.1 lists the symptoms recorded
by the CDC; we considered only these symptoms when building the dataset for training and
testing. This serves as the ground truth in our implementation [18].
Serial Number Symptoms of Zika recorded by CDC
1 Fever
2 Rash
3 Joint pain
4 Muscle pain
5 Conjunctivitis
6 Headache
7 Microcephaly
Table 3.1: Symptoms of Zika recorded by CDC
Table 3.2 shows examples of tweets mentioning keywords listed in Table 3.1. Many users
describe their Zika symptoms.
Data | Keywords Extracted | Location
Yo…Here, I come from playing football and I discover that I have RASH ...#Zika, are you?nicargua | Rash | Nicaragua
This is Florida: Rash-check. Conjunctivitis- check.Fever- check. Been out of country- negative.Prescription- Prednisone. #Zika | Rash, Conjunctivitis, Fever | Florida
Salmon to red coloured, maculopapular rash" marked #Sandi ego’s #Zika #sex case California, US | Red Eyes | California
#Zika Acute signs & symptoms: Joint Pain | Joint Pain | United States
Achy, fever, rash and that's it. How goofy is she. #Zika WLa Liberated, El Salvador | Fever, Rash | La Libertad, El Salvador
Cra. Rosario: #Minsa reports the first case of Microcephaly with a pregnant sister who went to private practice ... #Zika and rash Vichada, Colombia | Microcephaly | Vichada, Colombia
Today Adriana got the joint pain, #Zika story Dominican | Joint Pain | Dominican
My whole family already had #Zika. You may not experience all the symptoms. Most certainly rash, fever and headache.La_Union | Rash, Fever, Headache | La_Union
Um, uh. Thought pinpricks on my arm were from cat claws while I slept. Looking at it now, looks like rash and found I have Zika #Zika Colombia | Rash | Colombia
Table 3.2: Few Examples of Valid Data with keywords extracted
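The keyword extraction illustrated in Table 3.2 can be approximated with a simple case-insensitive substring match against the symptoms of Table 3.1. The sketch below is an assumption about how such matching could be done, not the thesis implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch of symptom keyword extraction from tweet text.
// All names are hypothetical; plain substring matching is an assumption.
class SymptomKeywordExtractor {
    // Symptoms from Table 3.1, lower-cased for case-insensitive matching.
    static final List<String> CDC_SYMPTOMS = Arrays.asList(
            "fever", "rash", "joint pain", "muscle pain",
            "conjunctivitis", "headache", "microcephaly");

    // Returns the symptoms that occur in the tweet, in Table 3.1 order.
    static List<String> extract(String tweetText) {
        String text = tweetText.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String symptom : CDC_SYMPTOMS) {
            if (text.contains(symptom)) {
                found.add(symptom);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Achy, fever, rash and that's it. #Zika"));
    }
}
```

Plain substring matching also matches inside longer words (for example "rash" inside "crash"), so a real pipeline would likely need word-boundary checks and handling of spelling variants such as "head ache".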
3.3 Implementation
3.3.1 Pre-processing of keywords - input for Neural Networks
In this step, we convert the input dataset into unary code and decimal format. The CC4
neural network accepts only unary code as input, whereas the MLP neural network accepts
input in decimal format.
3.3.2 Conversion into Decimal Format
The dataset in text format is converted into decimal format. Each keyword extracted during
preprocessing is assigned a unique ID, starting from 0, so that every keyword in the
dataset is represented by its identification number. Of the two neural networks in our
implementation, the MLP takes its input in decimal format, so no further processing is
needed for it. The CC4 network requires its input in unary format, so the data needs
further preprocessing. Direct conversion from decimal to unary is not possible, so we first
convert the decimal format to floating point and then to unary.
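As a minimal illustration of the ID assignment, the sketch below gives each distinct keyword the next unused integer, starting from 0, in order of first appearance. The ordering rule and all names are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of assigning unique decimal IDs to keywords, starting from 0.
// All names are hypothetical; first-appearance ordering is an assumption.
class KeywordIds {
    static Map<String, Integer> assignIds(List<String> keywords) {
        // LinkedHashMap preserves the order in which keywords first appear.
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String keyword : keywords) {
            // The current map size is the next unused ID; repeats keep their ID.
            ids.putIfAbsent(keyword, ids.size());
        }
        return ids;
    }
}
```

For example, the keyword sequence fever, rash, fever, headache would map fever to 0, rash to 1, and headache to 2.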
3.3.3 Conversion into Floating point
To convert the decimal data into unary format for the CC4 network, we calculate, for each
symptom, the ratio of the number of detected Zika cases that contain the symptom to the
total number of detected Zika cases. These floating-point ratios range from 0.0 to 0.9.
Each keyword is assigned a unique ID.
We generate a quantization mapping table based on the floating-point values. The mapping
table assigns a unary value to each range of floating-point values, so every keyword is
assigned a unary value.
Range Mapping
0.0 0000000000000000
0.00000001 - 0.0000001 0000000000000001
0.0000001 - 0.000001 0000000000000011
0.000001 - 0.00001 0000000000000111
0.00001 - 0.0001 0000000000001111
0.0001 - 0.001 0000000000011111
0.001 - 0.01 0000000000111111
0.01 - 0.1 0000000001111111
0.1 - 0.3 0000000011111111
0.3 - 0.55 0000000111111111
0.55 - 0.9 1111111111111111
Table 3.3: Quantization Mapping Table
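A lookup against Table 3.3 can be sketched in Java as follows. The table does not say whether range boundaries are inclusive, and it leaves values between 0 and 0.00000001 unspecified; the sketch assumes inclusive upper bounds and maps any positive value up to 0.0000001 to the first non-zero code. The class and method names are hypothetical.

```java
// Sketch of the ratio-to-unary lookup of Table 3.3.
// All names are hypothetical; boundary handling is an assumption.
class UnaryQuantizer {
    static final int WIDTH = 16;
    // Upper bound of each non-zero range in Table 3.3 (assumed inclusive).
    static final double[] UPPER = {1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.3, 0.55, 0.9};
    // Number of trailing 1s in the code for the corresponding range.
    static final int[] ONES = {1, 2, 3, 4, 5, 6, 7, 8, 9, 16};

    static String toUnary(double ratio) {
        if (Double.isNaN(ratio) || ratio < 0.0 || ratio > 0.9) {
            throw new IllegalArgumentException("ratio outside the table range: " + ratio);
        }
        int ones = 0; // a ratio of exactly 0.0 maps to all zeros
        if (ratio > 0.0) {
            for (int i = 0; i < UPPER.length; i++) {
                if (ratio <= UPPER[i]) {
                    ones = ONES[i];
                    break;
                }
            }
        }
        // Build a 16-character code with the 1s at the right-hand end.
        StringBuilder code = new StringBuilder(WIDTH);
        for (int i = 0; i < WIDTH; i++) {
            code.append(i < WIDTH - ones ? '0' : '1');
        }
        return code.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnary(0.05));
    }
}
```

The counts in `ONES` copy the table as printed, including the jump from nine 1s to sixteen 1s in the last range.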
3.3.4 Training and Testing Dataset
The proportions of Zika and non-Zika data we use are 70% and 30%, respectively. Part of the
dataset (the most recent part) is reserved for validating the model. For training and
testing, the proportions are 60% and 40%, respectively.
CC4 is efficient at detecting Zika from both known and unknown symptoms, but its results
have a low accuracy. To improve the accuracy of the results, we also consider the output of
the MLP [19]. The MLP output is compared with the CC4 output in the post-processing unit.
3.4 Implementation of CC4 in Apache Spark
Figure 3.2: Spark Architecture
The dataset is converted into unary in the pre-processing phase. CC4, which is an
instantaneously trained neural network, can detect Zika even if some of the symptoms are
unknown. In the CC4 training process, all the training inputs are processed using the CC4
training algorithm.
During training, the Spark driver application has its own executors on the cluster, which
keep running as long as the driver holds the Spark context. The Spark context is the main
entry point for Spark functionality and represents the connection to the Spark cluster
(either a Spark context or a Spark session can be used). The cluster manager sends the
processed data to the executor nodes. Once the input data is received, the number of 1s in
each input vector is counted. The bias neuron has its input fixed at 1. For each training
sample, the weights from the input layer to the hidden layer are computed: for each input
bit, the weight is 1 if the bit is 1 and -1 otherwise. The radius of generalization must be
chosen carefully, because it is a user-defined value and the output depends on it. For our
input, r = 2 gives CC4 better accuracy in detecting Zika. The weight of the bias neuron is
computed as weight = r - s + 1, where s is the number of 1s and r is the radius. The weights
are multiplied with the input. The weights from the hidden layer to the output layer are
then computed by the same rule: a bit of 1 gives weight 1, and any other bit gives
weight -1. The weighted values are added together. If the sum is above the threshold, the
output is 0; otherwise the output is 1. The result is stored as the CC4 trained data.
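The input-to-hidden weight rule and the bias weight r - s + 1 can be checked with a small plain-Java sketch that does not use Spark. For binary inputs, the resulting net input of a hidden neuron works out to r - d + 1, where d is the Hamming distance between the test vector and the training vector the neuron was built from. Under the usual CC4 convention (an assumption here, not taken from the text above), the neuron would fire when this value is positive, that is, when d is at most r. All names are hypothetical.

```java
// Sketch of CC4 input-to-hidden weights for one training vector.
// All names are hypothetical and the code is independent of Spark.
class Cc4HiddenNeuron {
    // Entries 0..n-1 are the input weights (+1 for a 1 bit, -1 for a 0 bit);
    // entry n is the bias weight r - s + 1, where s is the number of 1s.
    static int[] weights(int[] trainingBits, int radius) {
        int n = trainingBits.length;
        int[] w = new int[n + 1];
        int ones = 0;
        for (int i = 0; i < n; i++) {
            if (trainingBits[i] == 1) {
                w[i] = 1;
                ones++;
            } else {
                w[i] = -1;
            }
        }
        w[n] = radius - ones + 1;
        return w;
    }

    // Net input of the hidden neuron: weighted sum of the input bits plus
    // the bias weight times the constant bias input 1.
    static int netInput(int[] w, int[] inputBits) {
        if (w.length != inputBits.length + 1) {
            throw new IllegalArgumentException("weights must have one more entry than the input");
        }
        int sum = w[inputBits.length];
        for (int i = 0; i < inputBits.length; i++) {
            sum += w[i] * inputBits[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] w = weights(new int[] {1, 1, 0, 0}, 2);
        // Hamming distance 1 from the training vector, so net input is 2 - 1 + 1.
        System.out.println(netInput(w, new int[] {1, 1, 1, 0}));
    }
}
```

With r = 2 and the training vector 1100, an identical input gives a net input of 3, an input at distance 1 gives 2, and the complementary input 0011 at distance 4 gives -1.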
In the testing process, the output obtained for the testing data is compared with the actual
values of whether it is Zika or not. For different numbers of inputs, the best percentage match of
the output is calculated. For example, if we have 40 input data and only 35 give the correct
output, the percentage match accuracy is 35/40, i.e., 87.5%. After running on different inputs, we
found that a threshold of 80% on the best percentage match worked best. If the best percentage
match is above the threshold, CC4 outputs 0. If the best percentage match is below the threshold,
CC4 outputs 1. The executor node then sends this output to the post-processing unit for the
classification of Zika using known and unknown symptoms.
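The percentage match arithmetic and the threshold test can be written out as a short Java sketch. The class and method names are hypothetical, and treating a match exactly equal to the threshold as 0 is an assumption, since the text only covers the cases above and below the threshold.

```java
/** Minimal sketch of the best-percentage-match check. */
final class BestPercentageMatch {
    /** Percentage of test inputs whose output equals the actual value. */
    static double percentageMatch(int correct, int total) {
        if (total <= 0 || correct < 0 || correct > total) {
            throw new IllegalArgumentException("need total > 0 and 0 <= correct <= total");
        }
        return 100.0 * correct / total;
    }

    /**
     * CC4 output: 0 when the match is above the threshold, 1 when it is below.
     * Treating an exact tie as 0 is an assumption of this sketch.
     */
    static int cc4Output(double percentage, double thresholdPercent) {
        return percentage >= thresholdPercent ? 0 : 1;
    }

    public static void main(String[] args) {
        double match = percentageMatch(35, 40); // the 35-out-of-40 example
        System.out.println(match + "% -> CC4 output " + cc4Output(match, 80.0));
    }
}
```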
Implementation for training and testing:
The entry point into all functionality in Spark is the SparkSession class (which wraps the Spark context).
To create a SparkSession, we use SparkSession.builder().
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("CC4")
        .getOrCreate();
Set the path for the data and read it as a Java RDD (Resilient Distributed Dataset)
    JavaRDD<String> lines = spark.read().textFile().javaRDD(
Map the input, which is comma-separated, and load it into an RDD
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>()
        Arrays.asList(SPACE.split(s)).iterator();
Map the RDD for the first iteration of layers
    JavaPairRDD<String, String> ones = words.mapToPair(new PairFunction<String, String, String>()
Repeat until all input data is executed
    Count the number of ones (countones) in each input data for the bias neuron
    Assign the weights to each input data for the input layer
    for k = 1 to length-1 do
        if (temp[k-1].equals("1"))
            assignedforeachbit[k] = 1
            countones++
        else
            assignedforeachbit[k] = -1
    For the bias neuron, calculate the weight using the formula
    r is the radius of generalization (user defined)
    Initially biasfinalweight = 0
    biasweight = r - (countones) + 1
    if (biasweight == 1)
        biasfinalweight = 1
    else
        biasfinalweight = -1
    To compute the activation function
    weightsafteroriginalmultiply = biasweight * biasfinalweight;
    Map the RDD for the second layer
    JavaPairRDD<String, String> weightsofweightsafteroriginalmultiply = ones.mapToPair(new PairFunction<String, String, String>()
    Output 0 for known symptom and 1 for unknown symptom
End repeat
Figure 3.3: Best Percentage Graph for CC4 (x-axis: Number of input data; y-axis: Best Percentage Match)
3.5 Post-Processing Unit
The outputs from both neural networks, MLP and CC4, are input to the post-processing unit.
The executor node contains the output from CC4 and MLP. We create a data frame that holds both
outputs. Once it is created, the CC4 output and MLP output values are compared.
When the CC4 output is 1, the CC4 executor node calls MLP-2 and trains it for the new
symptom. The post-processing unit thus detects whether the input is known or unknown.
3.5.1 Known Symptoms
For known symptoms, the outputs from CC4 and MLP are compared, and the post-processing unit
gives the results below:
CC4 Output    MLP Output [19]    Final Output
0             0                  Zika
0             1                  Non-Zika
1             0                  Unknown
1             1                  Unknown
Table 3.4: Known Symptoms
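The decision table can be expressed as a small Java function. This is an illustrative sketch: the names PostProcessingUnit, Result and classify are hypothetical and not taken from the thesis code.

```java
/** Minimal sketch of the post-processing decision table (Table 3.4). */
final class PostProcessingUnit {
    enum Result { ZIKA, NON_ZIKA, UNKNOWN }

    /** Combines the CC4 and MLP outputs, each of which must be 0 or 1. */
    static Result classify(int cc4Output, int mlpOutput) {
        if ((cc4Output != 0 && cc4Output != 1) || (mlpOutput != 0 && mlpOutput != 1)) {
            throw new IllegalArgumentException("outputs must be 0 or 1");
        }
        if (cc4Output == 1) {
            return Result.UNKNOWN; // CC4 flags an unknown symptom regardless of the MLP output
        }
        return mlpOutput == 0 ? Result.ZIKA : Result.NON_ZIKA;
    }

    public static void main(String[] args) {
        for (int cc4 = 0; cc4 <= 1; cc4++) {
            for (int mlp = 0; mlp <= 1; mlp++) {
                System.out.println(cc4 + " " + mlp + " -> " + classify(cc4, mlp));
            }
        }
    }
}
```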
3.5.2 Unknown Symptoms
As discussed earlier, if the CC4 output is 1, the post-processing unit classifies the disease as
unknown, irrespective of the MLP output. When CC4 outputs 1, we introduce a new MLP in the
executor node, called MLP-2, which is implemented separately and works offline. MLP-2 is
trained with the new unknown symptoms that were input to CC4, which classified the input
data as Zika. MLP-2 is therefore trained to detect Zika for both known and previously
unknown symptoms, so it goes online and MLP-1 goes offline. The active MLP is thereby updated
and can detect the new symptom, which is no longer an unknown symptom.
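One way to picture the hand-over between the online and offline networks is an atomically swappable reference to whichever model is currently serving predictions. The sketch below is only an assumption about how such a switch could be built; ActiveModel, predict and promote are hypothetical names, and the lambdas stand in for trained MLPs.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.ToIntFunction;

/** Minimal sketch of swapping the online model once a retrained one is ready. */
final class ActiveModel {
    // The model currently online; maps a symptom vector to an output of 0 or 1.
    private final AtomicReference<ToIntFunction<int[]>> online;

    ActiveModel(ToIntFunction<int[]> initialModel) {
        online = new AtomicReference<>(initialModel);
    }

    /** Predicts with whichever model is online right now. */
    int predict(int[] symptoms) {
        return online.get().applyAsInt(symptoms);
    }

    /** Puts the retrained model online; the previous one is thereby taken offline. */
    void promote(ToIntFunction<int[]> retrainedModel) {
        online.set(retrainedModel);
    }

    public static void main(String[] args) {
        ActiveModel active = new ActiveModel(symptoms -> 1);  // stand-in for the first MLP
        System.out.println(active.predict(new int[] {1, 0, 1}));
        active.promote(symptoms -> 0);                        // stand-in for the retrained MLP-2
        System.out.println(active.predict(new int[] {1, 0, 1}));
    }
}
```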
Pseudocode:
for all the outputs, do
    Create the data frame for the outputs generated by CC4 and MLP
        DataFrame dataFrame = sqlContext.createDataFrame(data, LabeledPoint.class)
    Map the outputs from CC4 and MLP
        val cc4 = Seq("0", "1").map(Tuple1.apply)
        val mlp = Seq("0", "1").map(Tuple1.apply)
    Compare both output values
        if (cc4 == 0 && mlp == 0)
            output "0" indicating presence of Zika
        else if (cc4 == 0 && mlp == 1)
            output "1" indicating non-Zika
        else if (cc4 == 1 && (mlp == 1 || mlp == 0))  // indicates unknown symptom
            Call MLP-2 and implement MLP separately, train the unknown symptom and update [19]
                MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
            Train the data
                MultilayerPerceptronClassificationModel model = trainer.fit(train)
3.6 Data Visualization
The neural network only detects the presence of Zika virus; it does not give any information
about the region in which it has spread, the timeline, or the symptoms. We therefore generate
visual representations: an Activity Map of the geographical extent (Zika virus in different
locations) built from the locations in our dataset, a text model that shows symptoms, and a
temporal model that tracks the volume changes of the Zika tweets over time.
3.6.1 Geographical Extent
The goal of visualizing the geographical extent is to track the spread of Zika virus by geographic
region using the Zika tweets. A tweet or a user can have two types of location: a text-based user
profile location or a sensor-based geolocation. The user profile location is free-form text that
users enter to declare their home location. The sensor-based tweet location is an actual
geolocation (with longitude and latitude values) that a user provides with a tweet [6]. We consider
the user profile location and ignore invalid information, i.e., any text other than a country or
state name. The location information is chosen based on country name and state name.
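The filtering of profile locations can be sketched as follows. This is a hedged illustration: the class LocationCounter, the exact-match rule after trimming whitespace, and the list of valid names are all assumptions, since the thesis does not spell out how names are matched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Minimal sketch: count Zika tweets per valid country/state name. */
final class LocationCounter {
    static Map<String, Integer> countByLocation(List<String> profileLocations,
                                                Set<String> validNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : profileLocations) {
            if (raw == null) {
                continue; // no profile location given
            }
            String name = raw.trim();
            if (validNames.contains(name)) { // ignore anything that is not a known name
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> valid = new HashSet<>(Arrays.asList("Brazil", "Colombia"));
        List<String> locations =
            Arrays.asList("Brazil", " Colombia ", "somewhere nice", "Brazil", null);
        System.out.println(countByLocation(locations, valid));
    }
}
```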
Tableau is used to show the map distribution. Once we connect to the Tableau server, the steps
are as follows:
• Take the output from the neural networks and create a .csv file that contains the
information about Zika virus detection.
• Connect to the data using this .csv file as input.
• A work dashboard loads for the input data.
• Assign the country name a geographical role.
• The map is generated based on longitude and latitude values (country/state name).
• Filter the unrecognized data by selecting the respective country/state name in
place of the default location given.
• In Measures, select the value, i.e., the number of Zika tweets for each location.
• A map is generated for all the countries, and for each country with its respective
states.
3.6.2 Text Model
The goal of text modeling is to find useful health information. We are interested in investigating
Zika symptoms [6]. As discussed in the dataset collection, we create a keyword list. For example,
the keyword list for Zika virus is fever, muscle pain, conjunctivitis, joint pain, microcephaly,
headache, etc. For each symptom, we count the number of tweets that contain it. We can then
create pie charts and bar charts using the visualization tool Tableau.
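Counting tweets per symptom keyword can be sketched as below. The names are hypothetical, and the case-insensitive substring match with lowercase keywords is an assumption of this sketch; each tweet is counted at most once per symptom.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Minimal sketch: number of tweets that mention each symptom keyword. */
final class SymptomCounter {
    /** Keywords are expected in lowercase; matching is a simple substring test. */
    static Map<String, Integer> countTweetsPerSymptom(List<String> tweets,
                                                      List<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            counts.put(keyword, 0); // keep symptoms with no tweets in the result
        }
        for (String tweet : tweets) {
            String lower = tweet.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (lower.contains(keyword)) {
                    counts.merge(keyword, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "Fever and headache after my trip",
            "Zika linked to microcephaly",
            "High FEVER again today");
        List<String> keywords = Arrays.asList("fever", "headache", "microcephaly", "rash");
        System.out.println(countTweetsPerSymptom(tweets, keywords));
    }
}
```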
3.6.3 Temporal Model
The goal of temporal modeling is to track the volume changes of the Zika tweets over time. The
volume change of the keyword 'Zika' over time is a good reflection of the change in Zika activity
level over time [6]. For the temporal model, we count the number of Zika virus related tweets
generated. This data is used to create the Zika activity level timeline.
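The monthly tweet counts behind such a timeline can be sketched with java.time. The class name TweetTimeline and the choice of month-level buckets are assumptions made for illustration; the thesis does not state the bucket size at this point.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Minimal sketch: number of Zika-related tweets per calendar month. */
final class TweetTimeline {
    static SortedMap<YearMonth, Integer> countPerMonth(List<LocalDate> tweetDates) {
        SortedMap<YearMonth, Integer> counts = new TreeMap<>();
        for (LocalDate date : tweetDates) {
            counts.merge(YearMonth.from(date), 1, Integer::sum);
        }
        return counts; // sorted chronologically, ready to plot as a timeline
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(
            LocalDate.of(2016, 1, 5),
            LocalDate.of(2016, 1, 20),
            LocalDate.of(2016, 5, 2));
        System.out.println(countPerMonth(dates));
    }
}
```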
CHAPTER 4
FINDINGS
4.1 Accuracy of Best Percentage Match before Training Unknown Symptoms
While testing the detection of Zika virus using CC4, a best percentage match that falls below the
80% threshold is a sign that there is an unknown symptom in the data. To test this, data that
contains some previously unknown symptoms is input to the CC4 network. Our results
show that the best percentage match fell below 80%, signaling an unknown symptom. Later
we trained the MLP for these unknown symptoms. After training, the best percentage match with
these new symptoms rose above 80%, indicating that the MLP is trained for the new symptom. The
previously unknown symptoms in the data include symptoms like microcephaly.
Figure 4.1: Best Match Graph of CC4 Neural Network before Training
Figure 4.2: Best Match Graph of CC4 Neural Network after Training
4.2 Accuracy of Radius of Generalization
CC4 provides instantaneous results with minimal training. CC4 uses a radius of generalization
when calculating the weight for the bias neuron, which is how it differentiates trained symptoms
from new symptoms. We tested the accuracy for different user-defined radius of generalization
values, ROG = 0, 1, 2. Each radius value is trained and tested with different numbers of
input data. We found that CC4 gives the best accuracy when ROG = 2. In Figure 4.3, the x-axis is
the input data, which contains symptoms for both Zika and non-Zika.
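Why the bias weight makes r behave as a radius can be seen from a short derivation. It uses the weights of Section 3.4 (1 for a 1 bit of the training vector x, -1 for a 0 bit, and r - s + 1 for the bias) and assumes a binary step activation that fires on a positive sum. The symbols a, b and d are introduced only for this sketch, with y denoting the input vector.

```latex
\begin{align*}
a &= \#\{\, i : x_i = 1,\ y_i = 1 \,\}, \qquad b = \#\{\, i : x_i = 0,\ y_i = 1 \,\}, \\
d &= (s - a) + b \quad \text{(number of bits in which } x \text{ and } y \text{ differ)}, \\
\sum_i w_i\, y_i + (r - s + 1) &= a - b + r - s + 1 = r + 1 - d .
\end{align*}
```

Under that assumption the sum is positive exactly when d <= r. With ROG = 0 only an input identical to a training vector activates its hidden neuron, while ROG = 2 also accepts inputs that differ from it in up to two bits.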
Figure 4.3: Accuracy of CC4 for different ROG (x-axis: Total number of viruses; y-axis: Percentage Accuracy; series: ROG=0, ROG=1, ROG=2)
4.3 False Positives and False Negatives
Recognition systems are not 100% accurate, since they are trained using data which may or
may not match the data used for testing. Therefore, false positives and false negatives are
possible. In this thesis, we use the CC4 and MLP neural networks for
detection of Zika based on symptoms related to Zika. To test for false positives and false
negatives, we use 400 random inputs from the dataset, consisting of 180 known symptoms
and 220 unknown symptoms. The MLP neural network shows 0% false positives and 0% false
negatives, as shown in Figure 4.4. The CC4 neural network has 5.12% false positives and 0%
false negatives, as shown in Figure 4.5.
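False positive and false negative percentages can be computed from predicted and actual outputs as in the sketch below. The thesis does not state which definition it used, so the conventional rates (false positives over all actual negatives, false negatives over all actual positives) are an assumption here, as are the class name ErrorRates and the choice of which label counts as positive.

```java
/** Minimal sketch: false positive and false negative rates in percent. */
final class ErrorRates {
    /** Returns {falsePositiveRate, falseNegativeRate}; positiveLabel marks a detection. */
    static double[] rates(int[] predicted, int[] actual, int positiveLabel) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < predicted.length; i++) {
            boolean predictedPositive = predicted[i] == positiveLabel;
            boolean actuallyPositive = actual[i] == positiveLabel;
            if (predictedPositive && actuallyPositive) tp++;
            else if (predictedPositive) fp++;
            else if (actuallyPositive) fn++;
            else tn++;
        }
        // Guard against empty classes so the rates stay well defined.
        double fpRate = (fp + tn) == 0 ? 0.0 : 100.0 * fp / (fp + tn);
        double fnRate = (fn + tp) == 0 ? 0.0 : 100.0 * fn / (fn + tp);
        return new double[] {fpRate, fnRate};
    }

    public static void main(String[] args) {
        // 0 is taken as the positive label here, following the 0 = Zika output coding.
        double[] r = rates(new int[] {0, 0, 1, 1, 1}, new int[] {0, 1, 1, 1, 1}, 0);
        System.out.println("FP rate " + r[0] + "%, FN rate " + r[1] + "%");
    }
}
```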
Figure 4.4: False Positives and False Negatives for MLP (x-axis: Number of input values; y-axis: Output (0/1))
Figure 4.5: False Positives and False Negatives for CC4 (x-axis: Number of input values; y-axis: Output (0/1))
4.4 Visualization
We obtained the Zika detection results in the form of 0 and 1, which indicate the presence of Zika
and non-Zika virus respectively. To visualize these results and provide them to health care
providers in an efficient manner, we have developed three visualization models:
• Geographical Extent
• Text Model
• Temporal Extent
4.4.1 Geographical Extent Analysis
We generated the country/state name and the percentage of Zika tweets in the given country/state.
The color shade varies for each country. The darker locales show a high percentage of Zika virus,
and the lighter locales show a low percentage of Zika virus spread. Tableau shows the percentage
of Zika present when the cursor is placed on a country/state name.
Figure 4.6: Geographical Extent of Zika worldwide
The map below shows the distribution in Brazil, with the percentage of Zika virus present for
each state. The least spread of the virus is in the Roraima and Rio_de_Janeiro states, with 0.03%.
Bahia state has the highest percentage of Zika spread in Brazil, i.e., 14.39%.
Figure 4.7: Geographical Extent of Zika in Brazil
The map below shows Zika activity in Colombia with valid state names. The highest percentage
spread of Zika virus in a single state in Colombia is 18.5%, whereas in some states Zika is not
present at all.
Figure 4.8: Geographical Extent of Zika in Colombia
4.4.2 Text Model Analysis
In the text analysis, we obtained the percentage of each Zika symptom by investigating the contents
of the tweet results from the neural networks. From the pie chart produced with Tableau, it is
observed that the microcephaly symptom has the highest number (40.68%) of tweets generated,
followed by fever (12.30%) and rash (12.13%).
Figure 4.9: Text Analysis of Zika
The chart below shows the major symptom that affected each country. Brazil has more tweets on
the microcephaly symptom. The chart also shows that, among the major symptoms that affected each
country, microcephaly has more tweets generated. Darker colors indicate a greater number of
Zika-related tweets, and lighter colors a smaller number. The dark blue values
range from 9866 to 5110, the middle blue from 5110 to 2500, and the light blue from
2500 to 24.
Figure 4.10: Major Symptom affected - Country
4.4.3 Temporal Extent Analysis
We counted the number of Zika-related tweets generated from January to July with
the associated keywords. This data is used to create the Zika activity level timeline. The chart
below shows the percentage of the sum of tweets for every month. January has the
lowest extent of Zika spread. There is a drastic change in the number of Zika-related tweets
generated from April to May.
Figure 4.11: Zika Timeline Activity
CHAPTER 5
CONCLUSION
An efficient real-time Zika detection system has been developed using Apache Spark. The
detection mechanism in the proposed model provides instantaneous and accurate results because
it uses the CC4 instantaneous neural network and the multi-layered perceptron neural
network. Different models have been developed to visualize the geographical, text,
and temporal extent of the spread of Zika.
For future work, neural network techniques can be applied to detect the spread of other types
of diseases. Instead of Twitter data, we can consider datasets that include medical and behavioral
interventions to provide more accuracy. This work can also be implemented in the real world. Other
visualization tools such as QlikView, using data with more dimensions, can be used.
REFERENCES
[1] Zika Virus, https://www.cdc.gov/Zika/healtheffects/birth_defects.html, November 22
2016.
[2] Oscar Pacheco, Maurico Beltran. Zika Virus Disease in Colombia – Preliminary Report.
The New England Journal of Medicine, DOI: 10.1056/NEJMoa1604037, June 15 2016.
[3] Chibueze EC, Tirado V, da Silva Lopes K, Balogun OO, Takemoto Y, Swa T. Zika
virus infection in pregnancy: a systematic review of disease course and complications.
Bulletin World Health Organ., June 9 2016.
[4] Zuiyuan Guo, Dan Xiao, Dongli Li, Xiuhong Wang, Yayu Wang, Tiecheng
Yan, Zhiqi Wang. Predicting and Evaluating the Epidemic Trend of Ebola Virus
Disease in the 2014-2015 Outbreak and the Effects of Intervention Measures.
http://dx.doi.org/10.1371/journal.pone.0152438, April 6, 2016.
[5] Cory W. Morin, Andrew C. Comrie, Kacey Ernst. Climate and Dengue Transmission:
Evidence and Implications, Environmental Health Perspectives, Vol 121, Issue 11-12,
November-December 2013.
[6] Kathy Lee, Ankit Agrawal, Alok Choudhary. Real-Time Digital Flu Surveillance using
Twitter Data. Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining (KDD ’13), pp. 1474-1477, 2013.
[7] Igor Aleksander, Helen Morton, An Introduction to Neural Computing, Intl. Thomson
Computer Pr(T), October 1995.
[8] Artificial neuron. https://en.wikipedia.org/wiki/Artificial_neuron. Last accessed October
29, 2016.
[9] Kaushik Bose, An Introduction to Artificial neural network,
https://www.academia.edu/7468404/Artificial Neural Network, last accessed October 29,
2016.
[10] Goutam Mylavarapu. Instantaneous Intrusion Detection System. Master’s thesis,
Department of Computer Science, Oklahoma State University, 2015.
[11] Subhash Kak. New Algorithms for training feedforward neural networks. Pattern
Recognition Letters, Vol 15, No.3, 1994.
[12] Sumanth Reddy. Generalization and Efficient implementation of CC4 Neural Network.
Master’s thesis, Department of Computer Science, Oklahoma State University, 2008.
[13] Kun-Won Tang and Subhash C Kak, “A new corner classification approach to neural
network training”, Circuits, Systems and Signal Processing, Vol.17, No.4, pp. 459-469,
1998.
[14] Rohit Pillay. Instantaneous Intrusion Detection System. Master’s thesis, Department of
Computer Science, Oklahoma State University, 2010.
[15] Apache spark. https://spark.apache.org/. Last accessed 29 November 2016.
[16] Wikipedia. Spark https://en.wikipedia.org/wiki/spark. Last accessed November 18, 2016.
[17] Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia. Getting Started with
Spark. O’Reilly, 2015.
[18] Zika, http://www.cdc.gov/Zika/about/overview.html . Last accessed October 11, 2016.
[19] Durga Amruth Sagar, Real-Time Zika Virus Detection System With Known Symptoms
and Prediction. Master’s thesis, Department of Computer Science, 2016.
[20] Tableau. https://interworks.co.uk/business-intelligence/why-tableau/. Last accessed May
30, 2016.
[21] Katherine Noyes, Tableau’s new BI analytics Suite.
http://www.cio.com/article/2900253/six-features-coming-to-tableaus-new-bi-analytics-
suite.html. Last accessed March 20, 2016.
VITA
Srinagavalli Nandigam
Candidate for the Degree of
Master of Science
Thesis: REAL TIME ZIKA VIRUS DETECTION SYSTEM WITH UNKNOWN SYMPTOMS AND VISUALIZATION
Major Field: COMPUTER SCIENCE
Biographical:
Education:
Completed the requirements for the Master of Science in Computer Science at
Oklahoma State University, Stillwater, Oklahoma in December, 2016.
Completed the requirements for the Bachelor of Technology in Information
Technology at Hindustan College of Engineering, Chennai, India in May, 2011.
Experience:
Professional Memberships:

