
REAL TIME ZIKA VIRUS DETECTION SYSTEM

WITH UNKNOWN SYMPTOMS AND

VISUALIZATION

By

SRINAGAVALLI NANDIGAM

Bachelor of Technology in Information Technology

Hindustan College of Engineering

CHENNAI, INDIA

2007 - 2011

Submitted to the Faculty of the


Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
December, 2016




ProQuest Number: 10250145





Published by ProQuest LLC (2018). Copyright of the Dissertation is held by the Author.


Thesis Approved:

Dr. Johnson P Thomas

Thesis Adviser

Dr. David Cline

Dr. Ronak Etemadpour

ACKNOWLEDGEMENTS

I would first like to thank my advisor, Dr. Johnson P Thomas of the Computer Science Department
at Oklahoma State University. He has guided me during the research with his thoughtful insights
and his careful supervision. I would like to thank my committee members, Dr. David Cline and
Dr. Ronak Etemadpour, for their involvement in the research.

I would like to thank my parents, Markandeya Swami Nandigam and Lakshmi Susila Nandigam, and
my brother, Balakrishna Nandigam, for giving me unfailing moral support to complete the research.

I would also like to thank my friend Durga Amruth Sagar for helping me in my research. I would
like to thank my friends for all their support and encouragement.

Acknowledgements reflect the views of the author and are not endorsed by committee
members or Oklahoma State University.
Name: SRINAGAVALLI NANDIGAM

Date of Degree: DECEMBER, 2016

Title of Study: REAL TIME ZIKA VIRUS DETECTION SYSTEM: UNKNOWN SYMPTOMS AND VISUALIZATION

Major Field: COMPUTER SCIENCE

Abstract:

Zika is an infectious disease, and there is a need to detect it as early as possible. The
advent of social media provides an opportunity to detect Zika even before a doctor visit.
In this research, we use Twitter tweets to detect Zika. A real-time Zika virus detection
system using neural networks has been developed in this work. We use two different
neural networks, namely CC4 and the Multi-Layer Perceptron (MLP). The CC4 neural network
helps detect Zika cases that involve previously unknown symptoms, and the MLP neural
network helps detect known symptoms of Zika accurately. The outputs from these two neural
networks are used in the classification of Zika. Apache Spark is used for real-time
analysis of Twitter data. Once the virus has been detected, the information is useful
only if the data is presented in a form that healthcare providers and others can benefit
from. We developed three different models, namely Geographical, Text, and Temporal, to
visualize the data. Our results show that the Zika virus can be detected with 83%
accuracy using Twitter data.

TABLE OF CONTENTS

I. INTRODUCTION
    1.1 Problem Statement
    1.2 Problems in Earlier Works
    1.3 Proposed Solution

II. REVIEW OF LITERATURE
    2.1 Related Work
    2.2 Artificial Neural Networks (ANN)
    2.3 Neural Networks
    2.4 The Artificial Neuron
    2.5 CC4 Neural Network
    2.6 Apache Spark
        2.6.1 Terminology
    2.7 Tableau

III. METHODOLOGY
    3.1 Approach
        3.1.1 Data Collection
        3.1.2 Data Preprocessing
        3.1.3 Training
        3.1.4 Data Modeling
    3.2 Dataset
        3.2.1 Pre-processing Twitter Data
        3.2.2 Validation of Keywords extracted using Twitter by CDC
    3.3 Implementation
        3.3.1 Pre-processing of keywords as an input for Neural Networks
        3.3.2 Conversion into Decimal Format
        3.3.3 Conversion into Floating Point
        3.3.4 Training and Testing Dataset
    3.4 Implementation of CC4 in Apache Spark
    3.5 Post-Processing Unit
        3.5.1 Known Symptoms
        3.5.2 Unknown Symptoms
    3.6 Data Modeling
        3.6.1 Geographical Extent
        3.6.2 Text Model
        3.6.3 Temporal Extent

IV. FINDINGS
    4.1 Accuracy of Best Percentage Match before Training Unknown Symptoms
    4.2 Accuracy of Radius of Generalization
    4.3 False Positives and False Negatives
    4.4 Visualization
        4.4.1 Geographical Extent Analysis
        4.4.2 Text Model Analysis
        4.4.3 Temporal Extent Analysis

V. CONCLUSION

REFERENCES

LIST OF TABLES

3.1 Symptoms of Zika recorded by CDC
3.2 Few Examples of Valid Data with keywords extracted
3.3 Quantization Mapping Table
3.4 Known Symptoms

LIST OF FIGURES

2.1 An Artificial Neuron
2.2 General CC4 Network Architecture
2.3 A Simple Spark Topology
3.1 Proposed Architecture
3.2 Spark Architecture
3.3 Best Percentage Graph for CC4
4.1 Best Match Graph of CC4 Neural Network before Training
4.2 Best Match Graph of CC4 Neural Network after Training
4.3 Accuracy of CC4 for different ROG
4.4 False Positives and False Negatives for MLP
4.5 False Positives and False Negatives for CC4
4.6 Geographical Extent of Zika worldwide
4.7 Geographical Extent of Zika in Brazil
4.8 Geographical Extent of Zika in Colombia
4.9 Text Analysis of Zika
4.10 Major Symptom affected – Country
4.11 Zika Timeline Activity

CHAPTER 1

INTRODUCTION

The Zika virus was discovered on the African continent in 1947 and is carried by monkeys.
Since then, it has been found that Aedes mosquitoes also carry the virus. The virus also
spreads through sex, through blood transfusion, and from a pregnant woman to her fetus. An
infected pregnant woman can pass the virus to her fetus during pregnancy, causing microcephaly
in newborn babies. Microcephaly is a birth defect in which a baby's head is smaller than the
heads of babies of the same sex and age [1]. In many cases, the babies have smaller brains that
have not developed properly. Microcephaly also causes other problems, such as vision problems,
hearing loss, impaired growth, and difficulties with learning and problem solving. The major
symptoms of the Zika virus include fever, skin rash, headache, joint pain, conjunctivitis, and
muscle pain. A large number of people were affected in Brazil in 2015, and a few cases were
reported in other countries in 2016. Therefore, predicting the potential spread of the disease is
very important for timely health care intervention. In this work, we collect and analyze data
from social media to predict the spread of the Zika virus.

The main goal of this work is to use social media to detect Zika through real-time analysis of
Zika-related tweets. Once Zika has been detected, we visualize different attributes of the extent
of its spread.

1.1 Problem Statement

Infectious diseases such as Zika should be detected as early as possible; we cannot wait until a
patient has seen a doctor. Information on the spreading virus is gathered through social media
(Twitter tweets). Two types of data are considered: first, symptoms that are known to be related
to Zika; second, symptoms that may indicate Zika but are not yet known. Once Zika has been
detected, we visualize multiple dimensions of the spread of the disease, such as time, location,
and text.

1.2 Problems in Earlier Works

Earlier works have several limitations. Existing work [2] on detecting the spread of the Zika
virus is based on statistical analysis of stored datasets. No real-time analysis has been done so
far. The previous work is not scalable, because the use of social media data and stored health
data requires a big data approach. The different dimensions of the spread of Zika have also not
been considered in previous works.

1.3 Proposed Solution

We use social media and other data to detect Zika, which requires real-time streaming and
analysis of big data. Twitter data is streamed using Apache Flume, and we developed a real-time
Zika detection system using Apache Spark. Apache Spark is a distributed, fault-tolerant,
real-time stream processor for big data.

• A Multi-Layer Perceptron (MLP) neural network is used for detection with known
  symptoms as input data. It requires a lot of training and needs reliable symptom data.
  The MLP neural network uses existing knowledge to detect Zika.

• The CC4 neural network provides an instantaneous response. It requires minimal
  training and detects the Zika virus even if some of the symptoms are not known.

• The accuracy of the CC4 neural network is, however, lower than that of the MLP neural
  network.

• The outputs from these two neural networks (CC4 and MLP) are used for classification
  based on the known and unknown symptoms.

To provide Zika information in multiple dimensions to health care providers, the data is shown
visually based on:

• Location, which provides the geographical extent of the spread.

• Text, which provides a textual description of keywords to track the frequency of
  words in the tweet texts.

• Date, which provides the temporal extent of the spread.

The rest of the document is divided into four chapters. Chapter 2 includes a review of the
literature and a description of neural networks and Apache Spark. Chapter 3 describes the
proposed architecture, the pre-processing of the data, and the methodology of the Zika virus
detection system. Chapter 4 presents the results of the simulations. Chapter 5 concludes the
thesis with suggestions for future work.

CHAPTER 2

REVIEW OF LITERATURE

2.1 Related Work

With real-time streaming of virus infection data, machine learning can be used to detect a virus,
in our case Zika. Existing techniques to detect the spread of the Zika virus are based on
statistical analysis. Our goal is to apply machine learning to social media data. Social media
provides the first clues to potential diseases.

In 2016, the New England Journal of Medicine published a report on the Zika virus in Colombia.
The authors used the national population-based surveillance system to assess patients with
symptoms of the Zika virus during August 2015-2016. They also evaluated the microcephaly test
reports of infected pregnant women [2]. The research concerned primarily pregnant women and
microcephaly. Data was extracted independently based on study designs, countries, key findings,
symptoms, childbirth, and pregnancy [3].

In 2015, Dan Xiao and Dongli Li published an open article on predicting epidemic trends and
evaluating an intervention for Ebola virus disease in 2014-2015 [4]. The spread of the Ebola
virus is evaluated based on the periodic variation of the disease, using differential equations in
a susceptible, infective, and removed model [4]. To predict the transmission patterns of Ebola,
they constructed a compartment model. The numbers of reported Ebola cases and deaths are
compared based on the data provided by the World Health Organization. These models suggested
that early detection and diagnosis are required to control major outbreaks of Ebola virus
disease [4].

In 2013, Cory W. Morin, Andrew C. Comrie, and Kacey Ernst published a paper on how the dengue
virus has spread widely and affected millions of people [5]. Mosquitoes of the Aedes genus
transmit the dengue virus. Analysis has shown that nearly 400 million cases may occur per
year [18]. The researchers developed a hypothesized relationship among Aedes mosquitoes,
weather, dengue, and climate. They drew the relationships based on laboratory results and
performed statistical analysis [18]. The test results are generated by analysis of the
relationship between climate and dengue transmission, laboratory results, and field studies on
the vector and the dengue virus. They drew predictive analysis based on climate and weather
data [18].

In 2013, a paper by Kathy Lee, Ankit Agrawal, and Alok Choudhary on real-time digital flu
surveillance in the United States used data from Twitter. They built a novel surveillance system
that uses Twitter data to track flu and cancer activities in real time [6]. They presented results
visually as US disease surveillance maps and as distributions and timelines of disease types,
symptoms, and treatments [6].

2.2 Artificial Neural Networks (ANN)

An artificial neural network [8] is a computational model based on the structure of biological
neural networks. A neural network learns from its inputs and outputs, and it functions like the
brain. It is composed of a large number of interconnected processing neurons that work together
to solve specific problems. An artificial neural network is configured for a specific application,
such as pattern recognition or data classification, through a learning process. There are
different types of neural networks, but learning is done in two ways: supervised learning and
unsupervised learning.

In supervised learning, the neural network is provided with both input and output datasets during
training so that it learns to produce the desired outputs. In unsupervised learning, the network
learns from the characteristics of the input data; the output is not known.

2.3 Neural Networks

Neural networks are widely used in pattern recognition because of their ability to generalize and
to respond to unexpected inputs or patterns. Usually, a neural network has three layers: an input
layer, a hidden layer, and an output layer. The input layer is connected to the hidden layer, and
the hidden layer is connected to the output layer. Some neural networks, such as the perceptron,
have no hidden layer.

Neural networks learn over time through training. During training, neurons are taught to
recognize specific patterns and whether or not to fire when a given pattern is received.

2.4 The Artificial Neuron

An artificial neuron [7] is a mathematical function conceived as a model of a biological neuron.
Artificial neurons are the constitutive units of an artificial neural network. The artificial
neuron receives one or more inputs (representing dendrites) and sums them to produce an output
(representing a neuron’s axon) [22]. A biological neuron has three important components:
dendrites, soma, and axon.

Dendrites receive signals from other neurons. The signals are electric impulses that are
transmitted across a synaptic gap. The soma sums the incoming signals, i.e., the input signals
multiplied by the weights. When sufficient input is provided, the cell fires: when the sum is
greater than or equal to a threshold value, the cell fires its output [23].

Figure 2.1: An Artificial Neuron
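
The firing rule described in this section can be sketched in a few lines of Java. This is only an illustration and not part of the thesis implementation: the class name ThresholdNeuron, the method fire, and any weights or threshold passed to it are assumptions chosen for clarity.

```java
// Hypothetical illustration of a threshold neuron; names are not from the thesis.
final class ThresholdNeuron {

    /** Returns 1 if the weighted sum of the inputs reaches the threshold, else 0. */
    static int fire(double[] inputs, double[] weights, double threshold) {
        if (inputs.length != weights.length) {
            throw new IllegalArgumentException("inputs and weights must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i]; // each input signal multiplied by its weight
        }
        return sum >= threshold ? 1 : 0;   // fire when the sum is >= the threshold
    }
}
```

For instance, with weights of 0.5 on three inputs and a threshold of 1.0, the neuron fires when at least two of the inputs are 1.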

2.5 CC4 Neural Network

The CC4 neural network is an instantaneously trained neural network proposed by Kak [10]
[11]. CC4 is a feed-forward neural network. It is built around fast learning, because biological
neurons produce instantaneous results. It has three layers:

• Input layer

• Hidden layer

• Output layer

The input layer takes its input in unary format, so all inputs are converted into unary format.
For each input, a bias neuron with its value set to 1 is also considered. The weights are assigned
from the input layer to the hidden layer. The neurons in the input layer and the hidden layer are
fully connected. Each neuron in the hidden layer corresponds to a single training sample in the
training dataset. The output layer provides the output of the network. Like the input layer and
the hidden layer, the hidden layer and the output layer are fully connected [12]. The general
architecture of the CC4 neural network is shown in Figure 2.2 [12] [13]:

Figure 2.2: General CC4 Network Architecture

CC4 uses a concept known as the radius of generalization. This helps classify input vectors based
on the class of the stored vectors. If the Hamming distance between a new input vector and any
of the stored vectors is less than or equal to the user-specified radius, the outputs of all such
stored vectors are considered when generating the output for the input vector. The numbers of 1s
and 0s in every bit position of the output vectors of all these stored vectors are counted and
added up. If the result is positive, the corresponding output neuron outputs 1; otherwise, the
output is 0 [11] [12].
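
The neighborhood test behind the radius of generalization can be illustrated with a short Java sketch. It only shows how the Hamming distance selects the stored vectors that fall within the radius; the class and method names are hypothetical, and vectors are represented as strings of '0' and '1' purely for readability.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; illustrates the Hamming-distance test only.
final class RadiusOfGeneralization {

    /** Number of bit positions at which two equal-length bit strings differ. */
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Indices of the stored vectors whose distance to the input is at most the radius. */
    static List<Integer> withinRadius(String input, List<String> stored, int radius) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < stored.size(); i++) {
            if (hammingDistance(input, stored.get(i)) <= radius) {
                matches.add(i);
            }
        }
        return matches;
    }
}
```

With a radius of 1, the input 0011 matches the stored vectors 0111 (distance 1) and 0011 (distance 0), but not 1100 (distance 4).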

2.6 Apache Spark

Apache Spark [14] is a cluster computing technology designed for fast computation. It builds on
the Hadoop MapReduce model and extends it to support more types of computation efficiently,
including interactive queries and stream processing. The main feature of Spark is its in-memory
cluster computing, which increases the processing speed of an application. Spark is designed to
cover a wide range of workloads, such as batch applications, iterative algorithms, interactive
queries, and streaming. Besides supporting all these workloads in a single system, it reduces the
management burden of maintaining separate tools.

2.6.1 Terminology

Application

Application is a user program built on Spark that consists of a driver program and executors on

the cluster.

Driver Program

Driver program is the process running the main() function of the application and creating the
SparkContext.

Cluster Manager

Cluster manager is an external service for acquiring resources on the cluster.

Worker Node

Worker node is any node that can run application code in the cluster.

Executor

Executor is a process launched for an application on a worker node, that runs tasks and keeps data

in memory or disk storage across them. Each application has its own executors.

Task

Task is a unit of work that will be sent to one executor.

Job

Job is a parallel computation consisting of multiple tasks that gets spawned in response to a Spark

action (e.g. save, collect).

Stage

Each job is divided into smaller sets of tasks called stages that depend on each other (like the
map and reduce stages in MapReduce).

Figure 2.3: A Simple Spark Topology

2.7 Tableau

Tableau [20] not only creates visualizations of data but also analyzes the data and uses various
forecasting and churn analysis methods. Tableau helps business users visualize data efficiently
and draw better insights. It connects to almost all available data sources via pre-built data
connectors, in both matrix and multi-dimensional formats, and helps create instantaneous
dashboard visualizations in less time than conventional methods. Aesthetics add to the
functionality, as Tableau provides the ability to change layout, colors, and alignments, and it
does so efficiently for huge amounts of data too [21].

CHAPTER 3

METHODOLOGY

3.1 Approach

Figure 3.1: Proposed Architecture

We propose a real-time system in which Apache Spark works as a real-time stream processor.
The proposed system consists of four steps:

• Data Collection

• Data Pre-processing

• Training

• Data Modeling

3.1.1 Data Collection

A tweet is a short text message, limited to 140 characters in length, posted by a user on Twitter.
Data related to the Zika virus is collected from Twitter. Apache Flume is used to retrieve data
from Twitter using keywords. The retrieved data is in JSON format and contains the tweet text,
username, timestamp, and location related to the Zika virus.

3.1.2 Data Preprocessing

The data preprocessor module converts this data to text format. The data related to the Zika
virus is then converted into decimal format (for the MLP) and into unary format (for CC4).

3.1.3 Training

The proposed system uses two types of neural network training. The CC4 neural network is an
instantaneously trained neural network. The MLP neural network is a two-layered feed-forward
neural network that is trained using the backpropagation technique.

3.1.4 Data Modeling

Once the Zika virus has been detected, we visualize the data to provide useful Zika information to
health care providers. The visualization models include:

• Geographical Extent, to track the spread of Zika by geographic region by
  measuring the volume of Zika tweets generated.

• Text Model, to discover useful information related to the symptoms of Zika.

• Temporal Extent, to track changes in the volume of tweets over time.

3.2 Dataset

We use Twitter data, streamed using Apache Flume, for the Zika virus detection system. It is the
benchmark dataset, collected from January to July. It is a labeled dataset consisting of
Zika virus tweets carrying the #Zika hashtag.

Part of the dataset (the most recent part) is reserved for validating the model and is not
used in the training process. We use 60% of the Zika and non-Zika tweets for training
and the remaining 40% for testing. Our dataset consists of almost 2 million tweets
collected from January 2016 to July 2016.

3.2.1 Pre-processing Twitter Data

Data obtained from the Twitter application is in JSON format. We filtered the JSON file into a
plain text file using specific keywords (username, location, tweets, symptoms, timestamp).
The text file still contains some irrelevant data, because the specified keywords can occur in
multiple contexts.

3.2.2 Validation of Keywords extracted using Twitter by CDC

We referred to the CDC [18] website to validate the Zika keywords. Using these keywords,
we extracted the keywords from the Twitter dataset. Table 3.1 lists the symptoms recorded by
the CDC; we considered only those symptoms (dataset) for training and testing purposes.
This serves as the ground truth in our implementation [18].

Serial Number    Symptoms of Zika recorded by CDC
1                Fever
2                Rash
3                Joint pain
4                Muscle pain
5                Conjunctivitis
6                Headache
7                Microcephaly

Table 3.1: Symptoms of Zika recorded by CDC

Table 3.2 shows examples of tweets mentioning the keywords listed in Table 3.1. Many users
describe their Zika symptoms.

Data | Keywords Extracted | Location
Yo…Here, I come from playing football and I discover that I have RASH ...#Zika, are you?nicargua | Rash | Nicaragua
This is Florida: Rash-check. Conjunctivitis- check.Fever- check. Been out of country- negative.Prescription- Prednisone. #Zika | Rash, Conjunctivitis, Fever | Florida
Salmon to red coloured, maculopapular rash" marked #Sandi ego’s #Zika #sex case California, US | Red Eyes | California
#Zika Acute signs & symptoms: Joint Pain | Joint Pain | United States
Achy, fever, rash and that's it. How goofy is she. #Zika WLa Liberated, El Salvador | Fever, Rash | La liberated, El Salvador
Cra. Rosario: #Minsa reports the first case of Microcephaly with a pregnant sister who went to private practice ... #Zika and rash Vichada, Colombia | Microcephaly | Vichada, Colombia
Today Adriana got the joint pain, #Zika story Dominican | Joint Pain | Dominican
My whole family already had #Zika. You may not experience all the symptoms. Most certainly rash, fever and headache.La_Union | Rash, Fever, Head Ache | La_Union
Um, uh. Thought pinpricks on my arm were from cat claws while I slept. Looking at it now, looks like rash and found I have Zika #Zika Colombia | Rash | Colombia

Table 3.2: Few Examples of Valid Data with keywords extracted
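
The keyword extraction illustrated in Table 3.2 could be approximated as follows. This is a minimal sketch under a stated assumption: the thesis does not specify the matching rule, so case-insensitive substring matching against the CDC symptoms of Table 3.1 is assumed here, and the class name SymptomExtractor is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; the matching rule (case-insensitive substring) is an assumption.
final class SymptomExtractor {

    // Symptoms recorded by the CDC, as listed in Table 3.1.
    static final List<String> SYMPTOMS = Arrays.asList(
            "fever", "rash", "joint pain", "muscle pain",
            "conjunctivitis", "headache", "microcephaly");

    /** Returns the Table 3.1 symptoms mentioned in the tweet, in table order. */
    static List<String> extract(String tweet) {
        String text = tweet.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String symptom : SYMPTOMS) {
            if (text.contains(symptom)) {
                found.add(symptom);
            }
        }
        return found;
    }
}
```

A simple matcher like this would miss variants such as "Head Ache" or "Red Eyes", which is one reason the extracted text can still contain irrelevant or incomplete data.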

3.3 Implementation

3.3.1 Pre-processing of keywords - input for Neural Networks

In this process, we convert the input dataset to unary code and to decimal format. The CC4
neural network accepts only unary code as input. The MLP neural network accepts
input in decimal format.

3.3.2 Conversion into Decimal Format

The dataset in text format is converted into decimal format. Each keyword extracted during
preprocessing is assigned a unique ID, starting from 0. The ID is different for each keyword, so
the resulting dataset represents each keyword by its unique identification number. Of the two
neural networks used in our implementation, the MLP requires its input in decimal format, so
no further processing of the data is needed for the MLP. The other neural network, CC4, requires
its input in unary format, so the data must undergo further preprocessing. Direct conversion from
decimal to unary is not possible, so we first convert the decimal format to floating point and
then to unary.
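
The ID assignment described above can be sketched as follows. The class name KeywordIds and the choice to number keywords in order of first appearance are assumptions made for illustration; the thesis only states that each keyword receives a unique ID starting from 0.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; IDs follow first-appearance order, which is an assumption.
final class KeywordIds {

    /** Assigns each distinct keyword a unique ID, starting from 0. */
    static Map<String, Integer> assignIds(List<String> keywords) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String keyword : keywords) {
            // The next free ID equals the number of keywords seen so far.
            ids.putIfAbsent(keyword, ids.size());
        }
        return ids;
    }
}
```

Repeated keywords keep the ID they were given the first time, so the mapping stays stable across the dataset.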

3.3.3 Conversion into Floating point

To convert the decimal-format data into unary format for the CC4 network, we calculate, for each
symptom, the ratio of the number of Zika detections that contain the symptom to the total number
of detections. The floating-point ratios range from 0.0 to 0.9. Each keyword is assigned a
unique ID.

We generate a quantization mapping table based on the floating-point values. The mapping table
includes a unique unary value for each range of floating-point values. Therefore, every keyword
is assigned a unary value.

Range                      Mapping
0.0                        0000000000000000
0.00000001 - 0.0000001     0000000000000001
0.0000001 - 0.000001       0000000000000011
0.000001 - 0.00001         0000000000000111
0.00001 - 0.0001           0000000000001111
0.0001 - 0.001             0000000000011111
0.001 - 0.01               0000000000111111
0.01 - 0.1                 0000000001111111
0.1 - 0.3                  0000000011111111
0.3 - 0.55                 0000000111111111
0.55 - 0.9                 1111111111111111

Table 3.3: Quantization Mapping Table
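
A lookup over Table 3.3 might be written as in the sketch below. The range edges and the number of 1s per range are copied from the table; everything else is an assumption made for illustration, including the class name UnaryQuantizer, the treatment of a shared boundary value as belonging to the lower range, and the rejection of ratios that the table does not cover.

```java
import java.util.Arrays;

// Hypothetical sketch of a lookup over Table 3.3; boundary handling is an assumption.
final class UnaryQuantizer {

    private static final int WIDTH = 16;
    // Upper edges of the non-zero ranges in Table 3.3.
    private static final double[] UPPER = {1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.3, 0.55, 0.9};
    // Number of trailing 1s in the mapping for each range, copied from Table 3.3.
    private static final int[] ONES = {1, 2, 3, 4, 5, 6, 7, 8, 9, 16};

    /** Maps a ratio to its 16-bit unary code according to Table 3.3. */
    static String quantize(double ratio) {
        if (ratio == 0.0) {
            return unary(0);
        }
        if (ratio < 1e-8 || ratio > 0.9) {
            // The table defines no mapping for these values.
            throw new IllegalArgumentException("ratio outside Table 3.3: " + ratio);
        }
        for (int i = 0; i < UPPER.length; i++) {
            if (ratio <= UPPER[i]) { // assumption: upper edge belongs to the lower range
                return unary(ONES[i]);
            }
        }
        throw new IllegalStateException("unreachable");
    }

    /** Builds a 16-character string with the given number of trailing 1s. */
    private static String unary(int ones) {
        char[] bits = new char[WIDTH];
        Arrays.fill(bits, '0');
        Arrays.fill(bits, WIDTH - ones, WIDTH, '1');
        return new String(bits);
    }
}
```

For example, a ratio of 0.2 falls in the 0.1 - 0.3 range and maps to 0000000011111111.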

3.3.4 Training and Testing Dataset

The proportions of Zika and non-Zika data we use are 70% and 30%, respectively. Part of the
dataset (the most recent part) is reserved for validating the model. For training and testing,
the proportions are 60% and 40%, respectively.

In this process, CC4 is efficient at finding Zika with known and unknown symptoms, but the
correctness of its results is low. To improve the accuracy of the results, we also consider the
output of the MLP [19]. The MLP output is compared with the CC4 output in the post-processing
unit.
3.4 Implementation of CC4 in Apache Spark

Figure 3.2: Spark Architecture

The dataset is converted into unary in the pre-processing phase. CC4 which is an instantaneously

trained neural network can detect Zika even if some of the symptoms are unknown. In the CC4

training process, all the training inputs are processed using CC4 training algorithm.

During training, the Spark driver application has its own executors on the cluster, which remain running as long as the driver application holds the Spark context. The Spark context is the main entry point for Spark functionality and represents the connection to the cluster (either a Spark context or a Spark session can be used to establish it). The cluster manager sends the processed data to the executor nodes.

Once the input data is received, the number of 1's in each input is counted. The bias neuron has its constant value set to 1. For each training input, the weights from the input layer to the hidden layer are computed: for each input bit, if it is 1 then assign weight = 1, else assign weight = -1. We need to choose a proper radius of generalization. The radius must be chosen carefully because it is a user-defined value and the output changes depending on it. For our input, CC4 detects Zika most accurately with r = 2. We calculate the weight for the bias neuron using the formula weight = r - s + 1, where s is the number of 1's in the input and r is the radius. The weights are multiplied with the input. The weights from the hidden layer to the output layer are then computed: for each input bit, if it is 1 then assign weight = 1, otherwise assign weight = -1. The weighted values are added together. If the summation value is above the threshold, a 0 is output; otherwise a 1 is output. The obtained result is stored as the CC4 trained data.
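The training and classification steps described above can be sketched in plain Java, without Spark. This is a minimal sketch under stated assumptions, not the implementation used in this thesis: the class and method names are hypothetical, the hidden-to-output weights follow the usual CC4 convention of using each training sample's target bit (+1 for a target of 1, -1 for a target of 0), a hidden neuron is assumed to fire when its net input is positive, and the result is left in the conventional coding where 1 means the output neuron fires.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal CC4 sketch: one hidden neuron per training sample (all names are hypothetical). */
final class Cc4Sketch {
    private final int radius;                                     // radius of generalization r
    private final List<int[]> hiddenWeights = new ArrayList<>();  // last entry of each row is the bias weight
    private final List<Integer> outputWeights = new ArrayList<>();

    Cc4Sketch(int radius) {
        this.radius = radius;
    }

    /** Bias weight from the text: r - s + 1, where s is the number of 1's in the input. */
    static int biasWeight(int radius, int ones) {
        return radius - ones + 1;
    }

    /** Adds one hidden neuron for a binary training vector and its target bit. */
    void train(int[] input, int target) {
        int[] w = new int[input.length + 1];
        int ones = 0;
        for (int i = 0; i < input.length; i++) {
            w[i] = input[i] == 1 ? 1 : -1;   // +1 for a 1 bit, -1 for a 0 bit
            ones += input[i];
        }
        w[input.length] = biasWeight(radius, ones);
        hiddenWeights.add(w);
        outputWeights.add(target == 1 ? 1 : -1); // assumption: weight taken from the target bit
    }

    /** Returns 1 if the summed output weights of the firing hidden neurons are positive, else 0. */
    int classify(int[] input) {
        int sum = 0;
        for (int n = 0; n < hiddenWeights.size(); n++) {
            int[] w = hiddenWeights.get(n);
            int net = w[input.length];       // the bias input is the constant 1
            for (int i = 0; i < input.length; i++) {
                net += w[i] * input[i];
            }
            if (net > 0) {                   // binary step activation
                sum += outputWeights.get(n);
            }
        }
        return sum > 0 ? 1 : 0;
    }
}
```

With a radius of 1, a network trained on the vector 1100 with target 1 also returns 1 for 1000, which differs from the stored vector in one bit, but returns 0 for 0000, which differs in two.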

In the testing process, the output obtained from the testing data is compared with the actual values of whether it is Zika or not. For different numbers of inputs, the best percentage match of the output is calculated. For example, if we have 40 inputs and only 35 give the correct output, the percentage match is 35/40, i.e., 87.5%. After running on different inputs, we found that the best percentage occurred at 80%. If the best percentage match is above the threshold, CC4 outputs 0. If the best percentage match is below the threshold, CC4 outputs 1. The executor node then sends this output to the post-processing unit for the classification of Zika using known and unknown symptoms.
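The percentage match and the threshold decision can be illustrated with a short Java sketch. The class and method names are hypothetical, and the handling of a match exactly equal to the threshold (treated here as not above it) is an assumption, since the text only covers the cases above and below.

```java
/** Sketch of the percentage-match calculation and threshold decision (names are hypothetical). */
final class BestMatchSketch {
    /** Percentage of predictions that equal the actual labels. */
    static double percentMatch(int[] predicted, int[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        return 100.0 * correct / predicted.length;
    }

    /** Coding from the text: 0 when the match is above the threshold, 1 otherwise. */
    static int cc4Output(double percentMatch, double thresholdPercent) {
        return percentMatch > thresholdPercent ? 0 : 1;
    }
}
```

For the example in the text, 35 correct outputs out of 40 give a match of 87.5, which is above a threshold of 80 and therefore maps to an output of 0.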

Implementation for training and testing:

The entry point into all functionality in Spark is the SparkSession class (which wraps the Spark context). To create a SparkSession, we use SparkSession.builder().

    SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("CC4")
            .getOrCreate();

Set the path for the data and load it as a Java RDD (Resilient Distributed Dataset). Here path stands for the location of the input data.

    JavaRDD<String> lines = spark.read().textFile(path).javaRDD();

Map the input, which is comma separated, and load it into an RDD. SPACE is the separator pattern.

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterator<String> call(String s) {
            return Arrays.asList(SPACE.split(s)).iterator();
        }
    });

Map the RDD for the first iteration of layers. The body of the pair function follows the steps listed next.

    JavaPairRDD<String, String> ones = words.mapToPair(new PairFunction<String, String, String>() { ... });
Repeat until all input data is executed

Count the number of ones (countones) in each input data for the bias neuron

Assign the weights to each input data for the input layer

    for k = 1 to length-1 do
        if (temp[k-1].equals("1"))
            assignedforeachbit[k] = 1
            countones++
        else
            assignedforeachbit[k] = -1

For the bias neuron, calculate the weight using the formula, where r is the radius of generalization (user defined)

    Initially biasfinalweight = 0
    biasweight = r - countones + 1
    if (biasweight == 1)
        biasfinalweight = 1
    else
        biasfinalweight = -1

To compute the activation function

    weightsafteroriginalmultiply = biasweight * biasfinalweight

Map the RDD for the second layer

    JavaPairRDD<String, String> weightsofweightsafteroriginalmultiply = ones.mapToPair(new PairFunction<String, String, String>() { ... });

Output 0 for known symptom and 1 for unknown symptom

End repeat

[Chart: x-axis Number of input data (0 to 450); y-axis Best Percentage Match (0 to 120); series Best Match]

Figure 3.3: Best Percentage Graph for CC4

3.5 Post-Processing Unit

The outputs from both neural networks, MLP and CC4, are input to the post-processing unit. The executor node holds the outputs from CC4 and MLP. We create a data frame to hold these outputs. Once it is created, the CC4 and MLP output values are compared. When the CC4 output is 1, the CC4 executor node calls MLP-2 and trains it for the new symptom. The post-processing unit thus detects whether the input is known or unknown.

3.5.1 Known Symptoms

Here the outputs from CC4 and MLP are compared, and the post-processing unit gives the results below:

    CC4 Output    MLP Output [19]    Final Output
    0             0                  Zika
    0             1                  Non-Zika
    1             0                  Unknown
    1             1                  Unknown

Table 3.4: Known Symptoms
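The decision table can be expressed as a small Java function. This is an illustrative sketch only; the class, enum, and method names are hypothetical and are not part of the Spark implementation.

```java
/** Sketch of the post-processing decision in Table 3.4 (names are hypothetical). */
final class PostProcessingSketch {
    enum Result { ZIKA, NON_ZIKA, UNKNOWN }

    /** Combines the CC4 and MLP outputs, each of which must be 0 or 1. */
    static Result combine(int cc4, int mlp) {
        if ((cc4 != 0 && cc4 != 1) || (mlp != 0 && mlp != 1)) {
            throw new IllegalArgumentException("outputs must be 0 or 1");
        }
        if (cc4 == 1) {
            return Result.UNKNOWN;           // CC4 = 1 overrides the MLP output
        }
        return mlp == 0 ? Result.ZIKA : Result.NON_ZIKA;
    }
}
```

Checking the CC4 output first mirrors the table: whenever CC4 reports 1, the MLP output no longer affects the final result.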

3.5.2 Unknown Symptoms

As discussed earlier, if the CC4 output is 1, the post-processing unit classifies the disease as unknown, irrespective of the MLP output. In that case, we introduce a new MLP in the executor node, called MLP-2, which works offline and is implemented separately. MLP-2 is trained with the new unknown symptoms that were input to CC4, which classified the input data as Zika. MLP-2 is therefore trained to detect Zika for both known and previously unknown symptoms; it then goes online and MLP-1 goes offline. The active MLP is thereby updated so that it can detect the new symptom, which is no longer an unknown symptom.

Pseudocode: for all the outputs,

do

    create the data frame for the outputs generated by CC4 and MLP

        DataFrame dataFrame = sqlContext.createDataFrame(data, LabeledPoint.class)

    map the outputs from CC4 and MLP

        val cc4 = Seq("0", "1").map(Tuple1.apply)
        val mlp = Seq("0", "1").map(Tuple1.apply)

    compare both output values

        if (cc4 == 0 && mlp == 0)
            output "0" indicating presence of Zika
        else if (cc4 == 0 && mlp == 1)
            output "1" indicating non-Zika
        else if (cc4 == 1 && (mlp == 0 || mlp == 1))    // indicates unknown symptom
            call MLP-2 and implement MLP separately, train the unknown symptom and update [19]

                MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
                // train the data
                MultilayerPerceptronClassificationModel model = trainer.fit(train)

3.6 Data Visualization

The neural network only detects the presence of Zika virus; it gives no information about the regions to which the virus has spread, the timeline, or the symptoms. We therefore generate visual representations of the results: the geographical extent, obtained by extracting the locations in our dataset to provide an activity map (Zika virus in different locations); a text model that shows the symptoms; and the temporal extent, which tracks the volume changes of the Zika tweets over time.

3.6.1 Geographical Extent

The goal of visualizing the geographical extent is to track the spread of Zika virus by geographic region using the Zika tweets. A tweet or a user can have two types of location: a text-based user profile location or a sensor-based geolocation. The user profile location is free-form text that users enter to declare their home location. The sensor-based tweet location is an actual geolocation (with longitude and latitude values) provided with a tweet [6]. We consider the user profile location and ignore invalid information, such as any text other than a country or state name. The location information is chosen based on the country name and state name.

Tableau is used to show the map distribution. Once we connect to the Tableau server, the steps are:

• Take the output from the neural networks and create a CSV file that contains the information about Zika virus detection.

• Connect to the data using our input .csv file.

• A work dashboard loads for our input data.

• Assign the country name a geographical role.

• The map is generated based on the longitude and latitude values of each country/state name.

• Filter the unrecognized data by selecting the respective country/state name in place of the default location given.

• In Measures, select the value, i.e., the number of Zika tweets for each location.

• A map is generated for all the countries, and for each country with its respective states.
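The first step of this list, preparing a CSV file of tweet counts per location, could look roughly like the following Java sketch. The class and method names, the set of valid location names, and the CSV column names are all assumptions made for illustration; the actual file layout used with Tableau is not specified here.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch: count Zika tweets per recognized profile location and render CSV (names are hypothetical). */
final class LocationCsvSketch {
    /** Counts tweets per recognized location; any other profile text is ignored. */
    static Map<String, Integer> countByLocation(List<String> profileLocations, Set<String> validNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : profileLocations) {
            String name = raw == null ? "" : raw.trim();
            if (validNames.contains(name)) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Renders the counts as CSV text with a header row (column names are illustrative). */
    static String toCsv(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder("location,zika_tweets\n");
        counts.forEach((name, n) -> sb.append(name).append(',').append(n).append('\n'));
        return sb.toString();
    }
}
```

A sorted map is used so that the rows appear in a stable, alphabetical order, which makes the generated file easier to inspect before it is loaded into a visualization tool.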

3.6.2 Text Model

The goal of text modeling is to find useful health information. We are interested in investigating Zika symptoms [6]. As discussed in the dataset collection, we create a keyword list. For example, the keyword list for Zika virus includes fever, muscle pain, conjunctivitis, joint pain, microcephaly, and headache. For each symptom, we count the number of tweets that contain it. We can then create pie charts and bar charts using the visualization tool Tableau.
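Counting the tweets per symptom keyword can be sketched as follows. This is a simplified illustration with hypothetical names; it assumes plain case-insensitive substring matching, which may differ from the matching used when the dataset was collected.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: count the tweets that mention each symptom keyword (names are hypothetical). */
final class SymptomCountSketch {
    /** For each keyword, counts the tweets whose text contains it, ignoring case. */
    static Map<String, Integer> countTweetsPerSymptom(List<String> tweets, List<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            String needle = keyword.toLowerCase(Locale.ROOT);
            int n = 0;
            for (String tweet : tweets) {
                if (tweet.toLowerCase(Locale.ROOT).contains(needle)) {
                    n++;
                }
            }
            counts.put(keyword, n);          // keywords with no matches are kept with a count of 0
        }
        return counts;
    }
}
```

A tweet that mentions several symptoms is counted once for each of them, so the per-symptom counts can sum to more than the number of tweets.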

3.6.3 Temporal Model

The goal of temporal modeling is to track the volume changes of the Zika tweets over time. The volume change of the keyword 'Zika' over time is a good reflection of the change in Zika activity level over time [6]. For the temporal model, we count the number of Zika virus related tweets generated. This data is used to create the Zika activity level timeline.
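The timeline data can be derived by grouping tweet dates by month, as in the sketch below. The class and method names are hypothetical, and the sketch assumes each tweet has already been reduced to its posting date.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: monthly tweet counts and shares for an activity timeline (names are hypothetical). */
final class TimelineSketch {
    /** Counts tweets per calendar month, in chronological order. */
    static Map<YearMonth, Integer> countByMonth(List<LocalDate> tweetDates) {
        Map<YearMonth, Integer> counts = new TreeMap<>();
        for (LocalDate date : tweetDates) {
            counts.merge(YearMonth.from(date), 1, Integer::sum);
        }
        return counts;
    }

    /** Share of each month in the total number of tweets, in percent. */
    static Map<YearMonth, Double> percentByMonth(Map<YearMonth, Integer> counts) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<YearMonth, Double> shares = new TreeMap<>();
        counts.forEach((month, n) -> shares.put(month, total == 0 ? 0.0 : 100.0 * n / total));
        return shares;
    }
}
```

Keying the map by YearMonth rather than by month name keeps the entries in chronological order and avoids mixing the same month from different years.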

CHAPTER 4

FINDINGS

4.1 Accuracy of Best Percentage Match before Training Unknown Symptoms

While testing the detection of Zika virus using CC4, a best percentage match below 80% (the threshold) is a sign that there is an unknown symptom in the data. To test this, data containing some previously unknown symptoms was input to the CC4 network. Our results show that the best percentage match fell below 80%, signaling an unknown symptom. We then trained the MLP for these unknown symptoms. After training, the best percentage match with these new symptoms rose above 80%, indicating that the MLP is trained for the new symptoms. The previously unknown symptoms in the data include symptoms such as microcephaly.

Figure 4.1: Best Match Graph of CC4 Neural Network before Training

Figure 4.2: Best Match Graph of CC4 Neural Network after Training

4.2 Accuracy of Radius of Generalization

CC4 provides instantaneous results with minimal training. CC4 uses a radius of generalization (ROG) when calculating the weight of the bias neuron, which is how it differentiates trained symptoms from new symptoms. We tested the accuracy for different user-defined radius of generalization values, ROG = 0, 1, 2. Each radius value was trained and tested with different numbers of inputs. We found that CC4 gives the best accuracy when ROG = 2. The x-axis is the input data, which contains symptoms for both Zika and non-Zika.
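A short derivation may help show why the bias weight r - s + 1 acts as a radius. It assumes the usual CC4 conventions (input weights of +1 and -1, a constant bias input of 1, and a hidden neuron that fires when its net input is positive). The symbols a, b, and h are introduced here only for illustration: for a stored training vector with s ones and a test vector, a is the number of positions where both are 1, b is the number of positions where the stored bit is 0 and the test bit is 1, and h is the Hamming distance between the two vectors.

```latex
\begin{align*}
\text{net} &= a - b + (r - s + 1), \\
h &= (s - a) + b \quad\Longrightarrow\quad a - b = s - h, \\
\text{net} &= (s - h) + (r - s + 1) = r - h + 1 .
\end{align*}
```

Under these assumptions the hidden neuron fires (net > 0) exactly when h is at most r. With ROG = 0 only an exact match of a stored vector fires, while with ROG = 2 inputs that differ from a stored vector in up to two bits still fire.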

[Chart: CC4 Accuracy with selection of Radius of Generalization. x-axis: Total number of viruses (0 to 120); y-axis: Percentage Accuracy (0 to 120); series: ROG=0, ROG=1, ROG=2]

Figure 4.3: Accuracy of CC4 for different ROG

4.3 False Positives and False Negatives

Recognition systems are not always 100% accurate, since they are trained using data that may or may not match the data used for testing. Therefore, false positives and false negatives are possible. In this thesis, we use the CC4 and MLP neural networks to detect Zika based on symptoms related to Zika. To test for false positives and false negatives, we use 400 random inputs from the dataset, comprising 180 with known symptoms and 220 with unknown symptoms. The MLP neural network shows 0% false positives and 0% false negatives, as shown in Figure 4.4. The CC4 neural network has 5.12% false positives and 0% false negatives, as shown in Figure 4.5.
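False positive and false negative rates can be computed from predicted and actual labels as in the following sketch. The names are hypothetical, and the definitions are assumptions: Zika (coded 0) is treated as the positive class, the false positive rate is taken relative to the actual non-Zika inputs, and the false negative rate relative to the actual Zika inputs. The percentages reported above may have been computed against a different base.

```java
/** Sketch: false positive and false negative rates for 0/1 outputs (names are hypothetical). */
final class ErrorRateSketch {
    static final int ZIKA = 0;   // coding used in the text: 0 = Zika, 1 = non-Zika

    /** Returns {falsePositiveRate, falseNegativeRate} in percent, with Zika as the positive class. */
    static double[] rates(int[] predicted, int[] actual) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int fp = 0, fn = 0, positives = 0, negatives = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean actualZika = actual[i] == ZIKA;
            boolean predictedZika = predicted[i] == ZIKA;
            if (actualZika) positives++; else negatives++;
            if (predictedZika && !actualZika) fp++;   // reported Zika, actually non-Zika
            if (!predictedZika && actualZika) fn++;   // missed an actual Zika input
        }
        double fpr = negatives == 0 ? 0.0 : 100.0 * fp / negatives;
        double fnr = positives == 0 ? 0.0 : 100.0 * fn / positives;
        return new double[] { fpr, fnr };
    }
}
```

For example, if one of four actual non-Zika inputs is reported as Zika and no actual Zika input is missed, the sketch returns a false positive rate of 25 and a false negative rate of 0.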

[Chart: False Positive and False Negative for MLP. x-axis: Number of input values (0 to 450); y-axis: Output (0/1)]

Figure 4.4: False Positives and False Negatives for MLP

[Chart: False Positive and False Negative for CC4. x-axis: Number of input values (0 to 450); y-axis: Output (0/1)]

Figure 4.5: False Positives and False Negatives for CC4

4.4 Visualization

We obtained the Zika detection results in the form of 0 and 1, which indicate the presence of Zika and non-Zika virus, respectively. To visualize these results and provide them to health care providers in an efficient manner, we have developed three visualization models:

• Geographical Extent

• Text Model

• Temporal Extent

4.4.1 Geographical Extent Analysis

We generated the country/state name and the percentage of Zika tweets in the given country/state. The color shade varies by country: darker locales show a high percentage of Zika virus, and lighter locales show a low percentage of Zika virus spread. In Tableau, the percentage of Zika present is shown when the cursor is placed on a country/state name.

Figure 4.6: Geographical Extent of Zika worldwide

The map below shows the distribution in Brazil, with the percentage of Zika virus present and the state name. The least spread of the virus is in the states of Roraima and Rio_de_Janeiro, with 0.03%. Bahia has the highest percentage of Zika spread in Brazil, 14.39%.

Figure 4.7: Geographical Extent of Zika in Brazil

The map below shows Zika activity in Colombia with valid state names. The highest percentage spread of Zika virus in a single state in Colombia is 18.5%, whereas in some states Zika is not present at all.

Figure 4.8: Geographical Extent of Zika in Colombia

4.4.2 Text Model Analysis

In text analysis, we revealed the percentage of each Zika symptom by investigating the contents of the tweet results from the neural networks. From the pie chart produced with Tableau, it is observed that the microcephaly symptom has the highest number of tweets generated (40.68%), followed by fever (12.30%) and rash (12.13%).

Figure 4.9: Text Analysis of Zika

The chart below shows the major symptoms that affected each country. Brazil has more tweets on the microcephaly symptom. The chart also shows that, among all the major symptoms that affected each country, microcephaly generated the most tweets. A darker color indicates a greater number of Zika-related tweets, and a lighter color a smaller number. The dark blue values range from 9866 to 5110, the middle blue from 5110 to 2500, and the light blue from 2500 to 24.

Figure 4.10: Major Symptom affected - Country

4.4.3 Temporal Extent Analysis

We counted the number of Zika-related tweets generated from January to July with the associated keywords. This data is used to create the Zika activity level timeline. The chart below shows each month's percentage of the total number of tweets. January has the lowest extent of Zika spread. There is a drastic change in the number of Zika-related tweets generated from April to May.

Figure 4.11: Zika Timeline Activity

CHAPTER 5

CONCLUSION

An efficient real-time Zika detection system has been developed using Apache Spark. The detection mechanism in the proposed model provides instantaneous and accurate results because it uses the CC4 instantaneous neural network and the multilayer perceptron neural network. Different models have been developed to visualize the geographical, text, and temporal extent of the spread of Zika.

For future work, neural network techniques can be applied to detect the spread of other types of diseases. Instead of Twitter data, datasets that include medical and behavioral interventions can be considered to provide more accuracy. This work can also be implemented in the real world. Other visualization tools, such as QlikView, can be used with higher-dimensional data.

REFERENCES

[1] Zika Virus, https://www.cdc.gov/Zika/healtheffects/birth_defects.html, November 22

2016.

[2] Oscar Pacheco, Maurico Beltran. Zika Virus Disease in Colombia – Preliminary Report.

The New England Journal of Medicine, DOI: 10.1056/NEJMoa1604037, June 15 2016.

[3] Chibueze EC, Tirado V, da Silva Lopes K, Balogun OO, Takemoto Y, Swa T. Zika

virus infection in pregnancy: a systematic review of disease course and complications.

Bulletin World Health Organ., June 9 2016.

[4] Zuiyuan Guo, Dan Xiao, Dongli Li, Xiuhong Wang, Yayu Wang, Tiecheng Yan, Zhiqi Wang. Predicting and Evaluating the Epidemic Trend of Ebola Virus Disease in the 2014-2015 Outbreak and the Effects of Intervention Measures. http://dx.doi.org/10.1371/journal.pone.0152438, April 6, 2016.

[5] Cory W. Morin, Andrew C. Comrie, Kacey Ernst. Climate and Dengue Transmission:

Evidence and Implications, Environmental Health Perspectives, Vol 121, Issue 11-12,

November-December 2013.

[6] Kathy Lee, Ankit Agrawal, Alok Choudhary. Real-Time Digital Flu Surveillance using

Twitter Data. Proceedings of the 19th ACM SIGKDD international conference on

Knowledge discovery and data mining (KDD ’13), pp. 1474-1477, 2013.

[7] Igor Aleksander, Helen Morton, An Introduction to Neural Computing, Intl. Thomson

Computer Pr(T), October 1995.

[8] Artificial neuron. https://en.wikipedia.org/wiki/Artificial_neuron. Last accessed October

29, 2016.

[9] Kaushik Bose, An Introduction to Artificial neural network,

https://www.academia.edu/7468404/Artificial Neural Network, last accessed October 29,

2016.

[10] Goutam Mylavarapu. Instantaneous Intrusion Detection System. Master’s thesis,

Department of Computer Science, Oklahoma State University, 2015.

[11] Subhash Kak. New Algorithms for training feedforward neural networks. Pattern

Recognition Letters, Vol 15, No.3, 1994.

[12] Sumanth Reddy. Generalization and Efficient implementation of CC4 Neural Network. Master’s thesis, Department of Computer Science, Oklahoma State University, 2008.

[13] Kun-Won Tang and Subhash C. Kak. A new corner classification approach to neural network training. Circuits, Systems and Signal Processing, Vol. 17, No. 4, pp. 459-469, 1998.

[14] Rohit Pillay. Instantaneous Intrusion Detection System. Master’s thesis, Department of

Computer Science, Oklahoma State University, 2010.

[15] Apache spark. https://spark.apache.org/. Last accessed 29 November 2016.

[16] Wikipedia. Spark https://en.wikipedia.org/wiki/spark. Last accessed November 18, 2016.

[17] Holden Karau, Andy Konwinski, Parick Wendell and Matei Zaharia. Getting Started with

Spark. O’Reilly, 2015.

[18] Zika, http://www.cdc.gov/Zika/about/overview.html . Last accessed October 11, 2016.

[19] Durga Amruth Sagar, Real-Time Zika Virus Detection System With Known Symptoms

and Prediction. Master’s thesis, Department of Computer Science, 2016.

[20] Tableau. https://interworks.co.uk/business-intelligence/why-tableau/. Last accessed May

30, 2016.

[21] Katherine Noyes, Tableau’s new BI analytics Suite.

http://www.cio.com/article/2900253/six-features-coming-to-tableaus-new-bi-analytics-

suite.html. Last accessed March 20, 2016.

VITA

Srinagavalli Nandigam

Candidate for the Degree of

Master of Science

Thesis: REAL TIME ZIKA VIRUS DETECTION SYSTEM WITH UNKNOWN


SYMPTOMS AND VISUALIZATION

Major Field: COMPUTER SCIENCE

Biographical:

Education:

Completed the requirements for the Master of Science in Computer Science at


Oklahoma State University, Stillwater, Oklahoma in December, 2016.

Completed the requirements for the Bachelor of Technology in Information


Technology at Hindustan College of Engineering, Chennai, India in May, 2011.

Experience:

Professional Memberships:
