
NORTHWESTERN UNIVERSITY

Mining Social Media for Healthcare Intelligence

A DISSERTATION

SUBMITTED TO THE GRADUATE SCHOOL

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

for the degree

DOCTOR OF PHILOSOPHY

Field of Computer Science

By

Kathy Lee

EVANSTON, ILLINOIS

December 2017






© Copyright by Kathy Lee 2017

All Rights Reserved



ABSTRACT

Mining Social Media for Healthcare Intelligence

Kathy Lee

Social media such as Twitter has risen as a powerful new communication medium

for disseminating information on news, personal interests, experiences, and opinions. On

social media, people talk about their lifestyles, health conditions, and symptoms, search

for information on treatment options, and connect with people who have been through simi-

lar medical experiences to get emotional support. Such health information generated by

patients or family members is not available in medical documents created by health care

providers and became publicly available only recently with the prevalent use of microblog-

ging sites, which makes social media an invaluable source of health data to mine. However,

social media data is often short, unstructured, and written in colloquial language, and

these characteristics pose many interesting research questions.

In this thesis, we focused on mining public Twitter data for healthcare intelligence.

We designed models based on bag-of-words and social network structure features that

classify trending topics into general categories such as sports, technology and health.

This model could help identify trending topics and posts in the health domain and benefit

information retrieval tasks by reducing the search space to a domain of interest. We

also proposed a real-time digital disease surveillance system that uses spatial, temporal,

and text mining techniques to track disease activities. Our work was motivated by the

fact that, while traditional disease surveillance systems require 1-2 weeks to collect

and process data before it becomes publicly available, Twitter data is available in near

real time, and the aggregated social media data can provide an overall view of the health state of

the general population earlier than the traditional disease surveillance systems can. We

further built a neural network model that combines Twitter data with the observed data

from Centers for Disease Control and Prevention (CDC) to predict current and future

influenza activities. Our system can serve as a proxy for early detection of pandemics and

the resulting insights are expected to help facilitate faster response to and preparation

for epidemics. We also investigated the use of clinical knowledge sources to train deep

learning models for medical concept normalization in which health conditions described

in natural (colloquial) language are mapped to a standard clinical term. The proposed

model can help an automatic system to effectively interpret health concepts written in

layman’s language.

The studies presented in this thesis provide interesting insights into the application

of machine learning and text mining on social media data in the healthcare domain. We

hope our work motivates further study of online user-generated data to gain meaningful

healthcare insights.

Acknowledgements

I would like to thank Prof. Alok Choudhary for advising this research and for his

constant guidance and valuable feedback. He has inspired me to work on research problems

in the social media and healthcare domains. He has always been a steadying force helping me maintain

momentum in my research. I would also like to express my deep gratitude towards my

thesis committee members Prof. Wei-keng Liao and Prof. Ankit Agrawal.

I wish to thank my parents for their endless love and sacrifice, and providing me with

the best education. I dedicate this thesis to them. I also wish to thank my husband Seung

Woo and my two children Daniel and Ashley for patiently supporting me. Without their

encouragement and dedication, I would not have been able to successfully finish this long

journey.

Lastly, I would like to thank all members of the Center for Ultra-scale Computing and

Information Security (CUCIS) lab at Northwestern University for collaborating with me

and providing invaluable intellectual support.



Table of Contents

ABSTRACT 3

Acknowledgements 5

Table of Contents 6

List of Tables 10

List of Figures 13

Chapter 1. Introduction 18

Chapter 2. Twitter Trending Topic Classification 22

2.1. Introduction 22

2.2. Related Works 25

2.3. Data and Methods 26

2.3.1. Data Collection 27

2.3.2. Labeling 28

2.3.3. Data Modeling 30

2.3.3.1. Text-based Data Modeling 30

2.3.3.2. Network-based Data Modeling 31

2.3.4. Machine Learning 33

2.4. Experiments and Results 34



2.4.1. Text-based classification 34

2.4.2. Network-based classification 35

2.5. Summary 36

Chapter 3. Mining Social Media Streams to Improve Public Health

Allergy Surveillance 38

3.1. Introduction 38

3.2. Our Approach 40

3.2.1. Datasets 40

3.2.1.1. Twitter dataset 40

3.2.1.2. Ground Truth Data 40

3.2.2. Methodology 41

3.2.2.1. Data Preprocessing 41

3.2.2.2. Data Classification 41

3.2.2.3. Text Mining 44

3.2.2.4. Spatio-temporal Mining 46

3.3. Experimental Results 48

3.3.1. Text Analysis 48

3.3.2. Spatio-Temporal Analysis 50

3.4. Related Works 55

3.5. Summary 57

Chapter 4. Real-Time Digital Disease Surveillance using Twitter Data:

Demonstration on Flu and Cancer 59



4.1. Introduction 59

4.2. System Description 60

4.2.1. Geographical Analysis 62

4.2.2. Temporal Analysis 63

4.2.3. Text Analysis 66

4.3. Summary 67

Chapter 5. Forecasting Influenza Levels using Real-Time Social Media Streams 68

5.1. Introduction 68

5.2. Related Work 70

5.3. Method 72

5.3.1. Dataset 72

5.3.2. Data Preprocessing 73

5.3.3. Feature Selection 74

5.3.4. Predictive Modeling 77

5.4. Results 77

5.5. Summary 81

Chapter 6. Medical Concept Normalization 82

6.1. Introduction 82

6.2. Related Work 85

6.2.1. Social Media for Healthcare 85

6.2.2. Deep Neural Network Models 85

6.2.3. Concept Normalization 86



6.3. Model Description 87

6.3.1. Convolutional Neural Network (CNN) 87

6.3.2. Recurrent Neural Network (RNN) 88

6.4. Experimental Setup 90

6.4.1. Data 90

6.4.2. Data Sources for Word Embedding 92

6.4.2.1. Thesaurus (TH) 93

6.4.2.2. Medical Dictionary (MD) 94

6.4.2.3. Clinical Texts (CT) 94

6.4.2.4. Health-related Tweets (HT) 97

6.5. Results 97

6.5.1. Ablation Study 98

6.5.2. Qualitative Analysis 100

6.6. Summary 101

Chapter 7. Conclusion and Future Research Work 102

References 104

List of Tables

2.1 Five most similar topics of topic “macbook” in class technology. 33

3.1 Tweets with positive and negative labels. A tweet is positive if it talks

about the author or someone around the author having allergy. A tweet

is negative if it is a question or talks about news, general awareness or

information about allergies. 42

3.2 Classification performance of various classifiers using 10-fold cross

validation. The best classification performance (F-measure of 0.811 and

ROC area of 0.905) was obtained using NaiveBayesMultinomial (NBM). 42

3.3 A list of most frequently used bigrams where the second word is allergy,

ranked by frequency of use in the entire allergy corpus. It includes many

actual allergy types that have the ‘noun noun’ POS tag. 45

3.4 30 most frequently mentioned allergy types automatically extracted

by our algorithm. Numbers indicate the frequency rank of the 2-gram

in the allergy corpus, and +/- signs indicate whether it is

an actual allergy type (+) or not (-). 26 out of 30 were true positives

achieving a precision of 86.7%. 47



3.5 Most prevalent food allergies. The rank of the most prevalent food

allergies extracted from Twitter data is very similar to that obtained

from actual allergy patients’ data. 47

5.1 Examples of flu-related tweets. 74

5.2 CDC and Twitter features used in flu prediction model. 75

5.3 Twitter data improves prediction performance. 76

5.4 Comparison of current flu forecast model’s performance when different

learning rates and a varying number of hidden layers and hidden units

are used. The highest correlation of 0.9559 was obtained using learning

rate λ = 0.2 and one hidden layer with 4 activation units. 76

5.5 Comparison of 1-week ahead flu forecast model’s performance when

different learning rates and a varying number of hidden layers and

hidden units are used. The highest correlation of 0.929 was obtained

using learning rate λ = 0.2 and one hidden layer with 4 activation units. 76

6.1 Medical concepts in UMLS and example social media phrases that

describe the medical concept 83

6.2 Data Statistics after removing duplicates from the combined training,

validation, and test data 90

6.3 Examples of phrases with multiple labels 91

6.4 Data Statistics after removing concepts that had less than five examples 92

6.5 Medical concepts and similar words based on cosine similarity obtained

from word embeddings built with different health-related text corpora. 96

6.6 Classification Accuracy (%) using 10-fold cross validation (TH =

thesaurus, MD = medical dictionary, CT = clinical texts, HT =

health-related tweets, batch size = 50, number of epochs = 100, vector

dimension = 300) 97

6.7 Ablation Study. Comparison of models’ accuracy (%) when a feature is

removed from all possible feature sets (TH = thesaurus, MD = medical

dictionary, CT = clinical texts, HT = health-related tweets). The

numbers in parentheses indicate the performance drop when the feature

is removed. 99

6.8 TwADR-L examples that should have multiple labels 100



List of Figures

2.1 Tweets related to Trending Topic Boone Logan. 23

2.2 System Architecture. 26

2.3 Web interface deployed for manual labeling. Annotators read the trend

definition and tweets before labeling trending topics as one of the 18

classes. 28

2.4 Distribution of 768 topics across 18 classes. 29

2.5 Word cloud of trending topics in technology class 30

2.6 Trending topic “macbook” and its 5 similar topics 32

2.7 Text-based accuracy comparison over different classification techniques. 34

2.8 Network-based accuracy comparison over different classification

techniques. 36

3.1 Time-series graph of daily allergy levels detected in tweets (February

2013 - April 2015). Only those allergy-related tweets labeled as positive

are used to create the graph. The graph illustrates the general allergy

level trend over time. The allergy level is the highest in mid–May, goes

down in June and July, starts rising again in August, and reaches its

local maximum point in mid–September. Similar seasonal patterns are

observed in both 2013 and 2014. 49

3.2 Monthly average data for allergy tweet count (blue), daily highest

temperature (green), and pollen level (red) for Washington state (March

2013 – April 2015). Pollen level is highly correlated with ∆temperature

(correlation of 0.776) and ∆tweet count (correlation of 0.706). Tweet

count is also strongly correlated with temperature (correlation of

0.668). 50

3.3 Monthly distribution of mentions of peanut and pollen allergies (March

2013–April 2015). A huge seasonal variation is observed in monthly

pollen allergy (a seasonal allergy) level compared to that of peanut

allergy (a food allergy). 51

3.4 Time-series graph of tweet count for various allergy symptoms (Feb

2013–Sep 2014). The most common allergy symptom is sneezing (blue

line) throughout the year, followed by cough (green) and runny nose

(sky blue). 53

3.5 Distribution of allergy tweets with geolocations. The seasonal pattern

of allergy levels across the U.S. is clearly visible. Allergy level is the highest

in spring and the lowest in winter. 54

3.6 Bar chart comparing monthly social-media-sensed peanut and gluten

allergy levels for each U.S. state. The tweet count is normalized by state

census population and scaled to range between 0 and 100. In most US

states, peanut allergy level is higher than gluten allergy level. 55

4.1 Real-Time Disease Surveillance System continuously downloaded flu

and cancer related tweets and applied geographical, temporal, and

text mining. The real-time analysis data was visually reported as

U.S. disease activity maps, timelines, and pie charts on our project

websites [15][16]. 61

4.2 Our Real-Time Digital Flu Surveillance Website [16]. The ‘Daily Flu

Activity’ chart was an output of the temporal analysis and showed

the volume changes of tweets mentioning the word ‘flu’ over time.

The dramatic increase of flu tweet volume from Jan. 6 to Jan. 12

coincided with the dates when the major U.S. newspapers reported

Boston Flu Emergency [21] and deaths of four children from the AH3N2

influenza outbreak [20]. The ‘U.S. Flu Activity Map’ was an output

of the geographical analysis and showed the weighted percentage of

tweet volumes mentioning ‘flu’ by states. The level of flu activity was

differentiated by different colors for an easy comparison of U.S. regional

flu epidemics. 62

4.3 Flu Symptoms Timeline. The timeline displays tweet volume changes

mentioning different flu symptoms from January through March 2013.

‘Cough’ (green line) and ‘fever’ (dark orange line) reach their highest

level in mid-January and decrease as the actual national ILI level reported by

CDC decreases. 64

4.4 Distribution of Cancer Types in Tweets. 65

4.5 Distribution of Cancer Symptoms in Tweets. 65

4.6 Distribution of Cancer Treatments in Tweets. 65

4.7 Most Frequent Words in Flu Tweets. 66

5.1 Data collection and modeling process. Disambiguation, filtering and

network analysis were performed on continuously downloaded flu-related

tweets. Weekly time-series flu-related tweet counts were computed after

data was smoothed out to align with CDC data. Current and 1-week

ahead flu prediction models were built. 73

5.2 Data available at current week t. At the end of week t, all flu-related

Twitter data collected during current week t and prior are available. At

time t, the CDC data for the past two weeks (Wt−1 and Wt) is not available, as

CDC’s collection, retrospective analysis, and reporting take two weeks. 75

5.3 Structure of multilayer perceptron used in our influenza activity forecast

model. 78

5.4 Comparison of our current and 1-week ahead U.S. influenza activity

forecast results against CDC and Google Flu Trends data. For current

week prediction, a correlation coefficient of 0.9522 over 52 training data points

and a correlation coefficient of 0.929 over 19 held-out test data points

were obtained. For 1-week ahead forecast, a correlation coefficient of



0.895 over 52 training data points and a correlation coefficient of 0.71 over 19

previously unseen test data points were obtained. 79

6.1 Generic convolutional neural network architecture. 87

6.2 Generic recurrent neural network architecture. 89

6.3 Definition, example sentence, synonyms, related words, near antonyms

and antonyms for the word ‘sore’ obtained from Merriam-Webster

Thesaurus. 93

6.4 Medical definition of the term ‘myalgia’ obtained from Merriam-Webster

Medical Dictionary. 93

CHAPTER 1

Introduction

Social media has gained popularity as a new means for information sharing in the

last decade. The rise of social media along with advancements of mobile technologies

such as smart phones and tablets has changed communication patterns among friends

and families.

Twitter is one of the largest microblogging social networks, where people post short

text messages called tweets. Users can subscribe to receive tweets by following other

users they are interested in. If user A selects to receive all tweets posted by user B, A

is called a follower and B is called a friend of A. User A can follow user B back, but is

not obligated to do so. Users can select to share information publicly or privately within

small social circles. By default, tweets are publicly viewable by others unless the user

sets his/her Twitter account private, which makes Twitter a great real-time resource for

information search, where the latest news and events can be found faster than through any

other medium. People generally like to learn about news at the exact moment it is happening,

read and write information at their convenience, and search for what they want to know.

Users create hashtags, a pound sign (#) followed by a word or un-spaced phrase, to

dynamically tag user-generated posts, which makes searching tweets on a specific topic or

theme easy. Retweeting is a unique feature of Twitter that allows users to conveniently share

information with their followers, thereby letting information propagate faster

than other traditional media. On social media, users share news, events, experiences,

and opinions on various topics. The language used on Twitter has several distinctive characteristics.

While the 140-character limit on tweet text makes it fun and exciting for users to post a

tweet, it also makes the language short, noisy, and prone to misspellings and to frequent

use of emojis and acronyms. Also, users often mix multiple languages within the same post.

These characteristics pose many challenges for an automated system trying to accurately interpret

the meaning of such messages. These are relatively new problems generated by the unique ways of

interacting and communicating on social media.

Social media has a wide scope of applications. In business, it can be used for brand

awareness, targeted marketing, customer engagement and product reviews. In politics,

social media has been widely used for presidential campaigns, fundraising, and to measure

public opinions. In healthcare, patients use social media and online health forums to

search medical answers, seek medical advice on treatments, and to connect with other

patients for emotional support.

Twitter tracks trending topics to identify popular topics of discussion. Trending topics

can be unique to a specific geographic location or time, and the popularity is measured

by the volume of tweets mentioning specific keywords or hashtags. We classify trending

topics into general categories such as sports, news, music, science, technology, health, and

so on, to provide readers more context and help narrow down the search space. We explore

social network features (Twitter friend/follower network structure) as well as traditional

n-gram features for trending topic classification.

Mining social media for healthcare insights is a relatively new research area that

has emerged with the rapid growth of microblogging services in the last decade. We

built a real-time digital disease surveillance system that constantly collects, analyzes, and

visualizes the aggregated data. We studied distribution of disease types, symptoms and

treatments social media users talk about on three common diseases: cancer, allergy, and

influenza.

Cancer is a disease that involves abnormal cell growth and is among the leading causes

of death worldwide.1 In 2017, 1,688,780 new cancer cases and 600,920 cancer deaths are

projected to occur in the United States [94]. Allergy is another common disease a large

population suffers from; it is caused by hypersensitivity of the immune system driven by genetic

and various environmental factors. Roughly 7.8% of people aged 18 and over in the U.S. have

hay fever, a common allergic condition also known as allergic rhinitis.2 Prior studies have

shown that allergy symptoms are highly associated with lost work productivity [64]. Early

detection and treatment support can help reduce lost work productivity and potentially

reduce health care costs. Influenza is one of the most common viral infections; it

affects the lungs, nose, and throat. It is a contagious disease with symptoms similar to those of a

cold but usually more severe and longer lasting, and it can cause various complications leading

to death. In recent years, influenza activity tracking using social media has been a very

active area of research, following Google Flu Trends, which estimates the prevalence of influenza

activity using aggregated Google search query log data. Early detection of rising influenza levels

can help reduce the impact of a pandemic and provide more time to prepare an

emergency response. The Centers for Disease Control and Prevention (CDC) collects and

reports the prevalence of influenza-like illness (ILI) based on physician visit data across

the country with a two-week time lag. We explored using Twitter posts mentioning

1 https://www.cancer.gov/about-cancer/understanding/statistics
2 http://www.aaaai.org/about-aaaai/newsroom/allergy-statistics

symptoms of influenza as a real-time resource to track influenza levels and built neural-

network based real-time and 1-week ahead flu forecast models using both Twitter and

CDC data as features.

Users describe their health conditions and ask questions related to a certain disease or

treatment on social media. However, the colloquial nature of the languages used in social

media makes it difficult to automatically map the medical concepts present in the text to

standard medical terminologies. In addition, various ways of describing the same medical

condition pose an additional challenge for an automated system to understand the con-

texts. By mapping medical concepts in online user-generated texts to standard medical

ontology terms, automatic systems would be able to search relevant clinical resources such

as biomedical literature for clinical question answering, extract treatment information,

and use the aggregated large-scale clinical data to track and detect disease spread for

population health.

This work demonstrates that social media is a useful resource to obtain health-related

information and the aggregated personal health information can be used for population

health management. Our main contributions are building automatic systems that 1)

classify trending topics and posts into general categories to help information search in

a specific domain such as health [1], 2) mine Twitter data as a real-time resource to

monitor disease (allergy, cancer, influenza) activities [2, 3, 4], 3) predict current and

future influenza levels by combining social media data with observed data from CDC

for features [5], and 4) normalize medical concepts described in user-generated texts to

standard medical ontology terms [6].



CHAPTER 2

Twitter Trending Topic Classification

2.1. Introduction

Twitter1 is an extremely popular microblogging site, where users search for timely

and social information such as breaking news, posts about celebrities, and trending topics.

Users post short text messages called tweets, which are limited to 140 characters in length

and can be viewed by the user’s followers. Anyone who chooses to have another user’s tweets posted

on their timeline is called a follower. Twitter has been used as a medium for real-time

information dissemination and it has been used in various brand campaigns, elections, and

as a news medium. Since its launch in 2006, its popularity has been dramatically

increasing. As of June 2011, about 200 million tweets were being generated every day.

When a new topic becomes popular on Twitter, it is listed as a trending topic, which

may take the form of short phrases (e.g., Michael Jackson) or hashtags (e.g., #election).

What the Trend2 provides a regularly updated list of trending topics from Twitter. It is

very interesting to know what topics are trending and what people in other parts of the

world are interested in. However, a very high percentage of trending topics are hashtags,

the name of an individual, or words in other languages, and it is often difficult to understand

what the trending topics are about. It is therefore important to classify these topics into

general categories for easier understanding of topics and better information retrieval.

1 http://www.twitter.com
2 http://www.whatthetrend.com

Figure 2.1. Tweets related to Trending Topic Boone Logan.

The trending topic names may or may not be indicative of the kind of information

people are tweeting about unless one reads the trend text associated with it. For example,

#happyvalentinesday indicates that people are tweeting about Valentine’s Day. A trend

named Boone Logan indicates that tweets are about a person named Boone Logan.

Anyone who does not follow American Major League Baseball (MLB), however, will not

know that the information is regarding Boone Logan, who is a pitcher for the New York

Yankees, unless a few tweets from this trending topic are read, as shown in Figure 2.1.

We found that trend names were not indicative of the information being transmitted

or discussed either due to obfuscated names or due to regional or domain contexts. To

address this problem, we defined 18 general classes: arts & design, books, business, charity

& deals, fashion, food & drink, health, holidays & dates, humor, music, politics, religion,

science, sports, technology, tv & movies, other news, and other. Our goal was to aid users

searching for information on Twitter to look at only a smaller subset of trending topics by

classifying topics into general classes (e.g., sports, politics, books) for easier retrieval of

information.

To classify trending topics into these predefined classes, we proposed two approaches:

the well-known bag-of-words text classification and one based on social network information. In

this paper, we used supervised learning techniques to classify Twitter trending topics.

First, we employed a well-known text classification technique called Naive Bayes (NB)

[73]. NB models a document as the presence or absence of particular words.

A variation of NB is Naive Bayes Multinomial (NBM), which considers the frequency of

words and can be denoted as:

(2.1)        P(c | d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k | c),

where P(c | d) is the probability of a document d being in class c, P(c) is the prior prob-

ability of a document occurring in class c, and P(t_k | c) is the conditional probability of

term t_k occurring in a document of class c. A document d in our case is the trend definition

or the tweets related to each trending topic.
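
To make Eq. (2.1) concrete, here is a minimal sketch of bag-of-words classification with a multinomial Naive Bayes model using scikit-learn (our experiments used Weka and SPSS Modeler; the documents, labels, and query below are illustrative placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative placeholders: one "document" per trending topic
# (trend definition plus downloaded tweets), with one of the 18 class labels.
documents = [
    "apple announces new macbook pro with retina display",
    "yankees pitcher strikes out ten in tonight's mlb game",
]
labels = ["technology", "sports"]

# Bag-of-words term counts; MultinomialNB then estimates P(c) and P(t_k | c)
# and scores each class by P(c) * prod_k P(t_k | c), as in Eq. (2.1).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["new ipad tablet announced"])))
```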

Apart from text-based classification, we also incorporated Twitter social network in-

formation for topic classification. For the latter we made use of topic-specific influential

users [78], which were identified using Twitter friend-follower network. The influence

rank was calculated per topic using a variant of the Weighted Page Rank algorithm [102].

In general, a tweeter is said to have high influence if the sum of the influence of those

following him/her is high. The key idea of the proposed network-based approach was to

predict the category of a topic knowing the categories of its similar topics. Similar topics

were identified using a user-similarity metric, defined as the cardinality of the intersection

of the influential users of two topics t_i and t_j divided by the cardinality of the top s influ-

encers of topic t_i [78]. We experimented using different classifiers, for example, C5.0 (an

improved version of C4.5) [87], k-Nearest Neighbor (kNN) [23], Support Vector Machine

(SVM) [44], Logistic Regression [66], and ZeroR (the baseline classifier), and found that

C5.0 classifier resulted in the best accuracy on our data set. Experimental results showed

that both our approaches effectively classified trending topics with high accuracy, given

that it was an 18-class classification problem. This work was published in [1].
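
For illustration, the user-similarity metric described above might be sketched as follows (a minimal sketch assuming the ranked influential-user lists per topic have already been computed; all names are hypothetical):

```python
def user_similarity(influencers_i, influencers_j, s=100):
    """Similarity of topic t_j to topic t_i: the cardinality of the
    intersection of the two topics' influential users divided by the
    cardinality of the top-s influencers of t_i."""
    top_i = set(influencers_i[:s])
    top_j = set(influencers_j[:s])
    return len(top_i & top_j) / len(top_i)

# Hypothetical influencer lists, ranked by influence score.
print(user_similarity(["u1", "u2", "u3"], ["u2", "u3", "u4"], s=3))  # 0.666...
```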

2.2. Related Works

A number of recent papers have addressed the classification of tweets. Sriram et al. [97]

classified tweets to a predefined set of generic classes such as news, events, opinions, deals,

and private messages based on author information and domain-specific features extracted

from tweets such as presence of shortening of words and slangs, time-event phrases, opin-

ionated words, emphasis on words, currency and percentage signs, “@username” at the

beginning of the tweet, and “@username” within the tweet. Genc et al. [49] introduced

a wikipedia-based classification technique. The authors classified tweets by mapping

messages into their most similar Wikipedia pages and calculating semantic distances be-

tween messages based on the distances between their closest wikipedia pages. Kinsella

et al. [60] included metadata from external hyperlinks for topic classification on a social

media dataset. Whereas all these previous works used the characteristics of tweet texts or

meta-information from other information sources, our network-based classifier used topic-

specific social network information to find similar topics, and used categories of similar

topics to categorize the target topic.

Sankaranarayanan et al. [91] built a news processing system that identified tweets

corresponding to late breaking news. Issues addressed in their work included removing

the noise, determining tweet cluster of interest using online methods, and identifying

relevant locations associated with the tweets. Yerva et al. [103] classified tweet messages

to identify whether they were related to a company or not using company profiles that

were generated semi-automatically from external web sources. Whereas all these previous

works classified tweets or short text messages into two classes, our work classified trending topics

into 18 general classes such as sports, technology, politics, health, etc.

Becker et al. [27] explored approaches for distinguishing tweet messages between real-

world events and non-event messages. The authors used an online clustering technique

to group topically similar tweets together, and computed features that could be used to

train a classifier to distinguish between event and non-event clusters.

There had been a lot of research in sentiment classification of short text messages. Go

et al. [51] introduced an approach for automatically classifying sentiment of tweets with

emoticons using distant supervised learning. Pang et al. [80] classified movie reviews

to determine whether a review was positive or negative. However, none of these works classified

Twitter trending topics.

2.3. Data and Methods


Figure 2.2. System Architecture.



As shown in Figure 2.2, the proposed classification system consisted of four stages:

Data Collection, Labeling, Data Modeling, and Machine Learning. In our experiments, we

used two data modeling methods: (1) Text-based data modeling, and (2) Network-based

data modeling.

2.3.1. Data Collection

The website What the Trend provides a regularly updated list of the ten most popular topics

called “trending topics” from Twitter. A trending topic may be a breaking news story

or it may be about a recently aired TV show. The website also allows thousands of

users across the world to define, in a few short sentences, why the term is interesting or

important to people, which we refer to as “trend definition”. The Twitter API3 allows

high-throughput near real-time access to various subsets of public Twitter data. We

downloaded trending topics and definitions every 30 minutes from What the Trend and

all tweets that contained trending topics from Twitter while the topic was trending.

All the tweets containing a trending topic constituted a document. For example, while

the topic “superbowl” was trending, we kept downloading all tweets that contained the

word “superbowl” from Twitter, and saved the tweets in a document called “superbowl”.

In case a tweet contained more than two trending topics, the tweet was saved in all

relevant documents. For example, if a tweet contained two trending topics “superbowl”

and “NFL”, the same tweet was saved into two documents called “superbowl” and “NFL”.

From the 23,000+ trending topics that we had downloaded since February 2010, we randomly

selected 768 topics as our dataset.


3 https://dev.twitter.com/
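
A minimal sketch of this bucketing step (hypothetical data structures; the actual system polled What the Trend every 30 minutes and streamed tweets via the Twitter API):

```python
from collections import defaultdict

trending_topics = ["superbowl", "NFL"]   # refreshed every 30 minutes
documents = defaultdict(list)            # topic -> tweets mentioning it

def route_tweet(tweet_text):
    # A tweet that contains more than one trending topic is saved
    # into the document of every topic it mentions.
    for topic in trending_topics:
        if topic.lower() in tweet_text.lower():
            documents[topic].append(tweet_text)

route_tweet("Watching the superbowl, best NFL game in years!")
print({topic: len(tweets) for topic, tweets in documents.items()})
# {'superbowl': 1, 'NFL': 1}
```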

Figure 2.3. Web interface deployed for manual labeling. Annotators read
the trend definition and tweets before labeling trending topics as one of the
18 classes.

2.3.2. Labeling

We identified 18 classes for topic classification. The classes were art & design, books,

charity & deals, fashion, food & drink, health, humor, music, politics, religion, holidays

& dates, science, sports, technology, business, tv & movies, other news, and other. Since

Twitter is a primary source of news or information, the news related to political events
Figure 2.4. Distribution of 768 topics across 18 classes.

were classified as politics. If the topic was about news that was not in any of the categories,

it was classified as other news. If the trend definition or tweet text was gibberish or if it

was in a language other than English, then we classified the topic into the other category. The

data was labeled by reading a topic’s trend definition and a few tweets.

We used two annotators to label all topics. In case of disagreement, a third annotator

intervened. For the labeling task, a random sample of 1,000 topics was selected. From

the 1,000, we narrowed the data set down to 768 topics for two main reasons: first, some

topics had no trend definition; second, for some topics the third annotator could not finalize a label.

For each of the 768 topics in our dataset, the five most similar topics were also labeled,

which were required for the network-based modeling as described in Section 2.3.3.2. We

ended up manually labeling 3,005 topics because some of the similar topics were common

to more than one topic. Figure 2.3 shows the web interface we deployed for the labeling

task.

Figure 2.5. Word cloud of trending topics in the technology class.

The distribution of data over the 18 classes is provided in Figure 2.4. The sports

category had the highest number of topics (19.3%), followed by the other category (12%).

Except for categories other news, tv & movies, and music, all other categories contained

less than 6.8% of the topics. Figure 2.5 shows examples of trending topics that were

classified as technology.

2.3.3. Data Modeling

2.3.3.1. Text-based Data Modeling. In order to use text-based document models,

the data, which comprised each topic’s trend definition, tweets, and label, was processed in

two stages. In the first stage, for each topic, a document was created from trend defini-

tion and varying numbers of tweets (30, 100, 300, and 500). From the document text,

all tokens with hyperlinks were removed. This document was then assigned a label corre-

sponding to the topic. In the next stage, the document was run through a string-to-word

vector kernel, which consisted of two components. The first component was the tokenizer

that removed delimited characters and stop words. We used a customized stop words list

catered to Twitter lingo4. The second component transformed the tokens into tf-idf (term

frequency–inverse document frequency) weights [73]. Here, we experimented with up to

top 500 and 1,000 frequent terms per category. For each of the 18 labels, top most fre-

quent words with their tf-idf weights were used to build the dataset for machine learning

in the next step.

2.3.3.2. Network-based Data Modeling. As an alternate to text-based data model-

ing, in network-based data modeling we used Twitter specific social network information.

An interesting aspect of Twitter network structure is that a linkage indicates common

interest between two users and is directed and asymmetric. User A can freely choose to

follow user B without B’s consent and B does not necessarily have to follow A. We used

the algorithm from the User Similarity Model [78] to find the five most similar topics for a trend-

ing topic X. The algorithm used the classes of similar topics that were manually labeled

in Section 2.3.2 to predict the class of topic X. In the user similarity model, topic-specific

influential users were computed using Twitter social network information such as tweet

time, number of tweets made on a topic, and friend-follower relationship. Then, using

4 http://www.twithawk.com


Figure 2.6. Trending topic “macbook” and its 5 similar topics

the number of common influential users between two topics, the most similar topics were cal-

culated. Although the user similarity model captured different dimensions of similarity

such as temporal and geographical, our assumption was that a majority of the similar

topics would fall into the same category as the target topic and hence we could predict

the category of target topic using the categories of its similar topics.

Table 2.1 and Figure 2.6 show an example of the topic “macbook”, its five most similar

topics, and the number of common influential users between topic “macbook” and its similar

topics. Trending topic “macbook” was classified as technology by manual labeling, and

its five most similar topics (“iwork”, “magic trackpad”, “#landsend”, “apple ipad” and

“mobileme”) were manually labeled as technology, technology, charity & deals, technology,

technology. The numbers in Fig. 2.6 indicate the number of common influential users who

tweeted about both “macbook” and its similar topic. The resulting data for machine

learning in this case consists of 768 rows and 19 columns. Each row represents a trending

topic. 18 columns represent 18 classes and the last column represents the class label. Since

topic “macbook” has four similar topics in technology, the sum of the four common-

influential-user counts corresponding to its similar topics in technology (11+11+11+10=43)

becomes the value for row “macbook” and column technology in the table. And the value

corresponding to its similar topic “#landsend” becomes the value for row “macbook” and

column charity & deals.

Table 2.1. Five most similar topics of topic “macbook” in class technology.

Similar Topic Y      Class of Topic Y     Common Influential Users (Topics X and Y)
iwork                technology           11
magic trackpad       technology           11
#landsend            charity & deals      11
apple ipad           technology           11
MobileMe             technology           10
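
A minimal sketch of how one topic’s feature row might be assembled from its similar topics, mirroring the “macbook” example in Table 2.1 (only three of the 18 class columns are shown):

```python
from collections import defaultdict

CLASSES = ["technology", "charity & deals", "sports"]  # 3 of the 18 columns

# Five most similar topics of "macbook": (class, common influential users).
similar_topics = [("technology", 11),       # iwork
                  ("technology", 11),       # magic trackpad
                  ("charity & deals", 11),  # #landsend
                  ("technology", 11),       # apple ipad
                  ("technology", 10)]       # MobileMe

# Sum the common-influential-user counts per class to fill the feature row.
totals = defaultdict(int)
for cls, n_common in similar_topics:
    totals[cls] += n_common

row = [totals[c] for c in CLASSES] + ["technology"]  # last column: class label
print(row)  # [43, 11, 0, 'technology'] -- technology column = 11+11+11+10
```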

2.3.4. Machine Learning

The two datasets constructed as a result of the two approaches in the Data Modeling

stage were used as inputs to the machine learning stage. We built predictive models using

various classification techniques and selected the ones that resulted in the best classifica-

tion accuracy. The experimental results are discussed in the next section.



2.4. Experiments and Results

For our experiments, we used popular tools such as WEKA [100] and SPSS mod-

eler [56]. WEKA is a widely used machine learning tool that supports various modeling

algorithms for data preprocessing, clustering, classification, regression and feature selec-

tion. SPSS Modeler is another popular data mining tool with a graphical user

interface and high prediction accuracy. It is widely used in business marketing, resource

planning, medical research, law enforcement and national security. In all experiments, 10-

fold cross-validation was used to evaluate the classification accuracy. The ZeroR classifier

which simply predicts the majority class was used to get a baseline accuracy.

2.4.1. Text-based classification


Figure 2.7. Text-based accuracy comparison over different classification


techniques.

Using Naive Bayes Multinomial (NBM), Naive Bayes (NB), and Support Vector Ma-

chines with linear kernels (SVM-L), we found that the accuracy of classification

is a function of the number of tweets and the number of frequent terms used. Fig. 2.7 presents the comparison

of classification accuracy using different classifiers for text-based classification. TD repre-

sents the trend definition. Model(x,y) represents the classifier model used to classify topics,

with x number of tweets per topic and y top frequent terms. For example, NB(100,1000)

represents the accuracy using NB classifier with 100 tweets per topic and 1,000 most

frequent terms (from text-based modeling result).

In our experiments, the NB model always provided a lower accuracy than the NBM model

because NBM models word counts and adjusts the underlying probability calculations accordingly. SVM-L per-

formed better than NB but had slightly lower accuracy compared to NBM. If only the trend

definition was used, irrespective of the number of most frequent terms, the accuracy was much

lower for all three classifiers compared to using trend definition plus tweets. The experi-

mental results suggested that the NBM classifier using text from the trend definition, 100 tweets,

and a maximum of 1,000 word tokens per category gave the best accuracy of 65.36%.

2.4.2. Network-based classification

Fig. 2.8 compares classification accuracy of different algorithms for network-based classifi-

cation. Clearly, the C5.0 decision tree classifier gave the best classification accuracy (70.96%),

followed by k-Nearest Neighbor (63.28%), Support Vector Machine (54.349%), and Logis-

tic Regression (53.457%). The C5.0 decision tree classifier achieved 3.68 times higher accuracy

compared to the ZeroR baseline classifier. The 70.96% accuracy was very good consider-

ing that we categorized topics into 18 classes. To the best of our knowledge, the number


Figure 2.8. Network-based accuracy comparison over different classification


techniques.

of classes used in our experiment was much larger than the number of classes used in any

earlier research works (binary classification is the most common).

2.5. Summary

In this paper, we explored two different classification approaches for Twitter trending

topic classification. Apart from using text-based classification, our key contribution is

the use of social network structure rather than just textual information, which

can often be noisy in the context of social media such as Twitter, due to the heavy use

of Twitter lingo and the limit on the number of characters that users are allowed to

generate for their messages. Our results show that the network-based classifier performed

significantly better than the text-based classifier on our dataset. Considering that tweets are not

as grammatically structured as regular document texts, text-based classification using



Naive Bayes Multinomial provides fair results and can be leveraged in cases where we

may not be able to perform network-based analysis.



CHAPTER 3

Mining Social Media Streams to Improve Public Health

Allergy Surveillance

3.1. Introduction

Allergy is the fifth most common chronic disease in the United States1. The complex-

ity and severity of allergic diseases are increasing worldwide [82]. One in five Americans

has either allergy or asthma symptoms. In 2012, 7.5% of adults (17.6 million adults) and

9% of children (6.6 million children) were diagnosed with hay fever [30, 29]. Continuous

use of allergy medication can worsen patients’ health conditions and lead to side effects

and other serious medical complications. Furthermore, an increasing number of allergy

patients drives up allergy-related health care costs and leads to reduced work produc-

tivity. $7.9 million is spent annually on allergy-related health care systems and businesses.

Four million workdays are lost due to hay fever each year. Therefore, accurate allergy

surveillance and forecasting are important to minimize the health care costs and the work

productivity lost due to allergy symptoms.

Twitter, one of the largest social networking websites, allows users to post short text

messages called tweets that can be up to 140 characters in length. Twitter has over

328 million monthly active registered users. Twitter has been used as a valuable real-

time information resource for various applications. For instance, Twitter data have been

1 http://www.webmd.com/allergies/allergy-statistics

used to detect earthquakes in Japan [89], predict the stock market [33] and for an in-

depth study of the 2011 Egyptian Revolution [10]. On Twitter, people not only engage in general

chatter but also share photos, news, opinions, emotions, and even health conditions,

including symptoms and medications they are taking for their diseases. In recent years,

many researchers have investigated using Twitter for disease surveillance, especially for

influenza epidemic detection and prediction [81, 39, 22, 96, 26, 38, 65, 93, 69].

In this paper, we mined large-scale Twitter data collected over 28 months to monitor

allergy levels. More specifically, 1) a bag-of-words supervised learning approach was

employed to distinguish tweets that mentioned actual incidents of allergy from those that

talked about news or general awareness about allergy, 2) text-mining techniques such as

n-gram extraction and part-of-speech tagging were applied to extract predominant allergy

types, and 3) spatiotemporal mining was applied to track allergy levels over time and

space.

We believe that our work is the first framework towards real-time allergy surveillance

using a fine-grained spatiotemporal analysis on large-scale social media data. The data

analysis results reveal that Twitter is an excellent resource for detecting allergy prevalence.

Our proposed system helps reveal past and current trends in allergy levels detected

in the social media stream. The real-time analysis results are updated on our allergy project

website [14]. This work was published in [4].



3.2. Our Approach

3.2.1. Datasets

3.2.1.1. Twitter dataset. We collected allergy-related tweets from the public tweet stream

using Twitter’s streaming API2. We collected over 6.3 million tweets that mentioned

‘allergy’ or ‘allergies’ created by over 3.1 million unique users over 28 months from January

2013 to April 2015. Some talked about their allergy symptoms (e.g., Walked out of my

house confused as to why my eyes felt like they were on fire and then I realized it’s allergy

season.) while others talked about allergy types (e.g., I sneezed like eight times in a row.

This pollen allergy is killing me.) or allergy treatments/medication they took (e.g., sitting

in doctor’s office just to get an allergy shot.).

3.2.1.2. Ground Truth Data.

Pollen dataset. We collected monthly average pollen levels and 90-day historical pollen

levels for U.S. major cities from pollen.com3. The pollen level is a number between 0

and 12 and divided into five categories: 0.0-2.4 (low), 2.5-4.8 (low-med), 4.9-7.2

(medium), 7.3-9.6 (med-high), 9.7-12.0 (high).
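
A minimal sketch of this binning (category boundaries taken from the list above):

```python
def pollen_category(level):
    """Map a 0-12 pollen level onto pollen.com's five categories."""
    if level <= 2.4:
        return "low"
    if level <= 4.8:
        return "low-med"
    if level <= 7.2:
        return "medium"
    if level <= 9.6:
        return "med-high"
    return "high"

print(pollen_category(8.1))  # med-high
```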

Climate dataset. Climate Data Online (CDO)4 provides free access to National

Climatic Data Center (NCDC)’s archive of global historical weather and climate data.

We collected daily and monthly temperature and precipitation data generated since

January 2013 (because the earliest allergy-related Twitter data we had was generated in

January 2013) for major U.S. cities and states. More than a half of the climate data

2 https://dev.twitter.com/docs/streaming-apis
3 http://www.pollen.com/
4 http://www.ncdc.noaa.gov/cdo-web/

collecting stations did not report daily temperatures at all, and many, among those that

did report temperature, had missing values.

Allergy patients’ dataset. We used data from the first Quest Diagnostics Health

Trends allergy report, Allergies Across America5. This report is the largest analysis of

allergy testing of patients in the United States under evaluation for medical

symptoms associated with allergies. We collected a ranked list of the most prevalent food

allergies grouped by patients’ ages and a ranked list of the worst U.S. cities for different

allergy types.

3.2.2. Methodology

3.2.2.1. Data Preprocessing. As we were interested in messages that mentioned actual

allergy incidents, we removed all retweets (20.51% of our initial dataset) and tweets that

were not written in English (2.9% of our initial dataset). Special HTML characters were

replaced with human-readable characters (e.g., replaced &lt; with < (i.e., less-than sign),

replaced &gt; with > (i.e., greater-than sign)) and all hyperlinks were replaced with string

‘URL’.
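
A minimal sketch of these preprocessing steps (Python’s html module handles the special-character replacement; the URL pattern is a simplification):

```python
import html
import re

def preprocess(tweet):
    # Replace special HTML characters with human-readable ones
    # (e.g., &lt; -> <, &gt; -> >, &amp; -> &).
    text = html.unescape(tweet)
    # Replace every hyperlink with the string 'URL'.
    return re.sub(r"https?://\S+", "URL", text)

print(preprocess("Allergies &amp; a cold at once :( http://t.co/abc123"))
# -> "Allergies & a cold at once :( URL"
```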

3.2.2.2. Data Classification. While some tweets talked about a person having allergy

symptoms, other tweets talked about news, questions, general awareness of allergy sea-

son, or information/advertisements regarding allergy medicines/treatments. It is important

to distinguish tweets that mention actual allergy incidents to infer precise allergy levels.

Hence, we classified tweets into two classes. First, we manually labeled 2,000 randomly

selected tweets as positive or negative. A tweet was labeled as positive if it talked
5 https://www.questdiagnostics.com/dms/Documents/Other/2011_QD_AllergyReport.pdf

Table 3.1. Tweets with positive and negative labels. A tweet is positive if
it talks about the author or someone around the author having allergy. A
tweet is negative if it is a question or talks about news, general awareness
or information about allergies.
Positive(+1)/Negative(-1)    Tweet
+1 My allergies are going insane today.
(Author has allergy)
+1 Stupid allergies not letting me sleep.
(Author has allergy)
+1 Recently my lovely allergy to cats has led to my throat clos-
ing up n barely being able to breathe.
(Author has allergy)
+1 I never been able to enjoy spring cause my allergies. I hate
having itchy eyes and running nose.
(Author has allergy)
+1 @user1 @user2 and @user3 are all dying because of their
allergies.. and Im just sitting here.. #popapill
(People around author have allergies)
-1 In the United States, around 15 million people have food al-
lergies, according to Food Allergy Research and Education.
(News)
-1 Does anyone know good food near Happy Hollow that has
vegetarian options and is easy for seafood allergies?
(General question)
-1 Notice the increase in allergy ads on TV? Yep, spring is
around the corner.
(Awareness about spring season)
-1 RT @CureAllergies: What You Should Do To Manage Your
Allergies - URL.
(Information for allergy management)

Table 3.2. Classification performance of various classifiers using 10-fold


cross validation. The best classification performance (F-measure of 0.811
and ROC area of 0.905) was obtained using NaiveBayesMultinomial (NBM).
Classifier Precision Recall F-measure ROC Area
NBM 0.811 0.811 0.811 0.905
NB 0.799 0.793 0.793 0.864
Random Forest 0.812 0.800 0.799 0.888
SVM 0.818 0.810 0.809 0.814

about the author or someone around the author having allergy symptoms. A tweet was

labeled as negative if it talked about news, advertisement, or general awareness of al-

lergies. Table 3.1 shows example tweets with positive and negative labels. The text

in parenthesis indicates the reason for the positive or negative annotation. We used a

bag-of-words text classification where n-grams in documents were used as features. We

removed common stop words except the pronouns I, me, my, you, and your because we

found that these pronouns were important features in classifying tweets into positive and

negative examples of actual allergy incidents. To create features, we applied Weka [53]’s

StringToWordVector filter. All unigrams, bigrams, and trigrams were used to construct

the feature vector if they appeared at least twice in the training data. Then the filter

converted words into their stems, applied the TF-IDF weighting scheme, and kept the 500 most

frequently used n-grams in the final feature vector. We then explored four different ma-

chine learning algorithms (NaiveBayes (NB), NaiveBayes Multinomial (NBM), Random

Forest (RF), Support Vector Machine (SVM)) that are commonly used for text classifica-

tion. In our classification task, both precision and recall were equally important. Thus,

F-measure and ROC area were used to compare performance of classification algorithms.
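
An equivalent feature-extraction and classification pipeline can be sketched with scikit-learn as follows (our experiments used Weka’s StringToWordVector; stemming and the minimum-frequency cutoff are omitted here, and the labeled tweets are placeholders for the 2,000-tweet training set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder labeled tweets; the real training set had 2,000 of them.
tweets = ["My allergies are going insane today.",
          "Stupid allergies not letting me sleep.",
          "What You Should Do To Manage Your Allergies - URL",
          "Notice the increase in allergy ads on TV?"]
labels = [1, 1, -1, -1]  # +1: actual incident, -1: news/question/awareness

# Unigrams through trigrams, tf-idf weighted, capped at the 500 most frequent
# n-grams; the token pattern keeps single-letter tokens so that pronouns such
# as 'I' survive as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=500,
                             token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(tweets)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["my allergies are awful today"])))
```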

As shown in Table 3.2, the best classification performance (F-measure of 0.811 and

ROC area of 0.905) was obtained using NBM and 10-fold cross validation on labeled data.

We built a model using NBM on our training set, and classified all remaining tweets (after

removing retweets and non-English tweets) into positive or negative. We used NBM

because it had the best performance on our training data, and several prior works had

shown that NBM outperformed other classification algorithms. For example, McCallum

and Nigam [75] found NBM to outperform simple NB, especially at larger vocabulary

sizes, and Lee et al. [1] showed that the performance of NBM was better than that of NB

or SVM in 18-class tweet text classification. In our entire allergy corpus, 63% of tweets

were classified as positive and 37% as negative. Only tweets in the positive class

(i.e., tweets classified as mentions of actual allergy incidents) were used for our analysis.

TF-IDF (term frequency–inverse document frequency) [73]. The tf-idf measure allows

us to evaluate the importance of a word to a document. The importance is proportional

to the number of times a word appears in the document but is offset by the frequency of

the word in the corpus. Thus tf-idf is used to filter out common words.

NaiveBayes Multinomial (NBM) [75]. NB models a document as the

presence or absence of particular words. A variation of NB is Naive Bayes Multinomial

(NBM), which considers the frequency of words and can be denoted as:

(3.1)        P(c | d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k | c),

where P(c | d) is the probability of a document d being in class c, P(c) is the prior prob-

ability of a document occurring in class c, and P(t_k | c) is the conditional probability of

term t_k occurring in a document of class c.

3.2.2.3. Text Mining. We wanted to investigate whether we could automatically dis-

cover the most predominant allergy types that people suffer from or talk about on social

media by examining the texts in Twitter posts. From our allergy-related tweet corpus, we

extracted the most frequently occurring bigrams where the second word was ‘allergy’. An n-gram

is a contiguous sequence of n words in a text. N-gram models are widely used

in statistical natural language processing.



Table 3.3. A list of most frequently used bigrams where the second word is
allergy, ranked by frequency of use in the entire allergy corpus. It includes
many actual allergy types that have the ‘noun noun’ POS tag.
Rank Most Frequently Used 2-grams POS-tag
1. food allergy noun noun
2. peanut allergy noun noun
3. gluten allergy noun noun
4. nut allergy noun noun
5. natural allergy adjective noun
6. hate allergy verb noun
7. skin allergy noun noun
8. lower allergy comparative-adjective noun
9. cat allergy noun noun
10. milk allergy noun noun
11. issues allergy verb noun
12. worst allergy superlative-adjective noun
13. dog allergy noun noun
14. severe allergy adjective noun
15. pollen allergy noun noun

Part-Of-Speech (POS) tagging is a process of tagging a word with a part-of-speech

(lexical category) such as noun, pronoun, verb, adjective, etc. We applied POS tagging

to each bigram. For example, the POS tag for string ‘natural allergy’ is ‘adjective noun’

and the POS tag for string ‘peanut allergy’ is ‘noun noun’. Table 3.3 shows the list of

15 most frequently used bigrams and corresponding POS tags in the descending order of

frequency of use.

Our assumption was that the POS tag of all allergy types (e.g., food allergy, nut

allergy, pollen allergy, dust allergy, egg allergy) should be in the form of ‘noun noun’ and,

therefore, we could obtain a list of allergy types by removing all bigrams that were not in

‘noun noun’ form. In other words, we needed to remove all bigrams that contained non-

nouns (e.g., natural allergy (adjective noun), worst allergy (superlative-adjective noun))

to get the final list of allergy types. All bigrams that contained a Twitter screen name

(e.g., @username), stop words, or non-English words were also removed.
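A minimal sketch of this extraction procedure follows, assuming NLTK's tokenizer and Penn Treebank POS tagger; the toy tweets and the exact tag names are illustrative, and the thesis does not prescribe a specific tagger.

# Hedged sketch of the allergy-type extraction: count bigrams whose second
# word is 'allergy', POS-tag them, and keep only 'noun noun' patterns.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
from collections import Counter
from nltk import word_tokenize, pos_tag

tweets = ["my peanut allergy is acting up again", "worst allergy season ever"]

bigram_counts = Counter()
for text in tweets:
    tokens = word_tokenize(text.lower())
    for w1, w2 in zip(tokens, tokens[1:]):
        if w2 == "allergy":
            bigram_counts[(w1, w2)] += 1

allergy_types = []
for (w1, w2), freq in bigram_counts.most_common():
    tags = pos_tag([w1, w2])  # e.g., [('peanut', 'NN'), ('allergy', 'NN')]
    if all(tag.startswith("NN") for _, tag in tags):  # keep 'noun noun' only
        allergy_types.append((w1 + " " + w2, freq))
print(allergy_types)  # e.g., [('peanut allergy', 1)]; 'worst allergy' is filtered out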

3.2.2.4. Spatio-temporal Mining. Every tweet comes tagged with a timestamp that

indicates the time when the tweet was posted. For example, the timestamp ‘Sun Mar 02

05:55:02 +0000 2014’ indicates that the tweet was created on Sunday, March 2, 2014 at

5:55am GMT (Greenwich Mean Time). Since we were interested in tracking allergy levels

over time, we used the timestamps to count the volume of tweets posted each day that

mentioned allergy or a specific allergy type, symptom, or treatment.
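For illustration, here is a minimal sketch of the daily counting step; the timestamp format string follows Twitter's documented created_at format, and the two timestamps are placeholders.

# Minimal sketch of the daily tweet-volume counting using the created_at
# timestamp format shown above.
from collections import Counter
from datetime import datetime

timestamps = ["Sun Mar 02 05:55:02 +0000 2014",
              "Sun Mar 02 18:10:44 +0000 2014"]

daily_volume = Counter()
for ts in timestamps:
    dt = datetime.strptime(ts, "%a %b %d %H:%M:%S %z %Y")  # Twitter's format
    daily_volume[dt.date()] += 1
print(daily_volume)  # Counter({datetime.date(2014, 3, 2): 2})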

There are two types of tweet location: a sensor-based geolocation and a text-based user

profile location. A geolocation provides the exact location where the tweet was posted

with latitude and longitude values. This data is available to others only if the Twitter

user selects it to be publicly available. Twitter users can also identify their home location

in their Twitter user profile. We examined user profile locations and extracted state information.

Examples of users’ home locations that had state information were ‘Riverside, CA’, ‘some-

where in NY’ and ‘Gainesville, Florida’. Examples of home locations that lacked state

information were ‘Home Sweet Home’, ‘Somewhere over the rainbow’ and ‘Traveling’.

We tagged each tweet with a 2-character state code (e.g., CA for California) if we

successfully extracted the state information from the Twitter user profile.

Some tweets had both geolocation and user profile location, some had one or the other,

and the rest did not have any location information. Geolocations were first translated

into human-readable addresses using reverse geocoding API6 and then the state name was

extracted from the address. For tweets that did not have geolocation, we obtained state

6https://developers.google.com/maps/documentation/geocoding/

name from the user profile. Those that did not have any of the two locations were not

used in the spatial analysis.
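A hedged sketch of the state-tagging logic for profile locations follows; the state table is truncated to three entries for brevity, and the matching rules are a simplification of what such a system might use (tweets with geolocations would instead go through a reverse-geocoding service).

# Hedged sketch of state tagging from free-text profile locations. The state
# table is truncated; a full implementation would cover all 50 states.
import re

STATE_NAMES = {"california": "CA", "new york": "NY", "florida": "FL"}  # ...
STATE_CODES = set(STATE_NAMES.values())

def state_from_profile(location):
    """Return a 2-character state code extracted from a profile string, or None."""
    if not location:
        return None
    lowered = location.lower()
    for name, code in STATE_NAMES.items():
        if name in lowered:
            return code
    match = re.search(r"\b([A-Z]{2})\b", location)  # e.g., 'Riverside, CA'
    if match and match.group(1) in STATE_CODES:
        return match.group(1)
    return None

print(state_from_profile("somewhere in NY"))       # NY
print(state_from_profile("Gainesville, Florida"))  # FL
print(state_from_profile("Home Sweet Home"))       # None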

Table 3.4. 30 most frequently mentioned allergy types automatically ex-


tracted by our algorithm. Numbers indicate the rank of frequency the 2-
gram appears in the allergy corpus and +/- signs indicate whether it is an
actual allergy type (+) or not (-). 26 out of 30 were true positives, achieving
a precision of 86.7%.
Rank Allergy Types Rank Allergy Types
1. food allergy (+) 16. shellfish allergy(+)
2. peanut allergy (+) 17. claritin allergy(-)
3. gluten allergy (+) 18. drug allergy(+)
4. nut allergy (+) 19. eye allergy (+)
5. skin allergy (+) 20. asthma allergy (-)
6. cat allergy (+) 21. sun allergy (+)
7. milk allergy (+) 22. mucinex allergy (-)
8. dog allergy (+) 23. prescription allergy (-)
9. pollen allergy(+) 24. nickel allergy (+)
10. spring allergy(+) 25. meat allergy (+)
11. latex allergy (+) 26. bee allergy (+)
12. dairy allergy (+) 27. alcohol allergy (+)
13. dust allergy (+) 28. seafood allergy (+)
14. egg allergy (+) 29. mite allergy (+)
15. wheat allergy(+) 30. penicillin allergy (+)

Table 3.5. Most prevalent food allergies. The rank of the most prevalent
food allergies extracted from Twitter data is very similar to that obtained
from actual allergy patients’ data.
Ground Truth Twitter Data
Rank Most prevalent food allergies (Age>10) Rank Most mentioned food allergies
1. peanut allergy 1. food allergy
2. wheat allergy (gluten allergy) 2. peanut allergy
3. soybean allergy 3. gluten allergy
4. milk allergy 4. nut allergy
5. egg allergy 5. milk allergy
6. dairy allergy
7. egg allergy
8. wheat allergy

3.3. Experimental Results

3.3.1. Text Analysis

Allergy Types. Instead of using a pre-defined keyword list, we automatically identified

allergy types mentioned in our dataset by using natural language processing methods.

For the ground truth data, we created a list of allergy types by combining data from

multiple online resources7. Table 3.4 lists the top 30 most frequently mentioned allergy

types extracted from our allergy corpus by applying methods described in section 3.2.2.3.

The numbers indicate the rank of frequency (1 means the highest frequency, 30 means the

lowest frequency). The signs in the parenthesis indicate whether the extracted allergy type

is positive (an actual allergy type) or negative (not an actual allergy type). Out of top 30

allergy types, 26 were true positives and only 4 were false positives, leading to precision

of 86.7%. Two of the four false positive cases (claritin, mucinex) were allergy medicines,

and the other two cases were allergy-related disease (asthma) and term (prescription).

The traditional method that uses a pre-defined keyword list often fails to identify new

types of diseases, and new keywords (i.e., new disease types) have to be manually added.

However, with our proposed method that automatically identifies disease types, we would

not need the step where new disease types are manually added.

Most Prevalent Food Allergies. We further evaluated our Twitter data analysis

results by comparing it to the real-world allergy patients’ data. Table 3.5 shows the

ground truth value of the most prevalent food allergies in allergy patients in the first

column and the list of most mentioned food-related allergy types from table 3.4. We
7http://www.foodallergy.org/allergens, http://www.webmd.com/allergies/guide/
allergy-symptoms-types, http://acaai.org/allergies/types, http://www.healthline.com/
health/allergies/alcohol

used the data for patients older than age ten because most Twitter users fell into this

age group. The allergy types in the two columns appear in a very similar order of ranking.

Note that gluten and wheat allergy can be considered the same, and milk and dairy allergy

can also be considered the same. This shows not only that the extracted allergy types are

precise in identifying actual allergy types, but also that the ranking of prevalent allergy

types has a very strong relationship to the real-world allergy patients’ data.

Figure 3.1. Time-series graph of daily allergy levels detected in tweets (Feb-
ruary 2013 - April 2015). Only those allergy-related tweets labeled as posi-
tive are used to create the graph. The graph illustrates the general allergy
level trend over time. The allergy level is the highest in mid–May, goes
down in June and July, starts rising again in August, and reaches its local
maximum point in mid–September. Similar seasonal patterns are observed
in both 2013 and 2014.

Figure 3.2. Monthly average data for allergy tweet count (blue), daily high-
est temperature (green), and pollen level (red) for Washington state (March
2013 – April 2015). Pollen level is highly correlated with ∆temperature
(correlation of 0.776) and ∆tweet count (correlation of 0.706). Tweet count
is strongly correlated with temperature (correlation of 0.688).

3.3.2. Spatio-Temporal Analysis

In the temporal model, we tracked activities of allergy, various allergy types, symptoms and

medications over time using tweet timestamps. Figure 3.1 shows the allergy-related tweet

volume changes over a two-year period from February 2013 through April 2015. The

allergy level reaches its annual global maximum in mid-May and a local maximum in mid-

September and this seasonal pattern is observed in both 2013 and 2014. The increased

number of people chatting about their allergies in May and in September indicates that a

Figure 3.3. Monthly distribution of mentions of peanut and pollen allergies


(March 2013–April 2015). A huge seasonal variation is observed in monthly
pollen allergy (a seasonal allergy) level compared to that of peanut allergy
(a food allergy).

very large population suffers from spring allergies such as tree pollen allergies and there

is also a quite large population that has allergy symptoms in the fall.

To validate our experimental results, we compared our Twitter data against the actual

pollen levels and the weather data. Because pollen levels and temperatures vary depending

on location, we partitioned allergy-related Twitter data into a finer space granularity (U.S.

state level). Figure 3.2 compares three trend-lines: allergy tweet timeline (blue), monthly

average pollen level (red), and monthly mean max temperature (green) for Washington

state. We show the data for Washington state, not just because a large volume of allergy-

related tweets were generated in WA but also because the ground truth temperature

data for WA was available for all dates from March 2013 through April 2015. It is clear

from the graph that all three trend lines illustrate seasonality. An interesting pattern is

that there is an order in time of three trend lines reaching their maximum and minimum

points. The pollen level starts rising first and reaches its peak, followed by tweet counts

and temperature. The trend lines also decrease in the same order.

Our analysis shows that the pollen level is highly correlated with the rate of tempera-

ture change (correlation of 0.776) as well as the rate of tweet count change (correlation of

0.706). In other words, pollen level reaches its peak point when the temperature sharply

increases in spring and, at the same time, allergy-related tweet volume also sharply in-

creases. Also, tweet count has a strong correlation with daily temperature (correlation

of 0.688), meaning allergy tweet count increases as the temperature increases. The high

correlation values show how well the social media data reflects the real-world allergy

activities and hence can be a good source of health information.
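The following sketch illustrates how such correlations can be computed, assuming aligned monthly series; the input numbers are placeholders, so the printed values will not reproduce the correlations (0.776, 0.706, 0.688) reported above, which came from the real data.

# Sketch of the correlation analysis on aligned monthly series. np.diff
# computes the month-over-month change (the Δ series).
import numpy as np

tweets = np.array([120.0, 180, 340, 520, 300, 150])  # monthly tweet counts
pollen = np.array([1.0, 2.5, 7.8, 9.1, 4.0, 1.5])    # monthly pollen levels
temp   = np.array([48.0, 55, 63, 72, 78, 82])        # monthly max temperature

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]  # Pearson correlation coefficient

print(corr(pollen[1:], np.diff(temp)))    # pollen level vs. Δtemperature
print(corr(pollen[1:], np.diff(tweets)))  # pollen level vs. Δtweet count
print(corr(tweets, temp))                 # tweet count vs. temperature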

In Figure 3.3, we show how the trend of mentions of two different allergy types differ

over time. The tweet volume mentioning ‘pollen allergy’ (a seasonal allergy) rises very high

during the spring and the fall and remains very low in the summer. However, unlike pollen

allergy, the tweet volume mentioning ‘peanut allergy’ (a food allergy) stays relatively

constant throughout the year. Note that we also carried out the same experiment at the

U.S. state level and observed similar patterns in each state. This observation implies that

the seasonality observed in the overall allergy dataset in figures 3.2 and 3.3 comes from

tweets mentioning various seasonal-allergy-related terms such as spring, tree pollen, or

hay fever, rather than terms related to non-seasonal allergies such as dog, cat, milk or

egg.

Figure 3.4. Time-series graph of tweet count for various allergy symptoms
(Feb 2013–Sep 2014). The most common allergy symptom is sneezing (blue
line) throughout the year, followed by cough (green) and runny nose (sky
blue).

Figure 3.4 is a time-series graph showing tweet volume changes for different allergy

symptoms. Sneezing (blue) is the most common allergy symptom throughout the year,

followed by cough (green), runny nose (sky blue), watery eyes (red), and itchy throat

(turquoise). It is very interesting that the rank for different allergy symptoms on each day

is consistent throughout the year. Note that the percentage of Twitter users who make

their location publicly available has been steadily increasing since we started collecting

our data.

(a) Feb 2013 (b) May 2013 (c) Aug 2013 (d) Nov 2013

Figure 3.5. Distribution of allergy tweets with geolocations. The seasonal


pattern of allergy levels across the U.S. is clearly visible. The allergy level is the
highest in spring and the lowest in winter.

For 20% of the tweets in our allergy data set, we were able to identify U.S. state

names. 11.4% of those had actual geolocation (longitude and latitude) values. For the

remaining 88.6%, state names were extracted from the user profile locations.

Figure 3.5 shows monthly snapshots of tweets with geolocations that help us visualize

allergy levels across the U.S. We show quarterly seasonal maps for 2013. Each red dot

on the map represents a tweet that was posted from the location. This map shows a

general spatiotemporal trend of allergy activities. The allergy level starts increasing in

early spring and gets extremely severe in May. It remains high throughout the summer,

and goes down in the fall. Interestingly, most allergy-related tweets come from the eastern

part of the country although there are some from the west coast.

Next, using the U.S. state information we obtained from geolocations and user profile

locations, we visualized the distribution of tweets that mentioned different allergy types.

Figure 3.6 compares levels of peanut allergy (blue bar) and gluten allergy (red bar) de-

tected by social media sensors for each U.S. state. Because a greater number of tweets

were generated from states that had larger population, tweet counts were normalized by

state population and scaled to range between 0 and 100. Kansas had the highest level of

peanut allergy (94.51). South Dakota had the lowest level of both allergy types (3.85 for

Figure 3.6. Bar chart comparing monthly social-media-sensed peanut and


gluten allergy levels for each U.S. state. The tweet count is normalized by
state census population and scaled to range between 0 and 100. In most
US states, peanut allergy level is higher than gluten allergy level.

peanut allergy and 0 for gluten allergy). Most states had higher levels of peanut allergy

than gluten with a few exceptions. For example, unlike most other states, Oregon (OR),

Delaware (DE), and Montana (MT) had higher gluten allergy levels.
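A minimal sketch of this normalization follows; the tweet counts and state populations are illustrative placeholders, not the values behind Figure 3.6.

# Sketch of the per-state normalization: tweet counts divided by state census
# population, then min-max scaled to [0, 100].
raw_counts = {"KS": 310, "SD": 12, "OR": 95}               # placeholder counts
population = {"KS": 2_900_000, "SD": 850_000, "OR": 3_900_000}

rates = {s: raw_counts[s] / population[s] for s in raw_counts}
lo, hi = min(rates.values()), max(rates.values())
scaled = {s: 100 * (r - lo) / (hi - lo) for s, r in rates.items()}
print(scaled)  # each state's allergy level on a 0-100 scale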

3.4. Related Work

Before the Internet was widely used, over-the-counter pharmaceutical sales data [72]

and telephone triage data [47] were among the methods that were used for surveillance

of diseases.

Disease Surveillance using online data. In the past decade, with the dra-

matic increase of internet use, online data had been extensively used to retrieve health

information and to detect disease activities. Web search query data had been studied

to track influenza activity. Ginsberg et al. [50] used flu-related Google search query

data to estimate current flu activity near real time, 1-2 weeks in advance of the records

by the traditional flu surveillance system8. Recent research on public health and dis-

ease surveillance using online data has mostly focused on monitoring and predicting

influenza levels. Researchers had used Twitter data to monitor influenza outbreak and to

predict flu activities. Signorini et al. [95] attempted estimating current influenza activity

by tracking public sentiment and applying support vector machine algorithm on Twitter

data generated during the Influenza A H1N1 pandemic. Chew et al. [41] analyzed the

contents and sentiment of tweets generated during the 2009 H1N1 outbreak and showed

the potential and feasibility of using social media to conduct infodemiology studies for

public health. There are many others who have used Twitter data for flu outbreak detec-

tion [81, 39, 22, 96, 26, 38, 65, 93, 69]. Unlike earlier researchers who used Twitter

for flu activity detection and prediction, to the best of our knowledge, our work was the

first attempt to examine allergy activities using a large-scale Twitter stream.

Tweet Classification. Aramaki et al. [24] proposed a Twitter-based influenza epi-

demics detection method that used Natural Language Processing (NLP) to filter out

negative influenza tweets. Tuarob et al. [99] used ensemble machine learning techniques

to identify health-related messages in a heterogeneous pool of social media data. In our

work, we used a bag-of-words model and explored using four different machine learning

algorithms to find the best model to classify tweets into those that mention actual allergy

incidents and those that mention general awareness or information about allergy season.

8http://www.cdc.gov/flu/

Study of relationship between weather, pollen, and allergy. Many researchers

have studied the relationship between weather and pollen levels and how it affects severity

of allergy symptoms in patients [45, 101, 46]. In our work, the allergy levels were

extracted from social media data instead of from allergy patients, and we studied the

relationship between the trend of allergy-related tweets and the actual pollen levels and

temperatures at the U.S. state level.

In this work, we focused on examining only allergy activity using a large Twitter

stream collected over two years and showed in-depth spatiotemporal analysis results. We

also applied natural language processing techniques to automatically identify prevalent

allergy types from Twitter contents.

3.5. Summary

In this work, we proposed a system that monitored allergy levels near real-time by an-

alyzing streaming Twitter data. We first classified tweets to identify those that mentioned

actual allergy incidents using a bag-of-words model and a Naive Bayes Multinomial classifier

and then used those tweets with positive labels for text and spatiotemporal analysis.

We used text-mining techniques to automatically detect predominant allergy types.

The top thirty allergy types extracted by our algorithm had a precision of 86.7%. The

experimental results further showed that the rank of the most prevalent food allergy

types detected from tweet stream was highly correlated to the ground truth value, the

ranked list of prevalent allergies, obtained from real-world allergy patients’ data.

We demonstrated that the time-series graph of tweets mentioning seasonal-allergy-related

terms (e.g., pollen) showed clear seasonal patterns (a large volume of tweets in the spring

and a low volume of tweets in the winter) whereas those mentioning non-seasonal allergy

related terms (e.g., peanut) remained relatively constant throughout the year. By study-

ing relationships between allergy tweets and the pollen and weather data, we showed

that all three datasets had similar seasonal patterns and allergy tweet data had a very strong

relationship with the daily maximum temperature (correlation of 0.688).

We believe that our work was the first study that examined large-scale social media

data for in-depth analysis of allergy activities. Although our work had specifically focused

on studying allergy activities, the model could be generalized to track activities of other

diseases.

CHAPTER 4

Real-Time Digital Disease Surveillance using Twitter Data:

Demonstration on Flu and Cancer

4.1. Introduction

The Internet is usually the first place people turn for health information. People

search for a specific disease, symptoms, and appropriate medical treatments, and often

make decisions whether they should go see a doctor based on the search results. Healthcare

portal sites and the social media are popular online health information resources among

U.S. Internet users [?]. Disease surveillance is the monitoring of clinical syndromes such

as flu and cancer that have a significant impact on medical resource allocation and health

policy. Disease surveillance plays an important role in minimizing the harm caused by the

outbreaks by constantly observing the disease spread. The traditional approach employed

by the Centers for Disease Control and Prevention (CDC) [18] for flu surveillance includes

the collection of Influenza-like Illness (ILI) patients’ data from sentinel medical practices.

The main drawback of this method is the 1-2 weeks time lag between the time of medical

diagnosis and the time when the data becomes available. Early detection of a disease

outbreak is critical because it would allow faster communication between health agencies

and the public, and provide more time to prepare a response.



We built a novel real-time disease surveillance system that used Twitter data to track

U.S. influenza and cancer activities. Twitter1 is a popular micro-blogging service where

users can post short messages. Twitter’s popularity as a medium for real-time information

dissemination has been constantly increasing since its launch in 2006. The proposed sys-

tem continuously downloads flu and cancer related Twitter data using Twitter streaming

API [17] and applies spatial, temporal, and text models on this data to discover national

flu and cancer activities and popularity of disease-related terms. The outputs of the three

models are summarized as pie charts, time-series graphs, and U.S. disease activity maps

on our project websites [15][16] in real time. This demonstration built upon and ex-

tended our previous work [2]: text analysis of the most frequently occurring

terms was added. We further extended our real-time disease surveillance system to track

cancer activities in addition to flu. This work was published in [3].

4.2. System Description

Figure 4.1 shows the architecture of our real-time flu and cancer surveillance system.

Our dataset consisted of all recent tweets that mentioned the keywords ‘flu’ or ‘cancer’.

We collected over 6 million flu-related tweets generated by more than 3.3 million unique

users for 5.5 months since October 16, 2012, and over 3.7 million cancer-related tweets

generated by more than 1.3 million unique users for 3 months since January 7, 2013.

Such big data presents a number of challenges due to its size and complexity, relating

to its storage, retrieval, analysis, and visualization, especially when the whole process is

required to be done in real-time as in this work. Our system was designed to be a disease

surveillance system that is (almost) always available, robust, and easily scalable for big
1https://twitter.com/


Figure 4.1. Real-Time Disease Surveillance System continuously down-


loaded flu and cancer related tweets and applied geographical, temporal,
and text mining. The real-time analysis data was visually reported as
U.S. disease activity maps, timelines, and pie charts on our project web-
sites [15][16].

data. Different from many other related big data projects, which performed analytics on a

massive, static dataset, our system consisted of a cluster of several transactional databases

and high-dimensional data warehouses which were updated in real time. In our proposed

system, three types of analytics were considered - geographical/spatial, temporal, and

textual, the results of which were suitably presented pictorially, as described next.

Figure 4.2. Our Real-Time Digital Flu Surveillance Website [16]. The
‘Daily Flu Activity’ chart was an output of the temporal analysis and
showed the volume changes of tweets mentioning the word ‘flu’ over time.
The dramatic increase of flu tweet volume from Jan. 6 to Jan. 12 coin-
cided with the dates when the major U.S. newspapers reported Boston Flu
Emergency [21] and deaths of four children from the AH3N2 influenza out-
break [20]. The ‘U.S. Flu Activity Map’ was an output of the geographical
analysis and showed the weighted percentage of tweet volumes mentioning
‘flu’ by states. The level of flu activity was differentiated by different colors
for an easy comparison of U.S. regional flu epidemic.

4.2.1. Geographical Analysis

The goal of geographical analysis was to track disease spread in U.S. states by measuring

the volume of flu/cancer tweets generated in the region. For our experiments, we used

users’ home locations in their Twitter profiles. The dataset for geographic analysis was

all users who mentioned ‘flu’ or ‘cancer’ and had a valid U.S. state information (e.g.,

‘Evanston, IL’, ‘somewhere in NY’) in their home location fields. We excluded tweets

generated from outside the U.S. (i.e., tweets from foreign countries) and those with invalid

location information (e.g., ‘travelling’, ‘Wherever the wind blows me’). In our flu dataset,

there were 458,828 users with valid U.S. state information, and in our cancer dataset,

there were 193,797 users with valid U.S. state information. The U.S. Flu Activity Map

is shown in Figure 4.2. The tweet volume mentioning ‘flu’ generated in each state was

normalized by the population of the state.

4.2.2. Temporal Analysis

The goal of temporal analysis was to track the volume changes of tweets mentioning the

disease and related terms over time.

Disease Daily Activity Timeline. As shown in Figure 4.2, the Daily Flu Activity chart

shows the tweet volume changes of flu-related tweets over a three-month period from

January through March 2013. The data for the flu/cancer timeline was created by counting the

number of tweets mentioning ‘flu’ or ‘cancer’ generated daily. Our assumption was that

people would talk more about ‘flu’ when they themselves or people around them (e.g.,

family or friends) had flu symptoms and there would be more frequent news feeds when

the epidemic was widespread. Achrekar et al. [22] reported that the volume of flu-related

tweets was highly correlated with the number of reported ILI cases by the CDC. In the flu

timeline, the number of flu related tweets started increasing on January 6 and reached its

peak on January 12, which coincides with the date when The Huffington Post reported

the death of four children from the outbreak of AH3N2 influenza [20]. This showed how

our temporal analysis effectively reflected the wide spread of the epidemic.

Figure 4.3. Flu Symptoms Timeline. The timeline displays tweet volume
changes mentioning different flu symptoms from January through March
2013. ‘Cough’ (green line) and ‘fever’ (dark orange line) reach their highest
level in mid January and decrease as the actual national ILI level by CDC
decreases.

Types, Symptoms, Treatments Timelines. We not only tracked the overall flu and

cancer activities, but also monitored disease types, symptoms, and treatments over time.

Figure 4.3 shows the daily tweet volume changes for various flu symptoms. From the

timeline chart, we could easily tell the types and levels of flu symptoms in the general

population at a specific point in time. Cough and fever were the two most dominant

symptoms throughout the entire flu season, and headache and sore throat were the next two

most common flu symptoms. The actual U.S. national influenza activity level (percentage

weighted Influenza-like Illness by the CDC) was plotted as red squares for reference. Tweet

volumes mentioning flu symptoms reached their highest point around mid January and

decreased as the actual flu activity level from the CDC decreased.

Figure 4.4. Distribution of Cancer Types in Tweets.

Figure 4.5. Distribution of Cancer Symptoms in Tweets.

Figure 4.6. Distribution of Cancer Treatments in Tweets.



4.2.3. Text Analysis

In text analysis, we revealed deep health insights by examining the content of the tweets.

We were interested in investigating the popularity of terms used in three categories: (1)

disease types (2) symptoms (3) treatments, and created a keyword list for each category.

For example, the keyword list for cancer types was a list of breast cancer, lung cancer,

skin cancer, brain cancer, etc., the keyword list for cancer symptoms was a list of lump,

cough, fatigue, weight loss, etc., and the keyword list for cancer treatments was a list of

surgery, radiation, chemotherapy, Emend, Xeloda, etc. We also had similar keyword lists

for ‘flu’. For ‘flu’, we had 9 flu types, 15 symptoms, and 31 treatments. For ‘cancer’, we

had 58 cancer types, 21 symptoms, and 63 treatments. Figures 4.4, 4.5, and 4.6 show the

distribution of tweets mentioning a keyword in cancer types, symptoms, and treatments

keyword lists.

Figure 4.7. Most Frequent Words in Flu Tweets.

We were interested in investigating which words frequently co-occurred with a disease

name. After tokenizing tweet texts and removing all stop words, we counted the number of

occurrences of each unique word. Our flu dataset (6,097,406 tweets) consisted of 83,896,915

words and 4,001,445 unique words. Figure 4.7 shows the top 20 most frequent words in

our entire flu dataset.
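A minimal sketch of this word-frequency computation follows; the stop word list and tweets are illustrative placeholders.

# Sketch of the co-occurring word count: tokenize, drop stop words, count
# unique words across the flu corpus.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "is", "and", "i", "my", "to", "of", "have"}  # ...

counts = Counter()
for tweet in ["I think I have the flu", "flu shot and a fever"]:
    for token in re.findall(r"[a-z']+", tweet.lower()):
        if token not in STOP_WORDS:
            counts[token] += 1
print(counts.most_common(20))  # top-20 words, as reported in Figure 4.7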

4.3. Summary

We built a real-time disease surveillance system that used Twitter data to automati-

cally track flu and cancer activities. The experiments showed that our disease detection

system could map U.S. regional influenza and cancer activity levels near real-time, discover

and compare popularity of terms related to flu/cancer types, symptoms, and treatments.

The system could also effectively track daily flu/cancer activities and the volume changes

of tweets mentioning disease related terms over time. All of the output data was visualized

as interactive maps, pie charts, and time series graphs on our project websites [15][16].

Our system is highly scalable and can be easily extended to track other diseases. Because

the system is completely automated, it would be a very low-cost alternative to

the traditional high-cost disease surveillance systems that collect public health data from

sentinel medical practices.



CHAPTER 5

Forecasting Influenza Levels using Real-Time Social Media

Streams

5.1. Introduction

Seasonal influenza is an acute viral infection that can cause severe illnesses and com-

plications. For instance, the annual epidemics cause about 250,000 to 500,000 deaths

worldwide. Centers for Disease Control and Prevention (CDC) reported 105 pediatric

deaths due to influenza during 2012-2013 flu season1. Monitoring of disease activity en-

ables an early detection of disease outbreaks, which will facilitate faster communication

between health agencies and the public, thereby providing more time to prepare a re-

sponse. Disease surveillance helps minimize the impact of a pandemic and enables better

resource allocation. The traditional influenza surveillance system by CDC reports weekly

national and regional Influenza-Like Illness (ILI) physicians visit data collected from sen-

tinel medical practices2. This data is updated once a week and there is typically a two-

week time lag before the data is published. Furthermore, the published data is updated

for several more weeks as more clinical data is gathered.

For an early detection of influenza activity, Ginsberg et al.[50] proposed a method

that used flu-related online search engine query data to estimate the current flu activity

with one day reporting lag, 1-2 weeks ahead of CDC, and its estimation had been known
1http://www.cdc.gov/flu/spotlights/children-flu-deaths.htm
2http://www.cdc.gov/flu

to be reasonably accurate for most parts. However, in February 2013, an article titled

“When Google got flu wrong” [35] reported Google Flu Trends’ overestimation of the peak

of U.S. flu activity, which was almost double that of CDC’s observations.

During the last decade, the number of internet and social networking site users has

dramatically increased. People share ideas, events, interests and their life stories over the

internet. As of January 2017, Twitter has 100 million daily active users and 500 million

tweets are generated per day3. Experiences and opinions on various topics including

personal health concerns, symptoms and treatments are shared on Twitter. Mining such

publicly available health related data potentially provides valuable healthcare insights.

Furthermore, the increasing number of users that access social media platforms on their

mobile devices makes social media data an invaluable source of real-time information.

In this paper, we proposed a model that (1) predicted future influenza activities,

(2) provided more accurate real-time assessment than before, and (3) combined real-

time social media data streams and CDC historical datasets for predictive models to

accomplish accurate predictions. The results showed that our model using multilayer

perceptron with back propagation on a large-scale Twitter data could forecast current

and future flu activities with high accuracy. The goal of our work was to predict expected

influenza activity for the future, a week or more ahead of time so that it could be used

for planning, intervention, resource allocation and prevention. Furthermore, we aimed to

exploit social media communication for the prediction. This work was published in [5].

3https://www.omnicoreagency.com/twitter-statistics/

5.2. Related Work

For an early detection of disease outbreaks, researchers had used different statistical

and machine learning algorithms on different sources of data. Over-the-counter phar-

maceutical sales data [72] and telephone triage [47] had been used for surveillance of

ILI. Christakis et al. [43] studied whether monitoring of social friends could provide early

detection of flu outbreaks. Web search query data had been used for influenza surveil-

lance [48, 55, 84, 104, 50, 93, 83]. Ginsberg et al. [50] used flu-related Google search

queries data to estimate current flu activity and the near real-time estimation was reported

on Google Flu Trends (GFT) website4. Researchers had used GFT data to build an early

detection system for flu epidemics [83, 93]. Shaman et al. [93] used GFT data and

WHO/NREVSS collaborating laboratories data to estimate flu activity. The estimated

data was then recursively used to optimize a population-based mathematical model that

predicted flu activity. Pervaiz et al. [83] developed FluBreaks5, an early warning system

for flu epidemics using Google Flu Trends.

The use of social networking sites for public health surveillance had been steadily

increasing in the past few years [37]. Most disease surveillance works using social media

data focused on Twitter. A unique feature of Twitter is that messages propagate

in real time. Many had used Twitter data to predict various real world outcomes [89,

26, 32].

For current estimation of influenza activity, Signorini et al. [95] applied a support vector

regression algorithm to the Twitter stream generated during the influenza A H1N1 pandemic

to track public sentiment, and Achrekar et al. [22] used an auto-regression with exogenous inputs
4http://www.google.org/flutrends
5http://www.newt.itu.edu.pk/flubreaks

(ARX) model on Twitter data. In our previous work, we built a real-time disease surveil-

lance website that tracked U.S. regional and temporal flu activities including popularity

of terms related to flu types, symptoms, and treatments [2, 3]. Aramaki et al. [24] pro-

posed a Twitter-based influenza epidemics detection method that used natural language

processing (NLP) to filter out negative influenza tweets. Chew et al. [41] analyzed con-

tent and sentiment of tweets generated during the 2009 H1N1 outbreak and showed the

potential and feasibility of using social media to conduct infodemiology studies for public

health.

Paul and Dredze [81] applied the Ailment Topic Aspect Model to track illnesses over time

(syndromic surveillance), measure behavioral risk factors, localize illnesses by geographic

region, analyze symptoms and medication usage, and showed the broad applicability of

Twitter data for public health research. Li [69] proposed Flu Markov Network (Flu-MN),

a spatio-temporal unsupervised Bayesian algorithm based on a 4-phase Markov Network

for flu activity prediction. Lampos et al. [65] proposed an automated tool that tracked

ILI in the United Kingdom using a regression model and Bolasso, the bootstrapped ver-

sion of LASSO, for feature extraction from Twitter data. Lamb et al. [63] classified tweets

into different categories to distinguish those that reported infections versus those that ex-

pressed concerns about flu, tweets about authors versus tweets about others in an attempt

to improve the performance of influenza surveillance. Researchers had studied the diversity

of tweets [57] and run a real-time spatio-temporal analysis of West Nile virus using Twitter
data [61]. Sugumaran and Voss advised integrating existing epidemic systems, those
data [61]. Sugumaran and Voss advised to integrate existing epidemic systems, those

that used crowd-sourcing, news media (e.g., GPHIN, MedISys), mobile/sensor network,

and real-time social media intelligence, for an improved early disease outbreak system [98].

Chakraborty et al. [38] combined social indicators and physical indicators and used a ma-

trix factorization-based regression approach using neighborhood embedding to predict ILI

incidences in 15 Latin American countries.

Retrospective analysis and current estimates are important as they can describe the

observed trends. However, further prediction of future flu levels can represent a big leap

because such predictions provide actionable insights for public health that can be used for

planning, resource allocation, treatments and prevention. In contrast to other approaches,

we proposed a system that not only estimated current flu activity more accurately, but

also forecasted future influenza activities a week in advance beyond the current week

using aggregated ILI data by CDC and real-time Twitter data. The results showed that

our proposed model using multilayer perceptron with back-propagation algorithm could

forecast both current and future influenza activities with high accuracy.

5.3. Method

The data collection and modeling process is illustrated in Figure 5.1.

5.3.1. Dataset

We continuously downloaded publicly available tweets that mentioned ‘flu’ using Twitter

Streaming API6. The dataset used in this paper consisted of 20 million tweets generated

between December 2012 and May 2014. 71 weeks’ data (from week 1, 2013 until week 19,

2014) were used to build the model. Disambiguation of tweets was performed using text

analysis techniques to understand if a tweet was about a person talking about his/her own

flu or about someone else’s or if there were any mentions of common symptoms. Table 5.1
6https://dev.twitter.com/docs/streaming-apis


Figure 5.1. Data collection and modeling process. Disambiguation, filter-


ing and network analysis were performed on continuously downloaded flu-
related tweets. Weekly time-series flu-related tweet counts were computed
after data was smoothed out to align with CDC data. Current and 1-week
ahead flu prediction models were built.

lists examples of flu-related tweets. In the category column, user indicates that the tweet

is about the Twitter user being sick with flu, someone else indicates that the tweet is

about someone else (friends, family, etc.) being sick with flu, and symptom indicates

that the tweet describes one’s flu symptoms. Data was filtered to remove tweets that may

contain product advertisements (or links to websites) and, using network analysis, repeated

tweets by the same persons were filtered out.

5.3.2. Data Preprocessing

The following data preprocessing steps were taken on Twitter data.



Table 5.1. Examples of flu-related tweets.


Tweet Category
I’ve got the worst flu ever... already D: user
After a week sick in bed with the flu, look what I just woke up to! user
trying to get over this flu... I had completely forgot how much harder user
it is to deal with it during pregnancy.. feeling like death :”c
This flu and cough is killing me T.T user, symptom
Coding OAuth2 filters with a flu and fever... I look better with a user, symptom
mask on!
@friend feel better! The flu is nooo fun! Huggs!! someone else
My roommate has the flu and I get sick really fast I am packing my someone else
stuff and won’t be returning
please pray for my mom she’s caught the flu and is extremely ill at someone else
this moment
Sore throat, fever, flu, headache, cough. Uhuk uhuk symptom
sick with flu, sore throat, and slight fever. symptom

• Smoothing: We took a 7-day moving average of daily tweet volume to identify the

long-term flu activity trend by smoothing out the fluctuations and noise in the

short-term data. Moving average is a popular technique for analyzing time-series

data that is often used in financial data analysis such as stock prices.

• Weekly counts and alignment: Weekly Twitter data was then computed

by summing smoothed daily tweet volumes from Sunday through Saturday. The

dates for weekly Twitter data were aligned with dates in CDC weekly surveillance

reports so that analysis and predictions could be validated with CDC reports.

• Normalization: Weekly data was normalized by dividing each weekly data point

by the maximum of the 72 weekly data points. (A code sketch of these three
preprocessing steps follows this list.)
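The following is a hedged sketch of the three preprocessing steps on a daily tweet-count series, assuming pandas as one possible implementation; 'W-SAT' makes each resampled week end on Saturday so that weekly sums align with CDC's Sunday-through-Saturday reporting weeks, and the daily counts are placeholders.

# Hedged sketch of the preprocessing pipeline (smoothing, weekly alignment,
# normalization) on placeholder daily counts.
import pandas as pd

daily = pd.Series(
    [100, 120, 90, 150, 130, 140, 110, 105, 95, 160, 170, 180, 120, 115],
    index=pd.date_range("2013-01-06", periods=14, freq="D"),  # starts on a Sunday
)

smoothed = daily.rolling(window=7).mean()  # 1) 7-day moving average
weekly = smoothed.resample("W-SAT").sum()  # 2) Sun-Sat weeks, aligned with CDC
normalized = weekly / weekly.max()         # 3) divide by the series maximum
print(normalized)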

5.3.3. Feature Selection

In order to perform predictive modeling, features from the data were defined and extracted

as described below. Figure 5.2 depicts the data available at the end of week t. Wt denotes

!"#$% % %!"#& %%%%%%%%%!"#' %%%%%%%%%!"#( %%%%%%%!" %%%%%!")( %%%%%!")' %%%!")&%%%%

*+*%+,",%,-,./,0/1%234/%!"#'%

56.718%+,",%,-,./,0/1%234/%!"%

Figure 5.2. Data available at current week t. At the end of week t, all flu-
related Twitter data collected during current week t and prior are available.
At time t, the past two weeks’ (Wt−1 and Wt ) CDC data is not available as
CDC’s collection, retrospective analysis and reports take two weeks.

Table 5.2. CDC and Twitter features used in flu prediction model.

Notation Description
CDC-4-3-2 CDC ILI Data for Wt−4 , Wt−3 , Wt−2
CDC-3-2 CDC ILI Data for Wt−3 , Wt−2
CDC-2 CDC ILI Data for Wt−2
Twitter-4-3-2-1-0 Twitter Data for Wt−4 , Wt−3 , Wt−2 , Wt−1 , Wt
Twitter-3-2-1-0 Twitter Data for Wt−3 , Wt−2 , Wt−1 , Wt
Twitter-2-1-0 Twitter Data for Wt−2 , Wt−1 , Wt
Twitter-1-0 Twitter Data for Wt−1 , Wt
Twitter-0 Twitter Data for Wt

the current week and any time window beyond this represents the future. Wt−n denotes n

week(s) prior to current week, and Wt+n denotes n week(s) after current week. Each week

starts on Sunday and ends on Saturday to align with CDC weekly data. CDC data for

current week, Wt , and the week before, Wt−1 , is not available due to the time it takes to

collect patients data from the sentinel practices. The latest available CDC data is weekly

data for Wt−2 .

Since we were able to download publicly available tweets in real time, we had all

Twitter data generated during Wt . We used the most recent 5 weeks’ data for both CDC

and Twitter in our experiments. We experimented with different combinations of CDC

and Twitter data shown in table 5.2 as features of our predictive model to find the best

Table 5.3. Twitter data improves prediction performance.


Current Forecast
Feature Correlation Coefficient Improvement
CDC-4-3-2 Twitter-4-3-2-1-0 0.9525 +2.93%
CDC-4-3-2 0.9232
1-Week Ahead Forecast
Feature Correlation Coefficient Improvement
CDC-3-2 Twitter-4-3-2-1-0 0.9268 +6.37%
CDC-3-2 0.8631

Table 5.4. Comparison of current flu forecast model’s performance when


different learning rates and a varying number of hidden layers and hidden
units are used. The highest correlation of 0.9559 was obtained using learning
rate λ = 0.2 and one hidden layer with 4 activation units.
Number of activation units in first and second hidden layers
Learning Rate 2-0 3-0 4-0 5-0 2-2 3-2 4-2 5-2 2-3 3-3
λ = 0.1 0.9517 0.9496 0.9501 0.946 0.7359 0.8843 0.8976 0.9008 0.8973 0.9143
λ = 0.2 0.9548 0.954 0.9559 0.9527 0.9482 0.9481 0.9469 0.946 0.9498 0.9485
λ = 0.3 0.953 0.9548 0.9532 0.9499 0.9509 0.9511 0.95 0.9495 0.9518 0.9512
Number of activation units in first and second hidden layers
Learning Rate 4-3 5-3 2-4 3-4 4-4 5-4 2-5 3-5 4-5 5-5
λ = 0.1 0.9038 0.9115 0.915 0.9117 0.9182 0.9134 0.9168 0.9176 0.9256 0.9224
λ = 0.2 0.9465 0.9457 0.9501 0.948 0.9472 0.9455 0.9502 0.9483 0.9472 0.9466
λ = 0.3 0.9495 0.9492 0.9521 0.9506 0.9504 0.9491 0.9523 0.951 0.9504 0.9496

Table 5.5. Comparison of 1-week ahead flu forecast model’s performance


when different learning rates and a varying number of hidden layers and
hidden units are used. The highest correlation of 0.929 was obtained using
learning rate λ = 0.2 and one hidden layer with 4 activation units.
Number of activation units in first and second hidden layers
Learning Rate 2-0 3-0 4-0 5-0 2-2 3-2 4-2 5-2 2-3 3-3
λ = 0.1 0.9115 0.9176 0.9064 0.9018 0.8919 0.894 0.8907 0.8908 0.8984 0.8947
λ = 0.2 0.8996 0.904 0.929 0.9268 0.88 0.8843 0.8792 0.8768 0.8917 0.883
λ = 0.3 0.8491 0.8845 0.9268 0.8944 0.8831 0.878 0.8788 0.8775 0.887 0.8799
Number of activation units in first and second hidden layers
Learning Rate 4-3 5-3 2-4 3-4 4-4 5-4 2-5 3-5 4-5 5-5
λ = 0.1 0.8937 0.8931 0.8958 0.8981 0.8961 0.895 0.8957 0.8979 0.8981 0.8969
λ = 0.2 0.8806 0.8804 0.8948 0.8957 0.8877 0.8833 0.8965 0.8939 0.8916 0.8869
λ = 0.3 0.8759 0.8775 0.8893 0.8846 0.9023 0.8767 0.8902 0.9055 0.881 0.8824

features for influenza prediction. The model was trained and validated using 10-fold cross

validation on 71 weeks data. As shown in table 5.3, the best feature for the current

flu level forecast model was feature CDC-4-3-2 Twitter-4-3-2-1-0 (latest 3 weeks’ CDC

plus latest 5 weeks’ Twitter data) with a correlation coefficient of 0.9525, a +2.93%

performance improvement over feature CDC-4-3-2 (latest 3 weeks’ CDC data). The best

feature for 1-week ahead prediction model was CDC-3-2 Twitter-4-3-2-1-0, which resulted

in correlation coefficient of 0.9268, with +6.37% improvement over CDC-3-2. This clearly

showed that adding Twitter data significantly improved the performance of both current

and future flu level forecasts compared to that using only past CDC data.

5.3.4. Predictive Modeling

The proposed model had two parts. The first estimated current flu activity in terms of

percentage of ILI-related physicians visit (2 weeks ahead of CDC data). The second part

was forecasting future influenza activity a week into the future (3 weeks ahead of CDC

data). We used multilayer perceptrons (MLP) with back propagation as it had the best

performance among many learning and predictive modeling algorithms we experimented

with in forecasting both current and future influenza activities. In our experiments, we

used 3-layer MLP with 4 activation units in the hidden layer. The network structure for

our current flu activity forecast model is shown in figure 5.3.
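As an illustration of the model configuration described above, the following sketch uses scikit-learn's MLPRegressor as a stand-in for the 3-layer perceptron with back-propagation (one hidden layer with 4 activation units, learning rate 0.2); the feature matrix and targets are random placeholders for the 71 weekly CDC-4-3-2 Twitter-4-3-2-1-0 feature vectors and %weighted ILI values, not the data used in this work.

# Hedged sketch of the forecast model; X and y are random placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((71, 8))  # columns: CDC W_{t-4..t-2} and Twitter W_{t-4..t}
y = rng.random(71)       # CDC percentage weighted ILI for week t

model = MLPRegressor(hidden_layer_sizes=(4,), solver="sgd",
                     learning_rate_init=0.2, max_iter=2000, random_state=0)
model.fit(X, y)
print(model.predict(X[:1]))  # estimate of current-week flu activity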

5.4. Results

Table 5.4 and 5.5 show how the performance of current and 1-week ahead forecast

model changed with different values of learning rate and a varying number of hidden lay-

ers and units in each hidden layer, respectively. In the notation “A-B”, A indicates the number

of activation units in first hidden layer (layer 2) and B indicates the number of activation

units in second hidden layer (layer 3). Both the current and the 1-week ahead forecast

models achieved the best performance using learning rate λ = 0.2 and 3-layer multilayer

perceptron structure (input layer, 1 hidden layer, output layer) with 4 activation units in

the hidden layer as shown in Figure 5.3.

Figure 5.3. Structure of the multilayer perceptron used in our influenza
activity forecast model. The inputs are CDC_Wt-4, CDC_Wt-3, CDC_Wt-2,
Tweets_Wt-4, Tweets_Wt-3, Tweets_Wt-2, Tweets_Wt-1, and Tweets_Wt;
the output is %WEIGHTED_ILI_Wt.

Current Influenza Activity Estimation

Our current flu forecast model used CDC-4-3-2 Twitter-4-3-2-1-0 (i.e., all currently

available CDC and Twitter data generated in recent 5 weeks) as features because it gave

the highest correlation of 0.9525 when the model was trained and validated using 10-fold

cross validation on 71 weeks data. Although our Twitter dataset had been collected for

1.5 years, each week’s data made only one data point for the weekly flu activity forecast

model. To best utilize the number of available data points, we built the initial model

using the first one year data (52 data points for year 2013) with 10-fold cross validation.

Then, each week, we incrementally built a new model with all available data points. For

example, a new model was trained using 52 data points (week 1, 2013 – week 52, 2013) to
make current flu level prediction for week 1, 2014. Then a newer model was built again

using 53 data points (week 1, 2013 – week 1, 2014) to make the current prediction for

week 2, 2014. As we continued to collect more Twitter data, the model would be trained

on a larger data set and therefore be more robust.

Figure 5.4. Comparison of our current and 1-week ahead U.S. influenza
activity forecast results against CDC and Google Flu Trends data. For the
current week prediction, a correlation coefficient of 0.9522 over the 52 training
data points and a correlation coefficient of 0.929 over the 19 held-out test
data points were obtained. For the 1-week ahead forecast, a correlation
coefficient of 0.895 over the 52 training data points and a correlation coefficient
of 0.71 over the 19 previously unseen test data points were obtained. (a)
Current U.S. influenza activity. (b) 1-week ahead U.S. influenza activity.

Figure 5.4 is a time-series graph that compares our flu activity prediction (red line)

against the actual CDC %ILI (blue line) and Google Flu Trends (GFT) data [50] (green

line). The earliest prediction by our model was for the first week of 2013 because we

started collecting flu-related Twitter data
in late 2012. Both our prediction (Fig. 5.4(a)) and GFT data were available two weeks

earlier than the official CDC ILI report. Our model was fitted on 52 weeks data (week 1,

2013 – week 52, 2013) with a correlation of 0.9522 and a mean absolute error (MAE) of

0.2383, and was further validated on 19 previously unseen weekly data (week 1, 2014 –

week 19, 2014) with a correlation of 0.929 and MAE of 0.493. Our prediction did as well

or better than the GFT data at most data points, and aligned very well with the CDC

ILI data. Furthermore, our prediction performed significantly better than GFT during

January 2013 when GFT’s algorithm significantly overestimated peak flu levels [35].

Future Influenza Activity Forecast

Our 1-week ahead flu forecast model used CDC-3-2 Twitter-4-3-2-1-0 as features. This

feature set provided the highest correlation of 0.9268 on the model trained and validated

using 10-fold cross validation on 71 weeks data, which was higher than the correlation

of 0.8952 obtained by using only CDC-3-2. Here also adding Twitter data improved

the model performance. An initial model was built using the first one-year data and a

newer model was incrementally rebuilt in the following weeks (in a similar manner to how

our current flu forecast model was built). Our 1-week ahead forecast data (Fig. 5.4(b)) was

available 3 weeks ahead of the official CDC ILI report and 1 week ahead of GFT data. The

model was fitted using 52 data points (week 1, 2013 - week 52, 2013) and incrementally

rebuilt using all available data (including the new weekly data collected during the current

week) thereafter. The final model was validated by measuring a correlation between the

CDC weekly percentage weighted ILI and that predicted by our model on 19 additional

previously unseen weekly data points (week 1, 2014 through week 19, 2014). A correlation

of 0.895 and MAE of 0.3846 were obtained on the training data and a correlation of 0.71

and MAE of 0.662 were obtained on the previously unseen test data. These results were

very good considering our forecast data was available 3 weeks faster than the official CDC

data.

5.5. Summary

We presented a model that predicted weekly percentage of U.S. population with

Influenza-Like Illness using multilayer perceptron with back propagation algorithm on

a large-scale social media stream. Adding recent flu-related Twitter data as features

improved the model’s performance for both current and future forecasts. Our proposed

model could predict current and future influenza activities with high accuracy 2-3 weeks

faster than the traditional flu surveillance system could. The performance for the cur-

rent prediction was comparable to or better (in January 2013) than GFT. We expect the

model’s performance to improve as we continuously collect more Twitter data. We believe

these results present a very important step in not only accurately forecasting future flu

activity for planning, resource allocation and prevention, but also in demonstrating a

technique that can combine social media, an unstructured communication stream, with

observational data for prediction.

CHAPTER 6

Medical Concept Normalization

6.1. Introduction

On social media and online health communities, people often share their experiences

and opinions on various health topics including personal health issues and symptoms.

Especially, on medical forums, consumers ask health related questions, write reviews

on medications and describe negative side effects they experience while taking a drug.

Moreover, patients and their families can get emotional support by sharing their stories

of overcoming illnesses.

Medical concept normalization for user-generated texts aims at mapping a health

condition described in colloquial language to a medical concept in standard ontologies such

as Unified Medical Language System (UMLS) [71] via concept unique identifiers (CUIs).

This task has many applications for improving patient care such as: 1) understanding

questions and providing answers to patients/families seeking medical knowledge, 2) early

detection of patients who need immediate attention and medical support (e.g., people with

suicidal ideation), 3) digital disease surveillance (e.g., monitoring of pandemics), and 4)

clinical paraphrasing to improve patient engagement by helping patients understand their

clinical reports.

While consumers describe their health conditions in colloquial language, clinical knowl-

edge sources such as biomedical literature present medical terms in scientific language.

Table 6.1. Medical concepts in UMLS and example social media phrases
that describe the medical concept

Medical Concept Social Media Phrases


loss of hair hair falling out, hair loss, hair losss, losing my hair, thinning hair, hair has
started falling out, hair is getting very thin, hair was falling out
memory impairment memory problem, memory failure, memory deficits, poor memory, trouble re-
membering, memory weakened, couldn’t remember, foggy brain
ankle pain ankle hurt, ankles started aching, pain in ankles, ankles seized up, sore ankles,
sore and stiff ankles, terrible pain in my ankles, ankles ache so bad
diarrhoea direar, diaharrea, diahhrea, diahrea, diarrehea, dioreah, dioreaha,
bathroom with the runs
difficulty sleeping can not sleep, difficult to sleep, hard time sleeping, inability to sleep well, lousy
sleeping at night, poor sleep, problems sleeping, trouble sleeping

This gap in language use between patients/consumers and clinicians requires mapping from one to the other. In order to generate solutions to a given medical problem (e.g., to answer questions posted on an online health community), health conditions in user-generated texts need to be normalized to medical concepts in standard ontologies. Once a solution is generated, it needs to be translated back into colloquial language for users to easily understand.

Table 6.1 shows examples of user-generated texts from social media that describe medical concepts. The entries in the left column are medical concepts from standard medical ontologies, and the phrases in the same row are example phrases from social media that describe the concept. The examples illustrate well the colloquial language and non-standard terms used to describe medical conditions on social media. As can be seen in the table, the challenges for medical concept normalization include: 1) alternative descriptions of health conditions in colloquial language (e.g., 'sore and stiff ankles', 'terrible pain in my ankles', 'ankles ache so bad' → ankle pain; 'trouble sleeping', 'cannot sleep', 'hard time sleeping' → difficulty sleeping), and 2) no overlap of terms between a colloquial phrase and the scientific/medical term describing the same health condition (e.g., 'couldn't remember' → memory impairment; 'sight loss' → visual impairment; 'trouble remembering', 'foggy brain' → memory impairment). In the latter case, basic string matching approaches that do not capture the semantics of the text will perform poorly on a medical concept normalization task. Other challenges include misspellings and typos, as shown for the concept 'diarrhoea'.

In this work [6], we aimed to address the aforementioned challenges using deep learning-based architectures, and we studied how different types of input data used to build neural embeddings affect medical concept normalization performance.

Our key contributions are:

• We investigated the use of various domain-specific text data to build neural em-

beddings to learn semantic features of medical concepts for normalization.

• We demonstrated that two deep learning models (CNN and RNN) could better

predict the medical concepts when we used neural embeddings trained on domain-

specific clinical texts compared to those trained on a larger general domain text

corpus.

• Our best results established a new state of the art on two benchmark datasets, outperforming the accuracy of a strong normalization model by up to +21.17% on the Twitter data set and up to +21.28% on the AskAPatient data set.

This chapter is organized as follows. In Section 6.2, we present related work on deep neural network models, social media for healthcare, and medical concept normalization. In Section 6.3, we describe the CNN and RNN models we used for concept normalization. In Section 6.4, we describe how we re-created the social media datasets and present the details of the text data from various clinical knowledge sources used to build neural embeddings. In Section 6.5, we present our experimental results, followed by the conclusion in Section 6.6.

6.2. Related Work

6.2.1. Social Media for Healthcare

Social media has been widely used as a new medium for real-time information transmission in various domains, including health, to track the volume of mentions of diseases, drugs, and symptoms [3, 4], predict influenza activity, and detect adverse drug events (ADEs) earlier than traditional influenza or ADE surveillance systems, which have significant time delays in data processing [40, 68]. For automatic extraction of medical concepts from social media, researchers have used machine learning approaches such as CRFs (Conditional Random Fields) and HMMs (Hidden Markov Models) to extract phrases that describe medical concepts (e.g., diseases, drugs, symptoms) [79, 90], identify relationships between two medical concepts (e.g., duration, frequency, dosage, and route for a drug; indication; side effects), and classify texts into different categories (e.g., health vs. non-health, ADE vs. non-ADE) [99, 92, 68].

6.2.2. Deep Neural Network Models

Recurrent neural network (RNN) models have been shown to be very effective in many natural

language processing (NLP) tasks. Unlike traditional neural network models, RNNs use

sequential information. Hence they are well-suited for tasks such as machine transla-

tion, speech recognition, language modeling and image caption generation. Traditionally,

convolutional neural network (CNN) models have been widely used in image processing
86

tasks (e.g., automatic recognition of hand-written numbers, object detection) because of

their ability to learn task-relevant features. However, with the recently proposed word

embedding models (word2vec) by Mikolov et al. [76, 77], deep neural network models for

NLP tasks have gained popularity. Kim [59] showed that a simple one-layer CNN model trained on top of pre-trained word vectors outperforms several state-of-the-art models for

text classification such as sentiment analysis and question classification. Lee et al. [68]

explored semi-supervised CNN models to detect adverse drug events in tweets and demon-

strated that neural word embeddings trained on a smaller domain-specific dataset helped

more than the one trained on a larger random dataset for ADE classification. Deep learning models have also been shown to be highly effective in other healthcare tasks such as clinical diagnostic inferencing [86] and neural clinical paraphrase generation [54, 85].

6.2.3. Concept Normalization

Traditional approaches used for medical concept normalization include lexicon-based

string matching, heuristic string matching, and rule-based text mapping to a set of pre-

defined variants of terms [88, 25, 74]. DNorm [67] is a state-of-the-art concept (disease name) normalization system based on pairwise learning to rank, which learns similarities between mentions and concept names. Limsopatham et al. used a machine translation

approach in which a social media phrase is translated into a formal medical concept. More

recently, Limsopatham et al. [70] showed that simple deep learning models, convolutional

neural network (CNN) and recurrent neural network (RNN), with pre-trained word em-

beddings induced from a large collection of Google News (GNews) and BioMed Central

(BMC) articles improved the performance over previous state-of-the-art concept normal-

ization models and reported that GNews was more effective than BMC for both CNN and

RNN across all datasets.

Our work significantly improved on the results of Limsopatham et al. [70] by refining their original datasets and leveraging neural embeddings of various health-related texts to better learn the semantic characteristics of medical concepts, providing a new state-of-the-art accuracy for medical concept normalization.

6.3. Model Description

In this section, we describe the two deep learning models, a convolutional neural network (CNN) and a recurrent neural network (RNN), that we used for medical concept normalization.
Figure 6.1. Generic convolutional neural network architecture.

6.3.1. Convolutional Neural Network (CNN)

CNN is a feed-forward neural network model that learns task-relevant semantic features for text classification. Figure 6.1 depicts a simple CNN with an input layer, followed by a convolutional layer with multiple filters, a pooling layer, and a final softmax classifier. The input to the CNN is a phrase or sentence represented as a matrix; each row of the matrix is a low-dimensional vector (word embedding) representing a token or word.

Formally, given an input phrase x of length j, where x = x_i, x_{i+1}, ..., x_{i+j} denotes a sequence of words and x_i denotes a k-dimensional word vector, a filter w ∈ R^{hk} is applied to a window of h words to produce a new feature in the convolution layer. For example, a feature c_i is generated as follows:

(6.1)  c_i = f(w · x_{i:i+h-1} + b)

from a window of words x_{i:i+h-1}, where b is a bias term and f is a nonlinear activation function. Each filter is applied across the input matrix to produce a feature map. The pooled features are then passed to a fully connected softmax layer to output the most probable label [59]. For example, for the eight-word phrase 'my feet feel like I have stone bruises' with a 300-dimensional embedding, the input to the CNN would be an 8 × 300 matrix and the output would be the CUI representing the medical concept 'foot pain'.
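To make this concrete, the following is a minimal sketch of such a single-layer text CNN in PyTorch. It is a generic Kim-style model rather than our exact implementation; names such as TextCNN, vocab_size, and num_concepts are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextCNN(nn.Module):
        """Single-layer CNN over word embeddings (Kim-style sketch)."""
        def __init__(self, vocab_size, num_concepts, embed_dim=300,
                     num_filters=100, window_sizes=(3, 4, 5), dropout=0.5):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            # One convolution per window size h; each filter spans h words.
            self.convs = nn.ModuleList(
                [nn.Conv1d(embed_dim, num_filters, h) for h in window_sizes])
            self.dropout = nn.Dropout(dropout)
            self.fc = nn.Linear(num_filters * len(window_sizes), num_concepts)

        def forward(self, tokens):              # tokens: (batch, seq_len)
            x = self.embedding(tokens)          # (batch, seq_len, embed_dim)
            x = x.transpose(1, 2)               # Conv1d expects (batch, dim, seq)
            # c_i = f(w · x_{i:i+h-1} + b), then max-pool each feature map.
            feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
            out = self.dropout(torch.cat(feats, dim=1))
            return self.fc(out)                 # softmax is applied in the loss

Each Conv1d filter implements Equation 6.1 across all windows of the input matrix, and max-pooling keeps the strongest response per filter before the final fully connected layer.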

6.3.2. Recurrent Neural Network (RNN)

RNN is a family of artificial neural networks that uses internal memory to process variable-length sequential data. Figure 6.2 shows an unrolled RNN architecture, where x_t, y_t, and h_t are the input, output, and hidden states at time step t, and W, U, and V are model parameters corresponding to the input, hidden, and output layer weights shared across all time steps [54].




Figure 6.2. Generic recurrent neural network architecture.

The hidden state h_t can be formulated as follows:

(6.2)  h_t = f(W x_t + U h_{t-1}),

where h_{t-1} is the previous hidden state, x_t is the current input, and f is an element-wise nonlinear activation function.

Although the RNN is a powerful model for encoding sequences, it suffers from the vanishing gradient problem when trying to learn long-range dependencies [28]. We used a gated recurrent unit (GRU) [42], which is known to be a successful remedy for the vanishing gradient problem. The hidden state h_t of a GRU can be formulated as follows:

       z_t = σ(W^z x_t + U^z h_{t-1})
       r_t = σ(W^r x_t + U^r h_{t-1})
       k_t = tanh(W^k x_t + U^k (r_t ⊙ h_{t-1}))
(6.3)  h_t = (1 - z_t) ⊙ k_t + z_t ⊙ h_{t-1},

where ⊙ denotes element-wise multiplication. The GRU cell has two gates, an update gate z_t and a reset gate r_t; k_t is the candidate hidden state. z_t and r_t are computed using different weight parameters: z_t determines how much of the old memory to keep, while r_t determines how to combine the new input with the previous memory. Finally, k_t is computed using r_t, and h_t carries the information to be transmitted to the following layers.
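As an illustration, a minimal NumPy sketch of a single GRU step, directly transcribing Equation 6.3, is shown below; the weight shapes and the small usage example are illustrative assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wk, Uk):
        """One GRU time step following Equation 6.3."""
        z_t = sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate
        r_t = sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate
        k_t = np.tanh(Wk @ x_t + Uk @ (r_t * h_prev))  # candidate state
        return (1.0 - z_t) * k_t + z_t * h_prev        # new hidden state

    # Toy usage with a 4-dimensional input and a 3-dimensional hidden state.
    rng = np.random.default_rng(0)
    d_in, d_h = 4, 3
    x, h = rng.normal(size=d_in), np.zeros(d_h)
    Wz, Wr, Wk = (rng.normal(size=(d_h, d_in)) for _ in range(3))
    Uz, Ur, Uk = (rng.normal(size=(d_h, d_h)) for _ in range(3))
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wk, Uk)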

6.4. Experimental Setup

6.4.1. Data

We used two data sets, TwADR-L (from Twitter) and AskAPatient, which were used by Limsopatham et al. [70] for medical concept normalization1. TwADR-L was created by the authors of [70], and the AskAPatient dataset was created by Karimi et al. [58] for ADR (adverse drug reaction) research, from which the authors extracted the gold-standard mappings of phrases to medical concepts.
Table 6.2. Data statistics after removing duplicates from the combined training, validation, and test data

                                 TwADR-L   AskAPatient
# unique phrases                   2,944         4,469
# unique labels                    2,220         1,036
# unique phrase-label pairs        3,157         4,496
# phrases with multiple labels       173            26
Min # examples per label               1             1
Max # examples per label              36           141
Avg # examples per label            1.42          4.35

In the original dataset, TwADR-L had 48,057 training, 1,256 validation, and 1,427

test examples. The test set (all test samples from 10 folds combined) consisted of 765

unique phrases and 273 unique classes (or medical concepts). The AskAPatient dataset

contained 156,652 training, 7,926 validation, and 8,662 test examples. The entire test

set (all test samples from 10 folds combined) consisted of 3,749 unique phrases and 1,035
1Available at https://zenodo.org/record/55013#.WKXwdxIrLde

Table 6.3. Examples of phrases with multiple labels

Social Media Phrase   Multi-Labels (Medical Concepts)

shaking               shivering, trembling, tremor
mad                   anger, rage
have no emotion       emotional disorder, indifferent mood
mood swings           bipolar disorder, disturbance in mood
sore                  pain, myalgia
high blood pressure   increased venous pressure, hypertension, findings of increased blood pressure

unique classes (medical concepts). The authors randomly split each dataset into ten equal folds, ran 10-fold cross-validation, and reported the accuracy averaged across the ten folds.

We found that, in the original data set, many phrase-label pairs appeared multiple

times within the same training data file and also across the training and test data sets in

the same fold. In the AskAPatient data set, on average 35.82% of the test data overlapped

with training data in the same fold. In the Twitter (TwADR-L) dataset, on average

8.62% of the test set had an overlap with the training data in the same fold. Having

a large overlap between the training and the test data could potentially introduce bias

in the model and contribute to artificially high accuracy. Therefore, to remove this bias, we further cleaned and re-created the training, validation, and test sets such that each phrase-label pair appeared only once in the entire dataset (in either the training, validation, or test set).

First, we combined all examples in training, validation and test data from the original

data set and then removed all duplicate phrase-label pairs (examples that had the same

phrase and label pair and appeared more than once in training/validation/test datasets).

Table 6.2 shows statistics of the new dataset after removing duplicates. The Twitter data

set had 3,157 unique phrase-label pairs and 2,220 unique labels (medical concepts) while

173 phrases had multiple labels (i.e., they were assigned to more than one label). Many

concepts had only one example, and the concept with the most examples had 36 phrases. On average, each concept had 1.42 examples. The AskAPatient data set

had 4,496 unique phrase-label pairs and 1,036 unique labels while 26 phrases had multiple

labels. Table 6.3 shows examples of phrases that had multiple labels. For example, ‘mad’

could be mapped to ‘anger’ or ‘rage’, and ‘sore’ could be mapped to ‘pain’ or ‘myalgia’.

Second, we removed all concepts that had fewer than five examples. The statistics of the final data are shown in Table 6.4. Third, we divided all examples without multiple labels into ten random folds such that each unique phrase-label pair appeared in exactly one of the ten test sets. We added the pairs with multiple labels to the training data. This final 10-fold dataset was used in all our experiments; the procedure is sketched after Table 6.4.

Table 6.4. Data statistics after removing concepts that had fewer than five examples

                                 TwADR-L   AskAPatient
# unique phrases                     543         2,494
# unique labels                       65           228
# unique phrase-label pairs          617         1,427
# phrases with multiple labels       173            26
Min # examples per label               5             5
Max # examples per label              36            78
Avg # examples per label             9.5            11
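The sketch below illustrates this re-creation procedure in pandas, as referenced above; the DataFrame layout and the column names phrase and cui are illustrative assumptions, not the exact code used in this work.

    import pandas as pd

    def recreate_folds(df: pd.DataFrame, n_folds: int = 10, min_examples: int = 5):
        """Deduplicate phrase-label pairs, drop rare concepts, and assign folds."""
        # Step 1: each phrase-label pair appears exactly once in the dataset.
        df = df.drop_duplicates(subset=["phrase", "cui"])
        # Step 2: remove concepts with fewer than `min_examples` examples.
        df = df.groupby("cui").filter(lambda g: len(g) >= min_examples)
        # Step 3: phrases mapped to multiple CUIs always go into training;
        # the remaining pairs are spread evenly over `n_folds` test folds.
        multi = df.groupby("phrase")["cui"].transform("nunique") > 1
        single = df[~multi].sample(frac=1, random_state=0).reset_index(drop=True)
        single["fold"] = single.index % n_folds
        return single, df[multi]

    # Toy usage: 'mad' maps to two CUIs and therefore stays in training.
    demo = pd.DataFrame({"phrase": ["mad", "mad", "sore"] * 4,
                         "cui": ["anger", "rage", "pain"] * 4})
    folds, always_train = recreate_folds(demo, n_folds=2, min_examples=1)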

6.4.2. Data Sources for Word Embedding

In this section, we describe different types of unlabeled text data we used for building

neural embeddings.

Figure 6.3. Definition, example sentence, synonyms, related words, near antonyms, and antonyms for the word 'sore' obtained from the Merriam-Webster Thesaurus.

6.4.2.1. Thesaurus (TH). For each word in the TwADR-L and AskAPatient datasets (both phrases and labels), we obtained the following six types of information from the Merriam-Webster thesaurus2: definition, example sentence, synonyms, related words, near antonyms, and antonyms. Figure 6.3 illustrates the information obtained for the word 'sore', the second-to-last example in Table 6.3. The definition of 'sore' includes the label 'pain', and the list of synonyms includes 'painful' (the adjective form of the label 'pain'). Therefore, word embeddings built with the thesaurus should help the model learn the semantics and predict the label 'pain'.

Figure 6.4. Medical definition of the term ‘myalgia’ obtained from Merriam-
Webster Medical Dictionary.

2https://www.merriam-webster.com/thesaurus

6.4.2.2. Medical Dictionary (MD). We collected definitions from the Merriam-Webster Medical Dictionary3, which contains 60,000 words and phrases used by healthcare professionals. It is also used on the National Library of Medicine's consumer health website to help consumers with the spelling of medical words and the understanding of medical notes written by physicians4. For each unique word in the TwADR-L and AskAPatient datasets, we obtained a medical definition (if present) using the Merriam-Webster medical dictionary API5. The dictionary contains clinical terms that may not be found in the thesaurus. We found that while the definitions of some terms were the same in both the thesaurus and the medical dictionary, other terms either were defined with slightly different words/phrases or lacked a definition in one or both sources. For example, the word 'myalgia' was in the medical dictionary but not in the thesaurus. As shown in Figure 6.4, we were able to collect the definition of 'myalgia', a medical term that was not found in the thesaurus.

6.4.2.3. Clinical Texts (CT). Clinical Texts is a collection of sentences from the fol-

lowing sources in the medical domain.

Adverse Drug Reaction Classification System (ADReCS)6: a comprehensive ADR ontology database that provides both standardization and hierarchical classification of ADR terms [36]. The database integrates ADR and drug information collected from

3 https://www.merriam-webster.com/medical
4 https://www.nlm.nih.gov/news/mplusdictionary03.html
5 https://www.dictionaryapi.com/products/api-medical-dictionary.htm
6 http://bioinf.xmu.edu.cn/ADReCS/

various public medical repositories such as DailyMed7, MedDRA [34], SIDER2 [62], DrugBank8, PubChem9, and UMLS. It contains 6.7K unique ADR terms, 1,698 drug names, and 154K drug-ADR pairs. For each term in the ADReCS database, we collected its definition and synonyms. For example, the definition of the word 'myalgia' is 'painful sensation in the muscles', and its synonyms include 'myalga', 'myaigia', 'soreness', 'muscle pain', 'muscle ache', etc.

Biomedical Literature: We collected 301,790 sentences from all Wikipedia pages under the category of clinical medicine10. We also collected 4,271 sentences from PubMed articles in the adverse drug events benchmark corpus [52].

Medical Concept to Lay Term Dictionaries: We used two medical-to-lay-term dictionaries to create a collection of sentences11,12. These dictionaries contain professional medical terms and their definitions described in lay language. For example, the medical term 'anesthesia' is defined in lay language as 'loss of sensation or feeling', the term 'cephalalgia' as 'headache', and the term 'dyspnea' as 'hard to breathe' or 'short of breath'. From these dictionaries, we generated sentences (e.g., 'Anesthesia refers to loss of sensation or feeling', 'Cephalalgia means headache') by combining a term and its definition with a connecting phrase randomly chosen from a small preselected set (e.g., 'stands for', 'refers to', 'indicates', 'means'), as sketched below. We created a total of 1,556 sentences from these sources.
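A minimal sketch of this sentence-generation step, assuming the dictionaries are available as a term-to-definition mapping, might look as follows:

    import random

    # Hypothetical term-to-lay-definition mapping extracted from the dictionaries.
    lay_definitions = {
        "anesthesia": "loss of sensation or feeling",
        "cephalalgia": "headache",
        "dyspnea": "hard to breathe",
    }
    connectors = ["stands for", "refers to", "indicates", "means"]

    random.seed(0)
    sentences = [f"{term} {random.choice(connectors)} {definition}"
                 for term, definition in lay_definitions.items()]
    # e.g., 'cephalalgia means headache'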

7 https://dailymed.nlm.nih.gov/dailymed/
8 https://www.drugbank.ca/
9 https://pubchem.ncbi.nlm.nih.gov/
10 https://en.wikipedia.org/wiki/Category:Clinical_medicine
11 http://gsr.lau.edu.lb/irb/forms/medical_lay_terms.pdf
12 https://depts.washington.edu/respcare/public/info/Plain_Language_Thesaurus_for_Health_Communications.pdf

UMLS Medical Concept Definitions: We extracted a total of 167,550 sentences

that defined medical terms in the UMLS Metathesaurus [31], a large biomedical thesaurus

consisting of millions of medical concepts and used by professionals for patient care and

public health.

Table 6.5. Medical concepts and similar words based on cosine similarity, obtained from word embeddings built with different health-related text corpora.

Medical Concept   Clinical Texts (CT)   Medical Dictionary (MD)   Thesaurus (TH)   Health-related Tweets (HT)

depression dysthymia arthritic recession boredom
anxiety mood disorder weightgain
schizophrenia-like diminution collapse obesityWHO
benzodiazepine-induced exertion lassitude irritability
hopelessness fatigue lethargy anxiety

insomnia apnea sleeplessness depressionchronic
derealization wakefulness migraines
sleep – restlessness weightgain
dysthymic hyperexcitability
awakening stressrelated

dizzy lightheaded verge woozy lightheaded
faint restless fainting nauseous
nauseated light-headed whirling headache
swaying lamely faint lethargic
shaky paranoia feeble sleepfeeling

myalgia backache arthralgia
arthralgia athralgia
asthenia – – muscleampjoint
aches odynophagia
fatigability bodymuscle

hypertension dyslipidemia arterial diseaseheart
renovascular hypotension diabetes
nephrosclerosis narrowing – dyslipidemia
beta-antagonists weakness pressurehigh
Gestosis diallation arteriosclerosis

6.4.2.4. Health-related Tweets (HT). We collected 100 million publicly available health-related tweets that mentioned 116 common diseases and symptoms (e.g., flu, depression, insomnia, diabetes, obesity, heart disease, anxiety disorder) using the Twitter streaming API13, which provides approximately 1% of all publicly available tweets. As preprocessing steps, we removed non-English tweets, tokenized the text, normalized it to lowercase, and replaced hyperlinks, numerics, and Twitter screen names with the special tokens 'URL', 'NUMBER', and 'USER'.

Table 6.5 shows medical concepts and examples of the top 20 similar words by cosine similarity, based on the word embeddings built with each individual data source.
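As an illustration of how such embeddings and similarity lists can be produced, a small sketch using gensim's word2vec implementation is shown below; gensim is a plausible tool choice and the toy corpus is an assumption, not the exact pipeline used in this work.

    from gensim.models import Word2Vec

    # `corpus` is assumed to be an iterable of preprocessed token lists.
    corpus = [["my", "insomnia", "is", "terrible"],
              ["could", "not", "sleep", "last", "night"]]

    # 300-dimensional skip-gram embeddings, matching the experiments below.
    model = Word2Vec(sentences=corpus, vector_size=300, sg=1,
                     window=5, min_count=1, workers=4)

    # Nearest neighbors by cosine similarity (cf. Table 6.5).
    print(model.wv.most_similar("insomnia", topn=20))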

6.5. Results

Table 6.6. Classification accuracy (%) using 10-fold cross-validation (TH = thesaurus, MD = medical dictionary, CT = clinical texts, HT = health-related tweets; batch size = 50, number of epochs = 100, vector dimension = 300)

Word Embeddings        TwADR-L CNN   TwADR-L RNN   AskAPatient CNN   AskAPatient RNN
Rand 16.06 22.05 40.95 58.54
GNews 15.57 23.17 45.73 64.41
TH 14.43 20.43 32.66 57.17
MD 15.73 19.62 41.90 58.26
CT 14.77 22.21 45.49 61.81
HT 16.69 24.63 45.46 64.08
TH + MD + CT + HT 19.46 25.30 55.46 65.04

Table 6.6 shows the accuracy of the classification models using 10-fold cross-validation, averaged over the ten folds. The first two rows are our baseline models14 [70], in which the CNN and

13https://dev.twitter.com/streaming/public
14Code available at https://github.com/nutli/concept_normalisation

RNN models use randomly generated embeddings (Rand) and publicly available pre-trained word embeddings generated from 100 billion words of Google News (GNews)

using word2vec [77] as inputs. The next four rows (rows 3-6) present the performance of the same CNN and RNN architectures as the baseline models but using the word embeddings we built on the various clinical texts described in Section 6.4.2. The last row presents the performance when the models use word embeddings built from the combination of all four data sources as input. All experiments, including the baseline models, were trained and evaluated on the cleaned, newly created datasets (described in Section 6.4.1).

Among the individual datasets (TH, MD, CT, HT), the health-related tweets (HT) had the most significant impact on classification performance. Both the CNN and RNN models performed comparably to (on the AskAPatient dataset) or better than (on the TwADR-L dataset) the best baseline models. When we combined all individual datasets, the classification accuracy improved substantially over all baseline models and all our individual-source models. Compared to the best baseline accuracy, the improvement was +21.17% for TwADR-L CNN, +9.19% for TwADR-L RNN, +21.28% for AskAPatient CNN, and +0.98% for AskAPatient RNN; the improvement was especially substantial for the CNN. For all models, we used the following hyperparameters: batch size = 50, number of epochs = 100, vector dimension = 300, number of neurons in the hidden layer = 100, dropout rate = 0.5, the rectifier (ReLU) as the nonlinear activation function, and max-pooling for the CNN.
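For concreteness, a hedged sketch of wiring these hyperparameters into a training loop for the TextCNN sketched in Section 6.3.1 is shown below; the toy data, the sizes, and the choice of the Adam optimizer are illustrative assumptions rather than our exact setup.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy data standing in for the tokenized phrases; sizes are illustrative.
    tokens = torch.randint(0, 20000, (500, 8))   # 500 phrases, 8 tokens each
    labels = torch.randint(0, 65, (500,))        # 65 concepts (cf. Table 6.4)
    train_loader = DataLoader(TensorDataset(tokens, labels), batch_size=50)

    model = TextCNN(vocab_size=20000, num_concepts=65)  # sketch from Section 6.3.1
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()                   # applies log-softmax internally

    for epoch in range(100):                            # number of epochs = 100
        for batch_tokens, batch_labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(batch_tokens), batch_labels)
            loss.backward()
            optimizer.step()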

6.5.1. Ablation Study

Next, we conducted experiments to study the effect of removing each dataset from training.

Table 6.7 presents the performance loss when each dataset is removed from the set of all

Table 6.7. Ablation study. Comparison of the models' accuracy (%) when a feature is removed from the set of all features (TH = thesaurus, MD = medical dictionary, CT = clinical texts, HT = health-related tweets). The numbers in parentheses indicate the performance drop when the feature is removed.

Word Embeddings   TwADR-L CNN   TwADR-L RNN   AskAPatient CNN   AskAPatient RNN
All - HT 18.80 (-0.66) 22.54 (-2.76) 46.37 (-9.09) 62.97 (-2.07)
All - TH 16.38 (-3.08) 25.44 (+0.14) 45.29 (-10.17) 62.96 (-2.08)
All - CT 15.58 (-3.88) 24.96 (-0.34) 45.61 (-9.85) 64.09 (-0.95)
All - MD 17.69 (-1.77) 26.60 (+1.3) 44.50 (-10.96) 63.93 (-1.11)
All 19.46 25.30 55.46 65.04

possible resources (TH + MD + CT + HT). Interestingly, a different data source appeared to be the most important for each deep learning model and dataset. The performance dropped by 3.88% (from 19.46% to 15.58%) when the clinical texts (CT) were removed, indicating that CT is the most important of the four individual features for the TwADR-L CNN. For the TwADR-L RNN, the health-related tweets (HT) were the most helpful feature, as indicated by the 2.76% performance drop when they were removed.

While the definitions from the medical dictionary (MD) contributed the most to the AskAPatient CNN model (a 10.96% performance drop when removed), the definitions, synonyms, and antonyms from the thesaurus (TH) were the most significant feature for the AskAPatient RNN model (a 2.08% performance drop when removed). These results indicate that text data from each healthcare domain helps the deep learning models learn clinical semantics for normalization. Word embeddings built with the larger dataset that combined texts from multiple healthcare domains contributed significantly to improving the models' performance across both the Twitter and AskAPatient datasets, compared to embeddings built from a larger general-domain corpus such as Google News.

Table 6.8. TwADR-L examples that should have multiple labels

Social Media Phrase                Gold CUI   Gold Concept        Predicted CUI   Predicted Concept

feel like crap                     C0011570   mental depression   C0344315        depressed mood
not being able to eat              C1971624   loss of appetite    C0232462        decrease in appetite
feeling weird                      C1443060   feeling abnormal    C0278061        abnormal mental state
depressive emotions and thoughts   C0011570   mental depression   C0086132        depressive symptoms
wide awake                         C0455769   energy increased    C0043012        wakefulness

6.5.2. Qualitative Analysis

Table 6.8 shows examples that our best model predicted incorrectly. The first column shows example phrases from social media posts that describe medical conditions; the second and third columns show the annotated CUIs (concept unique identifiers) and the corresponding medical concept descriptions; and the fourth and fifth columns show the CUIs and corresponding concept descriptions predicted by our best model (TH + MD + CT + HT). These examples count as errors against the ground-truth labels (i.e., the predicted CUIs do not match the labeled CUIs). However, we can observe that, although the CUIs are different, the social media phrases can actually be mapped to both the predicted and the labeled concepts. For example, the predicted concept 'decrease in appetite' and the label 'loss of appetite' have similar meanings, so predicting the phrase 'not being able to eat' as the concept 'decrease in appetite' should arguably be considered correct. While some phrases in the dataset have multiple labels, there are many more that should have multiple labels (such as those shown in Table 6.8).

This suggests several future directions for designing a normalization system. First, it would be useful to have sets of CUIs that represent similar medical concepts so that, when a normalization system predicts a CUI, the mapping can automatically be associated with the other CUIs in the same set. Second, the normalization task should be cast as a multi-class, multi-label classification problem, since each phrase can be mapped to multiple concepts (as shown in Tables 6.3 and 6.8) and each concept can have many social media phrases (as shown in Table 6.1).
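One plausible way to realize the second direction, sketched under the assumption that gold labels are available as multi-hot vectors, is to replace the single softmax output with independent per-concept sigmoids trained with binary cross-entropy:

    import torch
    import torch.nn as nn

    num_concepts = 65                        # illustrative, cf. Table 6.4
    logits = torch.randn(50, num_concepts)   # scores from a CNN/RNN encoder
    targets = torch.zeros(50, num_concepts)  # multi-hot gold labels
    targets[0, [3, 7]] = 1.0                 # e.g., 'mad' -> {anger, rage}

    # One independent sigmoid per concept instead of a single softmax.
    loss = nn.BCEWithLogitsLoss()(logits, targets)

    # At prediction time, every concept whose probability exceeds a threshold
    # is returned, allowing multiple labels per phrase.
    predicted = torch.sigmoid(logits) > 0.5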

6.6. Summary

In this work, we explored building neural word embeddings from unlabeled text data drawn from various clinical knowledge sources for medical concept normalization of user-generated social media texts. We showed that the two deep learning models (CNN and RNN) could better predict medical concepts when we used clinical domain-specific neural embeddings than when we used embeddings trained on a larger general-domain text corpus. Our experiments showed that the proposed models, with neural embeddings trained on the combined clinical data sources, improved accuracy by up to 21.17% on the Twitter data set and up to 21.28% on the AskAPatient data set.

CHAPTER 7

Conclusion and Future Research Work

Social media is an invaluable resource for mining healthcare insights. In this thesis,

we presented intelligent systems we built using Twitter data for retrieving health-related

information, monitoring and predicting disease activities, and normalizing medical concepts. We proposed a multi-class classification model that classified trending topics or Twitter posts into 18 general categories. Although both of our approaches, bag-of-words and network-based, classified topics with high accuracy, the network-based model using categories of similar topics achieved superior classification performance. This model could help with searching for information in a specific domain such as health. We also discussed our contributions toward building a real-time disease surveillance system using spatial, temporal, and text mining on Twitter data. The proposed system could effectively track daily disease activities and map U.S. regional disease levels in near real time. Although our work focused on tracking three diseases (allergy, cancer, flu), the model could easily be adapted to track other diseases. We further built a neural network model that predicted current and future influenza activities with high accuracy by combining large-scale real-time social media data with observed CDC data. Finally, we investigated normalizing health conditions described in colloquial language to standard medical terminologies in the Unified Medical Language System (UMLS). By training two deep learning models, a convolutional neural network (CNN) and a recurrent neural network (RNN), on various clinical knowledge sources, we were able to achieve significantly better results than the baseline techniques.

Although some pioneering work has been done, many challenges remain in mining social media for healthcare insights. Medical concept extraction - identifying phrases that describe health conditions - is a challenging task because of the many different ways of describing the same condition and the colloquial language used on social media. Developing novel models for medical concept extraction using advanced natural language processing (NLP) techniques and deep learning would be an interesting research area for future work. Such models would help automatic systems understand users' health issues or clinical questions accurately so that they can provide more relevant information to users. We are also interested in automatically detecting mentions of adverse drug events (ADEs), negative side effects that occur as a result of medical interventions, in social media.

We proposed a number of techniques that could be useful for collecting, analyzing, and predicting health-related information from real-time social media. However, many challenging problems remain in mining user-generated content for healthcare insights. We hope that the techniques proposed in this thesis can serve as a stepping stone toward addressing some of those research questions.



References

[1] K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, and A. Choud-

hary. Twitter trending topic classification. In Data Mining Workshops (ICDMW),

2011 IEEE 11th International Conference on, pages 251–258. IEEE, 2011.

[2] K. Lee, A. Agrawal, and A. Choudhary. Real-time digital flu surveillance using

twitter data. In The 2nd Workshop on Data Mining for Medicine and Healthcare,

2013.

[3] K. Lee, A. Agrawal, and A. Choudhary. Real-time disease surveillance using twitter

data: Demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining, KDD ’13,

pages 1474–1477, New York, NY, USA, 2013. ACM.

[4] K. Lee, A. Agrawal, and A. Choudhary. Mining social media streams to improve

public health allergy surveillance. In 2015 IEEE/ACM International Conference on

Advances in Social Networks Analysis and Mining (ASONAM), pages 815–822, Aug

2015.

[5] K. Lee, A. Agrawal, and A. Choudhary. Forecasting influenza levels using real-

time social media streams. In 2017 IEEE International Conference on Healthcare

Informatics (ICHI), pages 409–414, Aug 2017.



[6] K. Lee, S. A. Hasan, O. Farri, A. Choudhary, and A. Agrawal. Medical concept nor-

malization for online user-generated texts. In 2017 IEEE International Conference

on Healthcare Informatics (ICHI), pages 462–469, Aug 2017.

[7] T. Zhu, H. Gao, Y. Yang, K. Bu, Y. Chen, D. Downey, K. Lee, and A. N. Choudhary.

Beating the artificial chaos: Fighting osn spam using its own templates. IEEE/ACM

Transactions on Networking, 24(6):3856–3869, December 2016.

[8] H. Gao, Y. Yang, K. Bu, Y. Chen, D. Downey, K. Lee, and A. Choudhary. Spam

ain’t as diverse as it seems: Throttling osn spam with templates underneath. In

Proceedings of the 30th Annual Computer Security Applications Conference, ACSAC

’14, pages 76–85, New York, NY, USA, 2014. ACM.

[9] D. Palsetia, M. M. A. Patwary, K. Zhang, K. Lee, C. Moran, Y. Xie, D. Honbo, A. Agrawal, W.-K. Liao, and A. Choudhary. User-interest based community extraction in social networks. 2012.

[10] A. Choudhary, W. Hendrix, K. Lee, D. Palsetia, and W.-K. Liao. Social media

evolution of the egyptian revolution. Commun. ACM, 55(5):74–80, May 2012.

[11] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. Choudhary. Towards online spam fil-

tering in social networks. In Proceedings of the 19th Annual Network and Distributed

System Security Symposium, 2012.

[12] K. Zhang, Y. Cheng, Y. Xie, D. Honbo, A. Agrawal, D. Palsetia, K. Lee, W.-K. Liao, and A. Choudhary. SES: Sentiment elicitation system for social media data. In 2011

IEEE 11th International Conference on Data Mining Workshops, pages 129–136,

Dec 2011.

[13] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. Choudhary. Poster: Online spam

filtering in social networks. In Proceedings of the 18th ACM Conference on Computer

and Communications Security, CCS ’11, pages 769–772, New York, NY, USA, 2011.

ACM.

[14] Real-time digital allergy surveillance. http://pulse.eecs.northwestern.edu/~kml649/allergy/.

[15] Real-time digital cancer surveillance. http://pulse.eecs.northwestern.edu/~kml649/cancer/.

[16] Real-time digital flu surveillance. http://pulse.eecs.northwestern.edu/~kml649/flu/.

[17] Twitter streaming API. https://dev.twitter.com/docs/streaming-apis.

[18] Centers for Disease Control and Prevention, seasonal influenza (flu). http://www.

cdc.gov/flu, 2012.

[19] World of DTC Marketing.com. Web first place people go for health information, but you knew that already, didn't you. http://worldofdtcmarketing.com, 2012.



[20] The Huffington Post. Michigan flu season 2013: Four children die in influenza outbreak of A(H3N2). http://www.huffingtonpost.com/2013/01/12/michigan-flu-season-2013-ah3n2_n_2458916.html, 2013.

[21] USA Today. 700 cases of flu prompt Boston to declare emergency. http://www.usatoday.com/story/news/nation/2013/01/09/boston-declares-flu-emergency/1820975, 2013.

[22] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu. Predicting flu trends us-

ing twitter data. In Computer Communications Workshops (INFOCOM WKSHPS),

2011 IEEE Conference on, 2011.

[23] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms. Machine

learning, 6(1):37–66, 1991.

[24] E. Aramaki, S. Maskawa, and M. Morita. Twitter Catches the Flu: Detecting In-

fluenza Epidemics Using Twitter. In Proceedings of the Conference on Empirical

Methods in Natural Language Processing, pages 1568–1576, 2011.

[25] A. R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus:

the MetaMap program. In Proceedings of the AMIA Annual Symposium, pages 17–21, 2001.

[26] S. Asur and B. A. Huberman. Predicting the Future with Social Media. In Proceed-

ings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence

and Intelligent Agent Technology - Volume 01, pages 492–499, 2010.



[27] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event

identification on twitter. In Proceedings of AAAI, 2011.

[28] Y. Bengio, P. Simard, and P. Frasconi. Learning Long-Term Dependencies with

Gradient Descent is Difficult. IEEE Transactions on Neural Networks, 5(2):157–

166, 1994.

[29] D. L. Blackwell, J. W. Lucas, and T. C. Clarke. Summary health statistics for U.S. adults: National Health Interview Survey, 2012. http://www.cdc.gov/nchs/data/series/sr_10/sr10_260.pdf, 2013.

[30] B. Bloom, L. I. Jones, and G. Freeman. Summary health statistics for U.S. children: National Health Interview Survey, 2012. http://www.cdc.gov/nchs/data/series/sr_10/sr10_258.pdf, 2012.

[31] O. Bodenreider. The unified medical language system (umls): integrating biomedical

terminology. Nucleic acids research, 32:D267–D270, 2004.

[32] J. Bollen and H. Mao. Twitter mood as a stock market predictor. Computer,

44(10):91–94, 2011.

[33] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal

of Computational Science, 2(1):1 – 8, 2011.

[34] E. G. Brown, L. Wood, and S. Wood. The Medical Dictionary for Regulatory Activities (MedDRA). Drug Safety, 20(2):109–117, 2012.



[35] D. Butler. When Google got flu wrong. Nature, 494(7436):155–156, Feb. 2013.

[36] M. Cai, Q. Xu, Y. Pan, W. Pan, N. Ji, Y. Li, H. Jin, K. Liu, and Z. Ji. ADReCS:

an ontology database for aiding standardization and hierarchical classification of

adverse drug reaction terms. Nucleic Acids Research, 43(Database-Issue):907–913,

2015.

[37] D. Capurro, K. Cole, I. M. Echavarrı́a, J. Joe, T. Neogi, and M. A. Turner. The Use

of Social Networking Sites for Public Health Practice and Research: A Systematic

Review. J Med Internet Res, 16(3):e79, Mar 2014.

[38] P. Chakraborty, P. Khadivi, B. Lewis, A. Mahendiran, and J. Chen. Forecasting a Moving Target:

Ensemble Models for ILI Case Count Predictions. In SDM, 2014.

[39] L. Chen, H. Achrekar, B. Liu, and R. Lazarus. Vision: Towards real time epidemic

vigilance through online social networks: Introducing SNEFT – social network enabled flu trends. In Proceedings of the 1st ACM Workshop on Mobile Cloud Computing & Services: Social Networks and Beyond, MCS '10, pages 4:1–4:5, New York,

NY, USA, 2010. ACM.

[40] L. Chen, K. S. M. T. Hossain, P. Butler, N. Ramakrishnan, and B. A. Prakash. Flu

gone viral: Syndromic surveillance of flu on twitter using temporal topic models.

In 2014 IEEE International Conference on Data Mining, ICDM 2014, Shenzhen,

China, December 14-17, 2014, pages 755–760, 2014.



[41] C. Chew and G. Eysenbach. Pandemics in the Age of Twitter: Content Analysis of

Tweets during the 2009 H1N1 Outbreak. PLoS ONE, 5(11):e14118, 11 2010.

[42] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the Properties of Neu-

ral Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8,

Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation,

pages 103–111, 2014.

[43] N. A. Christakis and J. H. Fowler. Social network sensors for early detection of

contagious outbreaks. PloS one, 5(9):e12948, 2010.

[44] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[45] L. de Weger, T. Beerthuizen, P. Hiemstra, and J. Sont. Development and validation

of a 5-day-ahead hay fever forecast for patients with grass-pollen-induced allergic

rhinitis. International Journal of Biometeorology, 58(6):1047–1055, 2014.

[46] J. Emberlin, J. Mullins, J. Corden, W. Millington, M. Brooke, M. Savage, and

S. Jones. The trend to earlier birch pollen seasons in the UK: a biotic response to

changes in weather conditions? Grana, 36(1):29–33, 1997.

[47] J. U. Espino, W. R. Hogan, and M. M. Wagner. Telephone triage: a timely data

source for surveillance of influenza-like diseases. In AMIA Annual Symposium Pro-

ceedings, page 215, 2003.



[48] G. Eysenbach. Infodemiology: tracking flu-related searches on the web for syndromic

surveillance. In AMIA Annual Symposium Proceedings, page 244, 2006.

[49] Y. Genc, Y. Sakamoto, and J. V. Nickerson. Discovering context: Classifying tweets

through a semantic transform based on wikipedia. In Proceedings of HCI Interna-

tional, 2011.

[50] J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, and L. Brilliant.

Detecting influenza epidemics using search engine query data. Nature, 457:1012–

1014, 2009.

[51] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant

supervision, 2009.

[52] H. Gurulingappa, A. M. Rajput, A. Roberts, J. Fluck, M. Hofmann-Apitius, and

L. Toldo. Development of a benchmark corpus to support the automatic extraction

of drug-related adverse effects from medical case reports. Journal of Biomedical

Informatics, pages 885 – 892, 2012.

[53] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The

WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl., 11(1):10–18,

Nov. 2009.

[54] S. A. Hasan, B. Liu, J. Liu, A. Qadir, K. Lee, V. Datla, A. Prakash, and O. Farri.

Neural clinical paraphrase generation with attention. ClinicalNLP 2016, page 42,

2016.

[55] A. Hulth, G. Rydevik, and A. Linde. Web queries as a source for syndromic surveil-

lance. PloS one, 4(2):e4378, 2009.

[56] IBM SPSS Modeler. http://www-01.ibm.com/software/analytics/spss/

products/modeler/.

[57] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time

of Outbreaks. In Proceedings of the 22nd International Conference on World Wide

Web Companion, pages 1335–1342, 2013.

[58] S. Karimi, A. Metke-Jimenez, M. Kemp, and C. Wang. Cadec: A corpus of adverse

drug event annotations. Journal of Biomedical Informatics, 55:73 – 81, 2015.

[59] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings

of the 2014 Conference on Empirical Methods in Natural Language Processing

(EMNLP), Doha, Qatar, 2014.

[60] S. Kinsella, A. Passant, and J. G. Breslin. Topic classification in social media using

metadata from hyperlinked objects. In Proceedings of the 33rd European conference

on Advances in information retrieval, pages 201–206, 2011.

[61] P. Kostkova. A Roadmap to Integrated Digital Public Health Surveillance: The

Vision and the Challenges. In Proceedings of the 22nd International Conference on

World Wide Web Companion, pages 687–694, 2013.

[62] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork. The SIDER database of drugs and

side effects. Nucleic Acids Research, 44(Database-Issue):1075–1079, 2016.



[63] A. Lamb, M. J. Paul, and M. Dredze. Separating Fact from Fear: Tracking Flu

Infections on Twitter. In HLT-NAACL, pages 789–795, 2013.

[64] C. E. Lamb, P. H. Ratner, C. E. Johnson, A. J. Ambegaonkar, A. V. Joshi, D. Day,

N. Sampson, and B. Eng. Economic impact of workplace productivity losses due to

allergic rhinitis compared with select medical conditions in the united states from

an employer perspective. Current Medical Research and Opinion, 22(6):1203–1210,

2006. PMID: 16846553.

[65] V. Lampos, T. De Bie, and N. Cristianini. Flu detector-tracking epidemics on Twit-

ter. In Machine Learning and Knowledge Discovery in Databases, pages 599–602.

2010.

[66] S. Le Cessie and J. Van Houwelingen. Ridge estimators in logistic regression. Applied

Statistics, pages 191–201, 1992.

[67] R. Leaman, R. I. Dogan, and Z. Lu. DNorm: disease name normalization with

pairwise learning to rank. Bioinformatics, 29(22):2909–2917, 2013.

[68] K. Lee, A. Qadir, S. A. Hasan, V. Datla, A. Prakash, J. Liu, and O. Farri. Adverse drug event detection in tweets with semi-supervised convolutional neural networks. In Proceedings of the 26th International World Wide Web Conference (WWW 2017), Perth, Australia, 2017.

[69] J. Li and C. Cardie. Early Stage Influenza Detection from Twitter. arXiv preprint

arXiv:1309.7340, 2013.

[70] N. Limsopatham and N. Collier. Normalising medical concepts in social media texts

by learning semantic representation. In Proceedings of the 54th Annual Meeting

of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016,

Berlin, Germany, Volume 1: Long Papers, 2016.

[71] D. Lindberg, B. Humphreys, and A. McCray. The Unified Medical Language System.

Methods of Information in Medicine, 32(4):281–291, 1993.

[72] S. Magruder. Evaluation of over-the-counter pharmaceutical sales as a possible early

warning indicator of human disease. Johns Hopkins APL technical digest, 24(4):349–

53, 2003.

[73] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval.

Cambridge University Press, New York, NY, USA, 2008.

[74] A. McCallum, K. Bellare, and F. C. N. Pereira. A conditional random field

for discriminatively-trained finite-state string edit distance. CoRR, abs/1207.1406,

2012.

[75] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press, 1998.

[76] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word repre-

sentations in vector space. CoRR, abs/1301.3781, 2013.



[77] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed represen-

tations of words and phrases and their compositionality. In Proceedings of the 27th

Annual Conference on Neural Information Processing Systems NIPS 2013, 2013.

[78] R. Narayanan. Mining Text for Relationship Extraction and Sentiment Analysis.

PhD thesis, 2010.

[79] A. Nikfarjam, A. Sarker, K. O’Connor, R. Ginn, and G. Gonzalez. Pharmacovig-

ilance from social media: mining adverse drug reaction mentions using sequence

labeling with word embedding cluster features. Journal of the American Medical

Informatics Association, 22:671–681, 2015.

[80] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using

machine learning techniques. In Proceedings of the ACL-02 conference on Empirical

methods in natural language processing-Volume 10, pages 79–86. Association for

Computational Linguistics, 2002.

[81] M. J. Paul and M. Dredze. You Are What You Tweet: Analyzing Twitter for Public

Health. In ICWSM, 2011.

[82] R. Pawankar, G. W. Canonica, S. T. Holgate, and R. F. Lockey.

WAO White Book on Allergy. http://www.worldallergy.org/UserFiles/file/WAO-White-Book-on-Allergy_web.pdf, 2011.

[83] F. Pervaiz, M. Pervaiz, N. Abdur Rehman, and U. Saif. FluBreaks: Early Epidemic

Detection from Google Flu Trends. J Med Internet Res, 14(5):e125, Oct 2012.

[84] P. M. Polgreen, Y. Chen, D. M. Pennock, F. D. Nelson, and R. A. Weinstein. Using

internet searches for influenza surveillance. Clinical infectious diseases, 47(11):1443–

1448, 2008.

[85] A. Prakash, S. A. Hasan, K. Lee, V. V. Datla, A. Qadir, J. Liu, and O. Farri.

Neural paraphrase generation with stacked residual LSTM networks. In COLING

2016, 26th International Conference on Computational Linguistics, Proceedings of

the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages

2923–2934, 2016.

[86] A. Prakash, S. Zhao, S. A. Hasan, V. Datla, K. Lee, A. Qadir, J. Liu, and O. Farri.

Condensed memory networks for clinical diagnostic inferencing. In The 31st AAAI

Conference on Artificial Intelligence (AAAI 2017), 2017.

[87] J. Quinlan. Improved use of continuous attributes in C4.5. arXiv preprint cs/9603103, 1996.

[88] E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 20(5):522–532, May 1998.

[89] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time

event detection by social sensors. In Proceedings of the 19th international conference

on World wide web, pages 851–860, 2010.



[90] H. Sampathkumar, X. Chen, and B. Luo. Mining adverse drug reactions from on-

line healthcare forums using hidden markov model. BMC Medical Informatics and

Decision Making, 14(1):1–18, 2014.

[91] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling.

Twitterstand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL Inter-

national Conference on Advances in Geographic Information Systems, pages 42–51,

2009.

[92] A. Sarker and G. Gonzalez. Portable automatic text classification for adverse drug

reaction detection via multi-corpus training. Journal of Biomedical Informatics,

53:196 – 207, 2015.

[93] J. Shaman, A. Karspeck, W. Yang, J. Tamerius, and M. Lipsitch. Real-time in-

fluenza forecasts during the 2012–2013 season. Nature Communications, 4, Dec.

2013.

[94] R. L. Siegel, K. D. Miller, and A. Jemal. Cancer statistics, 2017. CA: A Cancer

Journal for Clinicians, 67(1):7–30, 2017.

[95] A. Signorini, A. M. Segre, and P. M. Polgreen. The Use of Twitter to Track Levels

of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1

Pandemic. PLoS ONE, 6(5):e19467, 05 2011.

[96] M. Sofean and M. Smith. A real-time architecture for detection of diseases using

social networks: Design, implementation and evaluation. In Proceedings of the 23rd



ACM Conference on Hypertext and Social Media, HT ’12, pages 309–310, New York,

NY, USA, 2012. ACM.

[97] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text

classification in Twitter to improve information filtering. In Proceedings of the 33rd

international ACM SIGIR conference on Research and development in information

retrieval, pages 841–842, 2010.

[98] R. Sugumaran and J. Voss. Real-time Spatio-temporal Analysis of West Nile Virus

Using Twitter Data. In Proceedings of the 3rd International Conference on Com-

puting for Geospatial Research and Applications, pages 39:1–39:2, 2012.

[99] S. Tuarob, C. S. Tucker, M. Salathe, and N. Ram. Discovering health-related knowl-

edge in social media using ensembles of heterogeneous features. In Proceedings of

the 22nd ACM International Conference on Information & Knowledge Manage-

ment, CIKM ’13, pages 1685–1690, New York, NY, USA, 2013. ACM.

[100] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/.

[101] M. Wilson, S. Villalba, H. Avila, J. Hahn, and A. Cepeda. Correlation between

atmospheric tree pollen levels with three weather variables during 2002-2004 in a

tropical urban area. Journal of Allergy and Clinical Immunology, 127(2):AB170,

2011.

[102] W. Xing and A. Ghorbani. Weighted PageRank algorithm. 2004.



[103] S. R. Yerva, Z. Miklós, and K. Aberer. What have fruits to do with technology?: the

case of orange, blackberry and apple. In Proceedings of the International Conference

on Web Intelligence, Mining and Semantics, 2011.

[104] Q. Yuan, E. O. Nsoesie, B. Lv, G. Peng, R. Chunara, and J. S. Brownstein. Mon-

itoring Influenza Epidemics in China with Search Query from Baidu. PLoS ONE,

8:e64323, 05 2013.
