
NORTHWESTERN UNIVERSITY

Mining Social Media for Healthcare Intelligence

A DISSERTATION

SUBMITTED TO THE GRADUATE SCHOOL

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

for the degree

DOCTOR OF PHILOSOPHY

Field of Computer Science

By

Kathy Lee

EVANSTON, ILLINOIS

December 2017






© Copyright by Kathy Lee 2017

All Rights Reserved



ABSTRACT

Mining Social Media for Healthcare Intelligence

Kathy Lee

Social media such as Twitter has risen as a powerful new communication medium

for disseminating information on news, personal interests, experiences, and opinions. On

social media, people talk about their lifestyles, health conditions, and symptoms, search

for information on treatment options, and connect with people who have been through simi-

lar medical experiences to get emotional support. Such health information generated by

patients or family members is not available in medical documents created by health care

providers and became publicly available only recently with the prevalent use of microblog-

ging sites, which makes social media an invaluable source of health data to mine. However,

social media data is often short, unstructured, and written in colloquial language, and

these characteristics pose many interesting research questions.

In this thesis, we focused on mining public Twitter data for healthcare intelligence.

We designed models based on bag-of-words and social network structure features that

classify trending topics into general categories such as sports, technology and health.

This model could help identify trending topics and posts in the health domain and benefit

information retrieval tasks by reducing the search space to a domain of interest. We

also proposed a real-time digital disease surveillance system that uses spatial, temporal,

and text mining techniques to track disease activities. Our work was motivated by the

fact that, while traditional disease surveillance systems require 1-2 weeks to collect

and process data before it becomes publicly available, Twitter data is available in near

real time, and the aggregated social media data can provide an overall view of the health state of

the general population earlier than the traditional disease surveillance systems can. We

further built a neural network model that combines Twitter data with the observed data

from Centers for Disease Control and Prevention (CDC) to predict current and future

influenza activities. Our system can serve as a proxy for early detection of pandemics and

the resulting insights are expected to help facilitate faster response to and preparation

for epidemics. We also investigated the use of clinical knowledge sources to train deep

learning models for medical concept normalization in which health conditions described

in natural (colloquial) language are mapped to a standard clinical term. The proposed

model can help an automatic system to effectively interpret health concepts written in

layman’s language.

The studies presented in this thesis provide interesting insights into the application

of machine learning and text mining on social media data in the healthcare domain. We

hope our work motivates further study of online user-generated data to gain meaningful

healthcare insights.

Acknowledgements

I would like to thank Prof. Alok Choudhary for advising this research and for his

constant guidance and valuable feedback. He has inspired me to work on research problems

in the social media and healthcare domains. He has always been a steadying force helping me maintain

momentum in my research. I would also like to express my deep gratitude towards my

thesis committee members Prof. Wei-keng Liao and Prof. Ankit Agrawal.

I wish to thank my parents for their endless love and sacrifice, and providing me with

the best education. I dedicate this thesis to them. I also wish to thank my husband Seung

Woo and my two children Daniel and Ashley for patiently supporting me. Without their

encouragement and dedication, I would not have been able to successfully finish this long

journey.

Lastly, I would like to thank all members of the Center for Ultra-scale Computing and

Information Security (CUCIS) lab at Northwestern University for collaborating with me

and providing invaluable intellectual support.



Table of Contents

ABSTRACT 3

Acknowledgements 5

Table of Contents 6

List of Tables 10

List of Figures 13

Chapter 1. Introduction 18

Chapter 2. Twitter Trending Topic Classification 22

2.1. Introduction 22

2.2. Related Works 25

2.3. Data and Methods 26

2.3.1. Data Collection 27

2.3.2. Labeling 28

2.3.3. Data Modeling 30

2.3.3.1. Text-based Data Modeling 30

2.3.3.2. Network-based Data Modeling 31

2.3.4. Machine Learning 33

2.4. Experiments and Results 34



2.4.1. Text-based classification 34

2.4.2. Network-based classification 35

2.5. Summary 36

Chapter 3. Mining Social Media Streams to Improve Public Health

Allergy Surveillance 38

3.1. Introduction 38

3.2. Our Approach 40

3.2.1. Datasets 40

3.2.1.1. Twitter dataset 40

3.2.1.2. Ground Truth Data 40

3.2.2. Methodology 41

3.2.2.1. Data Preprocessing 41

3.2.2.2. Data Classification 41

3.2.2.3. Text Mining 44

3.2.2.4. Spatio-temporal Mining 46

3.3. Experimental Results 48

3.3.1. Text Analysis 48

3.3.2. Spatio-Temporal Analysis 50

3.4. Related Works 55

3.5. Summary 57

Chapter 4. Real-Time Digital Disease Surveillance using Twitter Data:

Demonstration on Flu and Cancer 59



4.1. Introduction 59

4.2. System Description 60

4.2.1. Geographical Analysis 62

4.2.2. Temporal Analysis 63

4.2.3. Text Analysis 66

4.3. Summary 67

Chapter 5. Forecasting Influenza Levels using Real-Time Social Media Streams 68

5.1. Introduction 68

5.2. Related Work 70

5.3. Method 72

5.3.1. Dataset 72

5.3.2. Data Preprocessing 73

5.3.3. Feature Selection 74

5.3.4. Predictive Modeling 77

5.4. Results 77

5.5. Summary 81

Chapter 6. Medical Concept Normalization 82

6.1. Introduction 82

6.2. Related Work 85

6.2.1. Social Media for Healthcare 85

6.2.2. Deep Neural Network Models 85

6.2.3. Concept Normalization 86



6.3. Model Description 87

6.3.1. Convolutional Neural Network (CNN) 87

6.3.2. Recurrent Neural Network (RNN) 88

6.4. Experimental Setup 90

6.4.1. Data 90

6.4.2. Data Sources for Word Embedding 92

6.4.2.1. Thesaurus (TH) 93

6.4.2.2. Medical Dictionary (MD) 94

6.4.2.3. Clinical Texts (CT) 94

6.4.2.4. Health-related Tweets (HT) 97

6.5. Results 97

6.5.1. Ablation Study 98

6.5.2. Qualitative Analysis 100

6.6. Summary 101

Chapter 7. Conclusion and Future Research Work 102

References 104

List of Tables

2.1 Five most similar topics of topic “macbook” in class technology. 33

3.1 Tweets with positive and negative labels. A tweet is positive if it talks

about the author or someone around the author having allergy. A tweet

is negative if it is a question or talks about news, general awareness or

information about allergies. 42

3.2 Classification performance of various classifiers using 10-fold cross

validation. The best classification performance (F-measure of 0.811 and

ROC area of 0.905) was obtained using NaiveBayesMultinomial (NBM). 42

3.3 A list of most frequently used bigrams where the second word is allergy,

ranked by frequency of use in the entire allergy corpus. It includes many

actual allergy types that have the ‘noun noun’ POS tag. 45

3.4 30 most frequently mentioned allergy types automatically extracted

by our algorithm. Numbers indicate the frequency rank of the 2-gram

in the allergy corpus, and +/- signs indicate whether it is

an actual allergy type (+) or not (-). 26 out of 30 were true positives

achieving a precision of 86.7%. 47



3.5 Most prevalent food allergies. The rank of the most prevalent food

allergies extracted from Twitter data is very similar to that obtained

from actual allergy patients’ data. 47

5.1 Examples of flu-related tweets. 74

5.2 CDC and Twitter features used in flu prediction model. 75

5.3 Twitter data improves prediction performance. 76

5.4 Comparison of current flu forecast model’s performance when different

learning rates and a varying number of hidden layers and hidden units

are used. The highest correlation of 0.9559 was obtained using learning

rate λ = 0.2 and one hidden layer with 4 activation units. 76

5.5 Comparison of 1-week ahead flu forecast model’s performance when

different learning rates and a varying number of hidden layers and

hidden units are used. The highest correlation of 0.929 was obtained

using learning rate λ = 0.2 and one hidden layer with 4 activation units. 76

6.1 Medical concepts in UMLS and example social media phrases that

describe the medical concept 83

6.2 Data Statistics after removing duplicates from the combined training,

validation, and test data 90

6.3 Examples of phrases with multiple labels 91

6.4 Data Statistics after removing concepts that had less than five examples 92

6.5 Medical concepts and similar words based on cosine similarity obtained

from word embeddings built with different health-related text corpora. 96

6.6 Classification Accuracy (%) using 10-fold cross validation (TH =

thesaurus, MD = medical dictionary, CT = clinical texts, HT =

health-related tweets, batch size = 50, number of epochs = 100, vector

dimension = 300) 97

6.7 Ablation Study. Comparison of models’ accuracy (%) when a feature is

removed from all possible feature sets (TH = thesaurus, MD = medical

dictionary, CT = clinical texts, HT = health-related tweets). The

numbers in parentheses indicate the performance drop when the feature

is removed. 99

6.8 TwADR-L examples that should have multiple labels 100



List of Figures

2.1 Tweets related to Trending Topic Boone Logan. 23

2.2 System Architecture. 26

2.3 Web interface deployed for manual labeling. Annotators read the trend

definition and tweets before labeling trending topics as one of the 18

classes. 28

2.4 Distribution of 768 topics across 18 classes. 29

2.5 Word cloud of trending topics in technology class 30

2.6 Trending topic “macbook” and its 5 similar topics 32

2.7 Text-based accuracy comparison over different classification techniques. 34

2.8 Network-based accuracy comparison over different classification

techniques. 36

3.1 Time-series graph of daily allergy levels detected in tweets (February

2013 - April 2015). Only those allergy-related tweets labeled as positive

are used to create the graph. The graph illustrates the general allergy

level trend over time. The allergy level is the highest in mid–May, goes

down in June and July, starts rising again in August, and reaches its

local maximum point in mid–September. Similar seasonal patterns are

observed in both 2013 and 2014. 49

3.2 Monthly average data for allergy tweet count (blue), daily highest

temperature (green), and pollen level (red) for Washington state (March

2013 – April 2015). Pollen level is highly correlated with ∆temperature

(correlation of 0.776) and ∆tweet count (correlation of 0.706). Tweet

count is also strongly correlated with temperature (correlation of

0.668). 50

3.3 Monthly distribution of mentions of peanut and pollen allergies (March

2013–April 2015). A huge seasonal variation is observed in monthly

pollen allergy (a seasonal allergy) level compared to that of peanut

allergy (a food allergy). 51

3.4 Time-series graph of tweet count for various allergy symptoms (Feb

2013–Sep 2014). The most common allergy symptom is sneezing (blue

line) throughout the year, followed by cough (green) and runny nose

(sky blue). 53

3.5 Distribution of allergy tweets with geolocations. The seasonal pattern

of allergy levels across the U.S. is clearly visible. Allergy level is the highest

in spring and the lowest in winter. 54

3.6 Bar chart comparing monthly social-media-sensed peanut and gluten

allergy levels for each U.S. state. The tweet count is normalized by state

census population and scaled to range between 0 and 100. In most US

states, peanut allergy level is higher than gluten allergy level. 55

4.1 Real-Time Disease Surveillance System continuously downloaded flu

and cancer related tweets and applied geographical, temporal, and

text mining. The real-time analysis data was visually reported as

U.S. disease activity maps, timelines, and pie charts on our project

websites [15][16]. 61

4.2 Our Real-Time Digital Flu Surveillance Website [16]. The ‘Daily Flu

Activity’ chart was an output of the temporal analysis and showed

the volume changes of tweets mentioning the word ‘flu’ over time.

The dramatic increase of flu tweet volume from Jan. 6 to Jan. 12

coincided with the dates when the major U.S. newspapers reported

Boston Flu Emergency [21] and deaths of four children from the AH3N2

influenza outbreak [20]. The ‘U.S. Flu Activity Map’ was an output

of the geographical analysis and showed the weighted percentage of

tweet volumes mentioning ‘flu’ by states. The level of flu activity was

differentiated by different colors for an easy comparison of U.S. regional

flu epidemics. 62

4.3 Flu Symptoms Timeline. The timeline displays tweet volume changes

mentioning different flu symptoms from January through March 2013.

‘Cough’ (green line) and ‘fever’ (dark orange line) reach their highest

level in mid-January and decrease as the actual national ILI level reported by

CDC decreases. 64

4.4 Distribution of Cancer Types in Tweets. 65

4.5 Distribution of Cancer Symptoms in Tweets. 65

4.6 Distribution of Cancer Treatments in Tweets. 65

4.7 Most Frequent Words in Flu Tweets. 66

5.1 Data collection and modeling process. Disambiguation, filtering and

network analysis were performed on continuously downloaded flu-related

tweets. Weekly time-series flu-related tweet counts were computed after

data was smoothed out to align with CDC data. Current and 1-week

ahead flu prediction models were built. 73

5.2 Data available at current week t. At the end of week t, all flu-related

Twitter data collected during current week t and prior are available. At

time t, the CDC data for the past two weeks (Wt−1 and Wt) is not available, as

CDC’s collection, retrospective analysis, and reporting take two weeks. 75

5.3 Structure of multilayer perceptron used in our influenza activity forecast

model. 78

5.4 Comparison of our current and 1-week ahead U.S. influenza activity

forecast results against CDC and Google Flu Trends data. For current

week prediction, a correlation coefficient of 0.9522 over 52 training data points

and a correlation coefficient of 0.929 over 19 held-out test data points

were obtained. For 1-week ahead forecast, a correlation coefficient of



0.895 over 52 training data points and a correlation coefficient of 0.71 over 19

previously unseen test data points were obtained. 79

6.1 Generic convolutional neural network architecture. 87

6.2 Generic recurrent neural network architecture. 89

6.3 Definition, example sentence, synonyms, related words, near antonyms

and antonyms for the word ‘sore’ obtained from Merriam-Webster

Thesaurus. 93

6.4 Medical definition of the term ‘myalgia’ obtained from Merriam-Webster

Medical Dictionary. 93

CHAPTER 1

Introduction

Social media has gained popularity as a new means for information sharing in the

last decade. The rise of social media along with advancements of mobile technologies

such as smart phones and tablets has changed communication patterns among friends

and families.

Twitter is one of the largest microblogging social networks, where people post short

text messages called tweets. Users can subscribe to receive tweets by following other

users they are interested in. If user A selects to receive all tweets posted by user B, A

is called a follower and B is called a friend of A. User A can follow user B back, but is

not obligated to do so. Users can select to share information publicly or privately within

small social circles. By default, tweets are publicly viewable by others unless the user

sets his/her Twitter account private, which makes Twitter a great real-time resource for

information search, where the latest news and events can be found faster than through any

other medium. People generally like to learn about news at the exact moment it is happening,

read and write information at their convenience, and search for what they want to know.

Users create hashtags, a pound sign (#) followed by a word or un-spaced phrase, to

dynamically tag user-generated posts, which makes searching tweets on a specific topic or

theme easy. Retweeting is a unique feature of Twitter that allows users to conveniently share

information with their followers, thereby letting information propagate faster

than other traditional media. On social media, users share news, events, experiences,

and opinions on various topics. The language used on Twitter has several distinctive characteristics.

While the 140-character limit on tweet text makes it fun and exciting for users to post a

tweet, it also makes the language short, noisy, and prone to misspellings and to frequent

use of emojis and acronyms. Also, users often mix multiple languages within the same post.

These characteristics pose many challenges for an automated system trying to accurately interpret

the meaning of such messages. These are relatively new problems generated by the unique ways of

interacting and communicating on social media.

Social media has a wide scope of applications. In business, it can be used for brand

awareness, targeted marketing, customer engagement and product reviews. In politics,

social media has been widely used for presidential campaigns, fundraising, and to measure

public opinions. In healthcare, patients use social media and online health forums to

search medical answers, seek medical advice on treatments, and to connect with other

patients for emotional support.

Twitter tracks trending topics to identify popular topics of discussion. Trending topics

can be unique to a specific geographic location or time, and the popularity is measured

by the volume of tweets mentioning specific keywords or hashtags. We classify trending

topics into general categories such as sports, news, music, science, technology, health, and

so on, to provide readers more context and help narrow down the search space. We explore

social network features (Twitter friend/follower network structure) as well as traditional

n-gram features for trending topic classification.

Mining social media for healthcare insights is a relatively new research area that

has emerged with the rapid growth of microblogging services in the last decade. We

built a real-time digital disease surveillance system that constantly collects, analyzes, and

visualizes the aggregated data. We studied distribution of disease types, symptoms and

treatments social media users talk about on three common diseases: cancer, allergy, and

influenza.

Cancer is a disease that involves abnormal cell growth and is among the leading causes

of death worldwide.1 In 2017, 1,688,780 new cancer cases and 600,920 cancer deaths are

projected to occur in the United States [94]. Allergy is another common disease a large

population suffers from; it is caused by hypersensitivity of the immune system driven by genetic

and various environmental factors. Roughly 7.8% of people aged 18 and over in the U.S. have

hay fever, a common allergic condition also known as allergic rhinitis.2 Prior studies have

shown that allergy symptoms are highly associated with lost work productivity [64]. Early

detection and treatment support can help reduce lost work productivity and potentially

reduce health care costs. Influenza is one of the most common viral infections; it

affects the lungs, nose, and throat. It is a contagious disease with symptoms similar to those of a

cold but usually more severe and longer lasting, and it can cause various complications leading

to death. In recent years, influenza activity tracking using social media has been a very

active area of research, following Google Flu Trends, which estimates the prevalence of influenza

activity using aggregated Google search query log data. Early detection of rising influenza levels

can help reduce the impact of a pandemic and provide more time to prepare an

emergency response. The Centers for Disease Control and Prevention (CDC) collects and

reports the prevalence of influenza-like illness (ILI) based on physician visit data across

the country with a two-week time lag. We explored using Twitter posts mentioning

1 https://www.cancer.gov/about-cancer/understanding/statistics
2 http://www.aaaai.org/about-aaaai/newsroom/allergy-statistics

symptoms of influenza as a real-time resource to track influenza levels and built neural-

network based real-time and 1-week ahead flu forecast models using both Twitter and

CDC data as features.

Users describe their health conditions and ask questions related to a certain disease or

treatment on social media. However, the colloquial nature of the languages used in social

media makes it difficult to automatically map the medical concepts present in the text to

standard medical terminologies. In addition, various ways of describing the same medical

condition pose an additional challenge for an automated system to understand the con-

texts. By mapping medical concepts in online user-generated texts to standard medical

ontology terms, automatic systems would be able to search relevant clinical resources such

as biomedical literature for clinical question answering, extract treatment information,

and use the aggregated large-scale clinical data to track and detect disease spread for

population health.

This work demonstrates that social media is a useful resource to obtain health-related

information and the aggregated personal health information can be used for population

health management. Our main contributions are building automatic systems that 1)

classify trending topics and posts into general categories to help information search in

a specific domain such as health [1], 2) mine Twitter data as a real-time resource to

monitor disease (allergy, cancer, influenza) activities [2, 3, 4], 3) predict current and

future influenza levels by combining social media data with observed data from CDC

for features [5], and 4) normalize medical concepts described in user-generated texts to

standard medical ontology terms [6].



CHAPTER 2

Twitter Trending Topic Classification

2.1. Introduction

Twitter1 is an extremely popular microblogging site, where users search for timely

and social information such as breaking news, posts about celebrities, and trending topics.

Users post short text messages called tweets, which are limited to 140 characters in length

and can be viewed by the user’s followers. Anyone who chooses to have another user’s tweets posted

on their timeline is called a follower. Twitter has been used as a medium for real-time

information dissemination and it has been used in various brand campaigns, elections, and

as a news medium. Since its launch in 2006, its popularity has been dramatically

increasing. As of June 2011, about 200 million tweets were being generated every day.

When a new topic becomes popular on Twitter, it is listed as a trending topic, which

may take the form of short phrases (e.g., Michael Jackson) or hashtags (e.g., #election).

What the Trend2 provides a regularly updated list of trending topics from Twitter. It is

very interesting to know what topics are trending and what people in other parts of the

world are interested in. However, a very high percentage of trending topics are hashtags,

the name of an individual, or words in other languages, and it is often difficult to understand

what the trending topics are about. It is therefore important to classify these topics into

general categories for easier understanding of topics and better information retrieval.

1 http://www.twitter.com
2 http://www.whatthetrend.com

Figure 2.1. Tweets related to Trending Topic Boone Logan.

The trending topic names may or may not be indicative of the kind of information

people are tweeting about unless one reads the trend text associated with it. For example,

#happyvalentinesday indicates that people are tweeting about Valentine’s Day. A trend

named Boone Logan indicates that tweets are about a person named Boone Logan.

Anyone who does not follow American Major League Baseball (MLB), however, will not

know that the information is regarding Boone Logan, who is a pitcher for the New York

Yankees, unless a few tweets from this trending topic are read, as shown in Figure 2.1.

We found that trend names were not indicative of the information being transmitted

or discussed either due to obfuscated names or due to regional or domain contexts. To

address this problem, we defined 18 general classes: arts & design, books, business, charity

& deals, fashion, food & drink, health, holidays & dates, humor, music, politics, religion,

science, sports, technology, tv & movies, other news, and other. Our goal was to aid users

searching for information on Twitter to look at only a smaller subset of trending topics by

classifying topics into general classes (e.g., sports, politics, books) for easier retrieval of

information.

To classify trending topics into these predefined classes, we proposed two approaches:

the well-known bag-of-words text classification and one based on social network information. In

this paper, we used supervised learning techniques to classify Twitter trending topics.

First, we employed a well-known text classification technique called Naive Bayes (NB)

[73]. NB models a document as the presence or absence of particular words.

A variation of NB is Naive Bayes Multinomial (NBM), which considers the frequency of

words and can be denoted as:

(2.1)        P(c | d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k | c),

where P(c | d) is the probability of a document d being in class c, P(c) is the prior prob-

ability of a document occurring in class c, and P(t_k | c) is the conditional probability of

term t_k occurring in a document of class c. A document d in our case is the trend definition

or the tweets related to each trending topic.
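
To make Eq. (2.1) concrete, here is a minimal sketch of bag-of-words classification with a multinomial Naive Bayes model using scikit-learn (our experiments used Weka and SPSS Modeler; the documents, labels, and query below are illustrative placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative placeholders: one "document" per trending topic
# (trend definition plus downloaded tweets), with one of the 18 class labels.
documents = [
    "apple announces new macbook pro with retina display",
    "yankees pitcher strikes out ten in tonight's mlb game",
]
labels = ["technology", "sports"]

# Bag-of-words term counts; MultinomialNB then estimates P(c) and P(t_k | c)
# and scores each class by P(c) * prod_k P(t_k | c), as in Eq. (2.1).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["new ipad tablet announced"])))
```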

Apart from text-based classification, we also incorporated Twitter social network in-

formation for topic classification. For the latter we made use of topic-specific influential

users [78], which were identified using Twitter friend-follower network. The influence

rank was calculated per topic using a variant of the Weighted Page Rank algorithm [102].

In general, a tweeter is said to have high influence if the sum of the influence of those

following him/her is high. The key idea of the proposed network-based approach was to

predict the category of a topic knowing the categories of its similar topics. Similar topics

were identified using a user-similarity metric, defined as the cardinality of the intersection

of the influential users of two topics t_i and t_j divided by the cardinality of the top s influ-

encers of topic t_i [78]. We experimented using different classifiers, for example, C5.0 (an

improved version of C4.5) [87], k-Nearest Neighbor (kNN) [23], Support Vector Machine

(SVM) [44], Logistic Regression [66], and ZeroR (the baseline classifier), and found that

C5.0 classifier resulted in the best accuracy on our data set. Experimental results showed

that both our approaches effectively classified trending topics with high accuracy, given

that it was an 18-class classification problem. This work was published in [1].
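
For illustration, the user-similarity metric described above might be sketched as follows (a minimal sketch assuming the ranked influential-user lists per topic have already been computed; all names are hypothetical):

```python
def user_similarity(influencers_i, influencers_j, s=100):
    """Similarity of topic t_j to topic t_i: the cardinality of the
    intersection of the two topics' influential users divided by the
    cardinality of the top-s influencers of t_i."""
    top_i = set(influencers_i[:s])
    top_j = set(influencers_j[:s])
    return len(top_i & top_j) / len(top_i)

# Hypothetical influencer lists, ranked by influence score.
print(user_similarity(["u1", "u2", "u3"], ["u2", "u3", "u4"], s=3))  # 0.666...
```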

2.2. Related Works

A number of recent papers have addressed the classification of tweets. Sriram et al. [97]

classified tweets to a predefined set of generic classes such as news, events, opinions, deals,

and private messages based on author information and domain-specific features extracted

from tweets such as presence of shortening of words and slangs, time-event phrases, opin-

ionated words, emphasis on words, currency and percentage signs, “@username” at the

beginning of the tweet, and “@username” within the tweet. Genc et al. [49] introduced

a wikipedia-based classification technique. The authors classified tweets by mapping

messages into their most similar Wikipedia pages and calculating semantic distances be-

tween messages based on the distances between their closest wikipedia pages. Kinsella

et al. [60] included metadata from external hyperlinks for topic classification on a social

media dataset. Whereas all these previous works used the characteristics of tweet texts or

meta-information from other information sources, our network-based classifier used topic-

specific social network information to find similar topics, and used categories of similar

topics to categorize the target topic.

Sankaranarayanan et al. [91] built a news processing system that identified tweets

corresponding to late breaking news. Issues addressed in their work included removing

the noise, determining tweet cluster of interest using online methods, and identifying

relevant locations associated with the tweets. Yerva et al. [103] classified tweet messages

to identify whether they were related to a company or not using company profiles that

were generated semi-automatically from external web sources. Whereas all these previous

works classified tweets or short text messages into two classes, our work classified trending topics

into 18 general classes such as sports, technology, politics, health, etc.

Becker et al. [27] explored approaches for distinguishing tweet messages between real-

world events and non-event messages. The authors used an online clustering technique

to group topically similar tweets together, and computed features that could be used to

train a classifier to distinguish between event and non-event clusters.

There had been a lot of research in sentiment classification of short text messages. Go

et al. [51] introduced an approach for automatically classifying sentiment of tweets with

emoticons using distant supervised learning. Pang et al. [80] classified movie reviews

to determine whether a review was positive or negative. However, none of these works classified

Twitter trending topics.

2.3. Data and Methods


Figure 2.2. System Architecture.



As shown in Figure 2.2, the proposed classification system consisted of four stages:

Data Collection, Labeling, Data Modeling, and Machine Learning. In our experiments, we

used two data modeling methods: (1) Text-based data modeling, and (2) Network-based

data modeling.

2.3.1. Data Collection

The website What the Trend provides a regularly updated list of the ten most popular topics

called “trending topics” from Twitter. A trending topic may be a breaking news story

or it may be about a recently aired TV show. The website also allows thousands of

users across the world to define, in a few short sentences, why the term is interesting or

important to people, which we refer to as “trend definition”. The Twitter API3 allows

high-throughput near real-time access to various subsets of public Twitter data. We

downloaded trending topics and definitions every 30 minutes from What the Trend and

all tweets that contained trending topics from Twitter while the topic was trending.

All the tweets containing a trending topic constituted a document. For example, while

the topic “superbowl” was trending, we kept downloading all tweets that contained the

word “superbowl” from Twitter, and saved the tweets in a document called “superbowl”.

In case a tweet contained more than two trending topics, the tweet was saved in all

relevant documents. For example, if a tweet contained two trending topics “superbowl”

and “NFL”, the same tweet was saved into two documents called “superbowl” and “NFL”.

From the 23,000+ trending topics that we had downloaded since February 2010, we randomly

selected 768 topics as our dataset.


3 https://dev.twitter.com/
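
A minimal sketch of this bucketing step (hypothetical data structures; the actual system polled What the Trend every 30 minutes and streamed tweets via the Twitter API):

```python
from collections import defaultdict

trending_topics = ["superbowl", "NFL"]   # refreshed every 30 minutes
documents = defaultdict(list)            # topic -> tweets mentioning it

def route_tweet(tweet_text):
    # A tweet that contains more than one trending topic is saved
    # into the document of every topic it mentions.
    for topic in trending_topics:
        if topic.lower() in tweet_text.lower():
            documents[topic].append(tweet_text)

route_tweet("Watching the superbowl, best NFL game in years!")
print({topic: len(tweets) for topic, tweets in documents.items()})
# {'superbowl': 1, 'NFL': 1}
```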

Figure 2.3. Web interface deployed for manual labeling. Annotators read
the trend definition and tweets before labeling trending topics as one of the
18 classes.

2.3.2. Labeling

We identified 18 classes for topic classification. The classes were art & design, books,

charity & deals, fashion, food & drink, health, humor, music, politics, religion, holidays

& dates, science, sports, technology, business, tv & movies, other news, and other. Since

Twitter is a primary source of news or information, the news related to political events
Figure 2.4. Distribution of 768 topics across 18 classes.

were classified as politics. If the topic was about news that was not in any of the categories,

it was classified as other news. If the trend definition or tweet text was gibberish or if it

was in a language other than English, then we classified the topic into the other category. The

data was labeled by reading a topic’s trend definition and a few tweets.

We used two annotators to label all topics. In case of disagreement, a third annotator

intervened. For the labeling task, a random sample of 1,000 topics was selected. From

the 1,000, we narrowed the data set down to 768 topics for two main reasons: first, some

topics had no trend definition; second, for some topics the third annotator could not finalize a label.

For each of the 768 topics in our dataset, the five most similar topics were also labeled,

which were required for the network-based modeling as described in Section 2.3.3.2. We

ended up manually labeling 3,005 topics because some of the similar topics were common

to more than one topic. Figure 2.3 shows the web interface we deployed for the labeling

task.

Figure 2.5. Word cloud of trending topics in the technology class.

The distribution of data over the 18 classes is provided in Figure 2.4. The sports

category had the highest number of topics (19.3%), followed by the other category (12%).

Except for categories other news, tv & movies, and music, all other categories contained

less than 6.8% of the topics. Figure 2.5 shows examples of trending topics that were

classified as technology.

2.3.3. Data Modeling

2.3.3.1. Text-based Data Modeling. In order to use text-based document models,

the data, which comprised each topic’s trend definition, tweets, and label, was processed in

two stages. In the first stage, for each topic, a document was created from trend defini-

tion and varying numbers of tweets (30, 100, 300, and 500). From the document text,

all tokens with hyperlinks were removed. This document was then assigned a label corre-

sponding to the topic. In the next stage, the document was run through a string-to-word

vector kernel, which consisted of two components. The first component was the tokenizer

that removed delimited characters and stop words. We used a customized stop words list

catered to Twitter lingo4. The second component transformed the tokens into tf-idf (term

frequency–inverse document frequency) weights [73]. Here, we experimented with up to

top 500 and 1,000 frequent terms per category. For each of the 18 labels, top most fre-

quent words with their tf-idf weights were used to build the dataset for machine learning

in the next step.

2.3.3.2. Network-based Data Modeling. As an alternate to text-based data model-

ing, in network-based data modeling we used Twitter specific social network information.

An interesting aspect of Twitter network structure is that a linkage indicates common

interest between two users and is directed and asymmetric. User A can freely choose to

follow user B without B’s consent and B does not necessarily have to follow A. We used

the algorithm from the User Similarity Model [78] to find the five most similar topics for a trend-

ing topic X. The algorithm used the classes of similar topics that were manually labeled

in Section 2.3.2 to predict the class of topic X. In the user similarity model, topic-specific

influential users were computed using Twitter social network information such as tweet

time, number of tweets made on a topic, and friend-follower relationship. Then, using

4 http://www.twithawk.com


Figure 2.6. Trending topic “macbook” and its 5 similar topics

the number of common influential users between two topics, the most similar topics were cal-

culated. Although the user similarity model captured different dimensions of similarity

such as temporal and geographical, our assumption was that a majority of the similar

topics would fall into the same category as the target topic and hence we could predict

the category of target topic using the categories of its similar topics.

Table 2.1 and Figure 2.6 show an example of the topic “macbook”, its five most similar

topics, and the number of common influential users between topic “macbook” and its similar

topics. Trending topic “macbook” was classified as technology by manual labeling, and

its five most similar topics (“iwork”, “magic trackpad”, “#landsend”, “apple ipad” and

“mobileme”) were manually labeled as technology, technology, charity & deals, technology,

technology. The numbers in Fig. 2.6 indicate the number of common influential users who

tweeted about both “macbook” and its similar topic. The resulting data for machine

learning in this case consists of 768 rows and 19 columns. Each row represents a trending

topic. 18 columns represent 18 classes and the last column represents the class label. Since

topic “macbook” has four similar topics in technology, the sum of the four common-

influential-user counts corresponding to its similar topics in technology (11+11+11+10=43)

becomes the value for row “macbook” and column technology in the table. And the value

corresponding to its similar topic “#landsend” becomes the value for row “macbook” and

column charity & deals.

Table 2.1. Five most similar topics of topic “macbook” in class technology.

Similar Topic Y      Class of Topic Y     Common Influential Users (Topics X and Y)
iwork                technology           11
magic trackpad       technology           11
#landsend            charity & deals      11
apple ipad           technology           11
MobileMe             technology           10
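
A minimal sketch of how one topic’s feature row might be assembled from its similar topics, mirroring the “macbook” example in Table 2.1 (only three of the 18 class columns are shown):

```python
from collections import defaultdict

CLASSES = ["technology", "charity & deals", "sports"]  # 3 of the 18 columns

# Five most similar topics of "macbook": (class, common influential users).
similar_topics = [("technology", 11),       # iwork
                  ("technology", 11),       # magic trackpad
                  ("charity & deals", 11),  # #landsend
                  ("technology", 11),       # apple ipad
                  ("technology", 10)]       # MobileMe

# Sum the common-influential-user counts per class to fill the feature row.
totals = defaultdict(int)
for cls, n_common in similar_topics:
    totals[cls] += n_common

row = [totals[c] for c in CLASSES] + ["technology"]  # last column: class label
print(row)  # [43, 11, 0, 'technology'] -- technology column = 11+11+11+10
```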

2.3.4. Machine Learning

The two datasets constructed as a result of the two approaches in the Data Modeling

stage were used as inputs to the machine learning stage. We built predictive models using

various classification techniques and selected the ones that resulted in the best classifica-

tion accuracy. The experimental results are discussed in the next section.



2.4. Experiments and Results

For our experiments, we used popular tools such as WEKA [100] and SPSS mod-

eler [56]. WEKA is a widely used machine learning tool that supports various modeling

algorithms for data preprocessing, clustering, classification, regression and feature selec-

tion. SPSS Modeler is another popular data mining tool with a graphical user

interface and high prediction accuracy. It is widely used in business marketing, resource

planning, medical research, law enforcement and national security. In all experiments, 10-

fold cross-validation was used to evaluate the classification accuracy. The ZeroR classifier

which simply predicts the majority class was used to get a baseline accuracy.

2.4.1. Text-based classification


Figure 2.7. Text-based accuracy comparison over different classification


techniques.

Using Naive Bayes Multinomial (NBM), Naive Bayes (NB), and Support Vector Ma-

chines with linear kernels (SVM-L), we found that the accuracy of classification

is a function of the number of tweets and the number of frequent terms used. Fig. 2.7 presents the comparison

of classification accuracy using different classifiers for text-based classification. TD repre-

sents the trend definition. Model(x,y) represents the classifier model used to classify topics,

with x number of tweets per topic and y top frequent terms. For example, NB(100,1000)

represents the accuracy using NB classifier with 100 tweets per topic and 1,000 most

frequent terms (from text-based modeling result).

In our experiments, the NB model always provided a lower accuracy than the NBM model

because NBM models word counts and adjusts the underlying probability calculations accordingly. SVM-L per-

formed better than NB but had slightly lower accuracy compared to NBM. If only the trend

definition was used, irrespective of the number of most frequent terms, the accuracy was much

lower for all three classifiers compared to using trend definition plus tweets. The experi-

mental results suggested that the NBM classifier using text from the trend definition, 100 tweets,

and a maximum of 1,000 word tokens per category gave the best accuracy of 65.36%.

2.4.2. Network-based classification

Fig. 2.8 compares classification accuracy of different algorithms for network-based classifi-

cation. Clearly, the C5.0 decision tree classifier gave the best classification accuracy (70.96%),

followed by k-Nearest Neighbor (63.28%), Support Vector Machine (54.349%), and Logis-

tic Regression (53.457%). The C5.0 decision tree classifier achieved 3.68 times higher accuracy

compared to the ZeroR baseline classifier. The 70.96% accuracy was very good consider-

ing that we categorized topics into 18 classes. To the best of our knowledge, the number


Figure 2.8. Network-based accuracy comparison over different classification


techniques.

of classes used in our experiment was much larger than the number of classes used in any

earlier research works (binary classification is the most common).

2.5. Summary

In this paper, we explored two different classification approaches for Twitter trending

topic classification. Apart from using text-based classification, our key contribution is

the use of social network structure rather than just textual information, which

can often be noisy in the context of social media such as Twitter, due to the heavy use

of Twitter lingo and the limit on the number of characters that users are allowed to

generate for their messages. Our results show that the network-based classifier performed

significantly better than the text-based classifier on our dataset. Considering that tweets are not

as grammatically structured as regular document texts, text-based classification using



Naive Bayes Multinomial provides fair results and can be leveraged in cases where we

may not be able to perform network-based analysis.



CHAPTER 3

Mining Social Media Streams to Improve Public Health

Allergy Surveillance

3.1. Introduction

Allergy is the fifth most common chronic disease in the United States1. The complex-

ity and severity of allergic diseases are increasing worldwide [82]. One in five Americans

has either allergy or asthma symptoms. In 2012, 7.5% of adults (17.6 million adults) and

9% of children (6.6 million children) were diagnosed with hay fever [30, 29]. Continuous

use of allergy medication can worsen patients’ health conditions and lead to side effects

and other serious medical complications. Furthermore, an increasing number of allergy

patients drives up allergy-related health care costs and leads to reduced work produc-

tivity. $7.9 million is spent annually on allergy-related health care systems and businesses.

Four million workdays are lost due to hay fever each year. Therefore, accurate allergy

surveillance and forecasting are important to minimize the health care costs and the work

productivity lost due to allergy symptoms.

Twitter, one of the largest social networking websites, allows users to post short text

messages called tweets that can be up to 140 characters in length. Twitter has over

328 million monthly active registered users. Twitter has been used as a valuable real-

time information resource for various applications. For instance, Twitter data have been

1 http://www.webmd.com/allergies/allergy-statistics

used to detect earthquakes in Japan [89], predict the stock market [33] and for an in-

depth study of the 2011 Egyptian Revolution [10]. On Twitter, people not only engage in general

chatter but also share photos, news, opinions, emotions, and even health conditions,

including symptoms and medications they are taking for their diseases. In recent years,

many researchers have investigated using Twitter for disease surveillance, especially for

influenza epidemic detection and prediction [81, 39, 22, 96, 26, 38, 65, 93, 69].

In this paper, we mined large-scale Twitter data collected over 28 months to monitor

allergy levels. More specifically, 1) a bag-of-words supervised learning approach was

employed to distinguish tweets that mentioned actual incidents of allergy from those that

talked about news or general awareness about allergy, 2) text-mining techniques such as

n-gram extraction and part-of-speech tagging were applied to extract predominant allergy

types, and 3) spatiotemporal mining was applied to track allergy levels over time and

space.

We believe that our work is the first framework towards real-time allergy surveillance

using a fine-grained spatiotemporal analysis on large-scale social media data. The data

analysis results reveal that Twitter is an excellent resource for detecting allergy prevalence.

Our proposed system helps reveal past and current trends in allergy levels detected

in the social media stream. The real-time analysis results are updated on our allergy project

website [14]. This work was published in [4].



3.2. Our Approach

3.2.1. Datasets

3.2.1.1. Twitter dataset. We collected allergy-related tweets from the public tweet stream

using Twitter’s streaming API2. We collected over 6.3 million tweets that mentioned

‘allergy’ or ‘allergies’ created by over 3.1 million unique users over 28 months from January

2013 to April 2015. Some talked about their allergy symptoms (e.g., Walked out of my

house confused as to why my eyes felt like they were on fire and then I realized it’s allergy

season.) while others talked about allergy types (e.g., I sneezed like eight times in a row.

This pollen allergy is killing me.) or allergy treatments/medication they took (e.g., sitting

in doctor’s office just to get an allergy shot.).

3.2.1.2. Ground Truth Data.

Pollen dataset. We collected monthly average pollen levels and 90-day historical pollen

levels for U.S. major cities from pollen.com3. The pollen level is a number between 0

and 12 and divided into five categories: 0.0-2.4 (low), 2.5-4.8 (low-med), 4.9-7.2

(medium), 7.3-9.6 (med-high), 9.7-12.0 (high).
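
A minimal sketch of this binning (category boundaries taken from the list above):

```python
def pollen_category(level):
    """Map a 0-12 pollen level onto pollen.com's five categories."""
    if level <= 2.4:
        return "low"
    if level <= 4.8:
        return "low-med"
    if level <= 7.2:
        return "medium"
    if level <= 9.6:
        return "med-high"
    return "high"

print(pollen_category(8.1))  # med-high
```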

Climate dataset. Climate Data Online (CDO)4 provides free access to National

Climatic Data Center (NCDC)’s archive of global historical weather and climate data.

We collected daily and monthly temperature and precipitation data generated since

January 2013 (because the earliest allergy-related Twitter data we had was generated in

January 2013) for major U.S. cities and states. More than a half of the climate data

2 https://dev.twitter.com/docs/streaming-apis
3 http://www.pollen.com/
4 http://www.ncdc.noaa.gov/cdo-web/

collecting stations did not report daily temperatures at all, and many, among those that

did report temperature, had missing values.

Allergy patients’ dataset. We used data from the first Quest Diagnostics Health

Trends allergy report, Allergies Across America5. This report is the largest analysis of

allergy testing of patients in the United States under evaluation for medical

symptoms associated with allergies. We collected a ranked list of the most prevalent food

allergies grouped by patients’ ages and a ranked list of the worst U.S. cities for different

allergy types.

3.2.2. Methodology

3.2.2.1. Data Preprocessing. As we were interested in messages that mentioned actual

allergy incidents, we removed all retweets (20.51% of our initial dataset) and tweets that

were not written in English (2.9% of our initial dataset). Special HTML characters were

replaced with human-readable characters (e.g., replaced &lt; with < (i.e., less-than sign),

replaced &gt; with > (i.e., greater-than sign)) and all hyperlinks were replaced with string

‘URL’.
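
A minimal sketch of these preprocessing steps (Python’s html module handles the special-character replacement; the URL pattern is a simplification):

```python
import html
import re

def preprocess(tweet):
    # Replace special HTML characters with human-readable ones
    # (e.g., &lt; -> <, &gt; -> >, &amp; -> &).
    text = html.unescape(tweet)
    # Replace every hyperlink with the string 'URL'.
    return re.sub(r"https?://\S+", "URL", text)

print(preprocess("Allergies &amp; a cold at once :( http://t.co/abc123"))
# -> "Allergies & a cold at once :( URL"
```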

3.2.2.2. Data Classification. While some tweets talked about a person having allergy

symptoms, other tweets talked about news, questions, general awareness of allergy sea-

son, or information/advertisements regarding allergy medicines/treatments. It is important

to distinguish tweets that mention actual allergy incidents to infer precise allergy levels.

Hence, we classified tweets into two classes. First, we manually labeled 2,000 randomly

selected tweets as positive or negative. A tweet was labeled as positive if it talked
5 https://www.questdiagnostics.com/dms/Documents/Other/2011_QD_AllergyReport.pdf

Table 3.1. Tweets with positive and negative labels. A tweet is positive if
it talks about the author or someone around the author having allergy. A
tweet is negative if it is a question or talks about news, general awareness
or information about allergies.
Positive(+1)/Negative(-1)    Tweet
+1 My allergies are going insane today.
(Author has allergy)
+1 Stupid allergies not letting me sleep.
(Author has allergy)
+1 Recently my lovely allergy to cats has led to my throat clos-
ing up n barely being able to breathe.
(Author has allergy)
+1 I never been able to enjoy spring cause my allergies. I hate
having itchy eyes and running nose.
(Author has allergy)
+1 @user1 @user2 and @user3 are all dying because of their
allergies.. and Im just sitting here.. #popapill
(People around author have allergies)
-1 In the United States, around 15 million people have food al-
lergies, according to Food Allergy Research and Education.
(News)
-1 Does anyone know good food near Happy Hollow that has
vegetarian options and is easy for seafood allergies?
(General question)
-1 Notice the increase in allergy ads on TV? Yep, spring is
around the corner.
(Awareness about spring season)
-1 RT @CureAllergies: What You Should Do To Manage Your
Allergies - URL.
(Information for allergy management)

Table 3.2. Classification performance of various classifiers using 10-fold


cross validation. The best classification performance (F-measure of 0.811
and ROC area of 0.905) was obtained using NaiveBayesMultinomial (NBM).
Classifier Precision Recall F-measure ROC Area
NBM 0.811 0.811 0.811 0.905
NB 0.799 0.793 0.793 0.864
Random Forest 0.812 0.800 0.799 0.888
SVM 0.818 0.810 0.809 0.814

about the author or someone around the author having allergy symptoms. A tweet was

labeled as negative if it talked about news, advertisement, or general awareness of al-

lergies. Table 3.1 shows example tweets with positive and negative labels. The text

in parenthesis indicates the reason for the positive or negative annotation. We used a

bag-of-words text classification where n-grams in documents were used as features. We

removed common stop words except the pronouns I, me, my, you, and your because we

found that these pronouns were important features in classifying tweets into positive and

negative examples of actual allergy incidents. To create features, we applied Weka [53]’s

StringToWordVector filter. All unigrams, bigrams, and trigrams were used to construct

the feature vector if they appeared at least twice in the training data. Then the filter

converted words into their stems, applied the TF-IDF weighting scheme, and kept the 500 most

frequently used n-grams in the final feature vector. We then explored four different ma-

chine learning algorithms (NaiveBayes (NB), NaiveBayes Multinomial (NBM), Random

Forest (RF), Support Vector Machine (SVM)) that are commonly used for text classifica-

tion. In our classification task, both precision and recall were equally important. Thus,

F-measure and ROC area were used to compare performance of classification algorithms.
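
An equivalent feature-extraction and classification pipeline can be sketched with scikit-learn as follows (our experiments used Weka’s StringToWordVector; stemming and the minimum-frequency cutoff are omitted here, and the labeled tweets are placeholders for the 2,000-tweet training set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder labeled tweets; the real training set had 2,000 of them.
tweets = ["My allergies are going insane today.",
          "Stupid allergies not letting me sleep.",
          "What You Should Do To Manage Your Allergies - URL",
          "Notice the increase in allergy ads on TV?"]
labels = [1, 1, -1, -1]  # +1: actual incident, -1: news/question/awareness

# Unigrams through trigrams, tf-idf weighted, capped at the 500 most frequent
# n-grams; the token pattern keeps single-letter tokens so that pronouns such
# as 'I' survive as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=500,
                             token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(tweets)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["my allergies are awful today"])))
```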

As shown in Table 3.2, the best classification performance (F-measure of 0.811 and

ROC area of 0.905) was obtained using NBM and 10-fold cross validation on labeled data.

We built a model using NBM on our training set, and classified all remaining tweets (after

removing retweets and non-English tweets) into positive or negative. We used NBM

because it had the best performance on our training data, and several prior works had

shown that NBM outperformed other classification algorithms. For example, McCallum

and Nigam [75] found NBM to outperform simple NB, especially at larger vocabulary

sizes, and Lee et al. [1] showed that the performance of NBM was better than that of NB

or SVM in 18-class tweet text classification. In our entire allergy corpus, 63% of tweets

were classified as positive and 37% as negative. Only tweets in the positive class

(i.e., tweets classified as mentions of actual allergy incidents) were used for our analysis.

TF-IDF (term frequency–inverse document frequency) [73]. The tf-idf measure allows

us to evaluate the importance of a word to a document. The importance is proportional

to the number of times a word appears in the document but is offset by the frequency of

the word in the corpus. Thus tf-idf is used to filter out common words.

NaiveBayes Multinomial (NBM) [75]. NB models a document as the

presence or absence of particular words. A variation of NB is Naive Bayes Multinomial

(NBM), which considers the frequency of words and can be denoted as:

(3.1)        P(c | d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k | c),

where P(c | d) is the probability of a document d being in class c, P(c) is the prior prob-

ability of a document occurring in class c, and P(t_k | c) is the conditional probability of

term t_k occurring in a document of class c.

3.2.2.3. Text Mining. We wanted to investigate whether we could automatically dis-

cover the most predominant allergy types that people suffer from or talk about on social

media by examining the texts in Twitter posts. From our allergy-related tweet corpus, we

extracted the most frequently occurring bigrams where the second word was ‘allergy’. An n-gram

is a contiguous sequence of n words in a text. N-gram models are widely used

in statistical natural language processing.



Table 3.3. A list of most frequently used bigrams where the second word is
allergy, ranked by frequency of use in the entire allergy corpus. It includes
many actual allergy types that have the ‘noun noun’ POS tag.
Rank Most Frequently Used 2-grams POS-tag
1. food allergy noun noun
2. peanut allergy noun noun
3. gluten allergy noun noun
4. nut allergy noun noun
5. natural allergy adjective noun
6. hate allergy verb noun
7. skin allergy noun noun
8. lower allergy comparative-adjective noun
9. cat allergy noun noun
10. milk allergy noun noun
11. issues allergy verb noun
12. worst allergy superlative-adjective noun
13. dog allergy noun noun
14. severe allergy adjective noun
15. pollen allergy noun noun

Part-Of-Speech (POS) tagging is a process of tagging a word with a part-of-speech

(lexical category) such as noun, pronoun, verb, adjective, etc. We applied POS tagging

to each bigram. For example, the POS tag for string ‘natural allergy’ is ‘adjective noun’

and the POS tag for string ‘peanut allergy’ is ‘noun noun’. Table 3.3 shows the list of

15 most frequently used bigrams and corresponding POS tags in the descending order of

frequency of use.

Our assumption was that the POS tag of all allergy types (e.g., food allergy, nut

allergy, pollen allergy, dust allergy, egg allergy) should be in the form of ‘noun noun’ and,

therefore, we could obtain a list of allergy types by removing all bigrams that were not in

‘noun noun’ form. In other words, we needed to remove all bigrams that contained non-

nouns (e.g., natural allergy (adjective noun), worst allergy (superlative-adjective noun))

to get the final list of allergy types. All bigrams that contained a Twitter screen name

(e.g., @username), stop words, or non-English words were also removed.
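A minimal sketch of this extraction procedure follows, assuming NLTK's tokenizer and Penn Treebank POS tagger; the toy tweets and the exact tag names are illustrative, and the thesis does not prescribe a specific tagger.

# Hedged sketch of the allergy-type extraction: count bigrams whose second
# word is 'allergy', POS-tag them, and keep only 'noun noun' patterns.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
from collections import Counter
from nltk import word_tokenize, pos_tag

tweets = ["my peanut allergy is acting up again", "worst allergy season ever"]

bigram_counts = Counter()
for text in tweets:
    tokens = word_tokenize(text.lower())
    for w1, w2 in zip(tokens, tokens[1:]):
        if w2 == "allergy":
            bigram_counts[(w1, w2)] += 1

allergy_types = []
for (w1, w2), freq in bigram_counts.most_common():
    tags = pos_tag([w1, w2])  # e.g., [('peanut', 'NN'), ('allergy', 'NN')]
    if all(tag.startswith("NN") for _, tag in tags):  # keep 'noun noun' only
        allergy_types.append((w1 + " " + w2, freq))
print(allergy_types)  # e.g., [('peanut allergy', 1)]; 'worst allergy' is filtered out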

3.2.2.4. Spatio-temporal Mining. Every tweet comes tagged with a timestamp that

indicates the time when the tweet was posted. For example, the timestamp ‘Sun Mar 02

05:55:02 +0000 2014’ indicates that the tweet was created on Sunday, March 2, 2014 at

5:55am GMT (Greenwich Mean Time). Since we were interested in tracking allergy levels

over time, we used the timestamps to count the volume of tweets posted each day that

mentioned allergy or a specific allergy type, symptom, or treatment.
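For illustration, here is a minimal sketch of the daily counting step; the timestamp format string follows Twitter's documented created_at format, and the two timestamps are placeholders.

# Minimal sketch of the daily tweet-volume counting using the created_at
# timestamp format shown above.
from collections import Counter
from datetime import datetime

timestamps = ["Sun Mar 02 05:55:02 +0000 2014",
              "Sun Mar 02 18:10:44 +0000 2014"]

daily_volume = Counter()
for ts in timestamps:
    dt = datetime.strptime(ts, "%a %b %d %H:%M:%S %z %Y")  # Twitter's format
    daily_volume[dt.date()] += 1
print(daily_volume)  # Counter({datetime.date(2014, 3, 2): 2})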

There are two types of tweet location: a sensor-based geolocation and a text-based user

profile location. A geolocation provides the exact location where the tweet was posted

with latitude and longitude values. This data is available to others only if the Twitter

user selects it to be publicly available. Twitter users can also identify their home location

in their Twitter user profile. We examined user profile locations and extracted state information.

Examples of users’ home locations that had state information were ‘Riverside, CA’, ‘some-

where in NY’ and ‘Gainesville, Florida’. Examples of home locations that lacked state

information were ‘Home Sweet Home’, ‘Somewhere over the rainbow’ and ‘Traveling’.

We tagged each tweet with a 2-character state code (e.g., CA for California) if we

successfully extracted the state information from the Twitter user profile.

Some tweets had both geolocation and user profile location, some had one or the other,

and the rest did not have any location information. Geolocations were first translated

into human-readable addresses using reverse geocoding API6 and then the state name was

extracted from the address. For tweets that did not have geolocation, we obtained state

6https://developers.google.com/maps/documentation/geocoding/

name from the user profile. Those that did not have any of the two locations were not

used in the spatial analysis.
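A hedged sketch of the state-tagging logic for profile locations follows; the state table is truncated to three entries for brevity, and the matching rules are a simplification of what such a system might use (tweets with geolocations would instead go through a reverse-geocoding service).

# Hedged sketch of state tagging from free-text profile locations. The state
# table is truncated; a full implementation would cover all 50 states.
import re

STATE_NAMES = {"california": "CA", "new york": "NY", "florida": "FL"}  # ...
STATE_CODES = set(STATE_NAMES.values())

def state_from_profile(location):
    """Return a 2-character state code extracted from a profile string, or None."""
    if not location:
        return None
    lowered = location.lower()
    for name, code in STATE_NAMES.items():
        if name in lowered:
            return code
    match = re.search(r"\b([A-Z]{2})\b", location)  # e.g., 'Riverside, CA'
    if match and match.group(1) in STATE_CODES:
        return match.group(1)
    return None

print(state_from_profile("somewhere in NY"))       # NY
print(state_from_profile("Gainesville, Florida"))  # FL
print(state_from_profile("Home Sweet Home"))       # None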

Table 3.4. 30 most frequently mentioned allergy types automatically ex-


tracted by our algorithm. Numbers indicate the rank of frequency the 2-
gram appears in the allergy corpus and +/- signs indicate whether it is an
actual allergy type (+) or not (-). 26 out of 30 were true positives, achieving
a precision of 86.7%.
Rank Allergy Types Rank Allergy Types
1. food allergy (+) 16. shellfish allergy(+)
2. peanut allergy (+) 17. claritin allergy(-)
3. gluten allergy (+) 18. drug allergy(+)
4. nut allergy (+) 19. eye allergy (+)
5. skin allergy (+) 20. asthma allergy (-)
6. cat allergy (+) 21. sun allergy (+)
7. milk allergy (+) 22. mucinex allergy (-)
8. dog allergy (+) 23. prescription allergy (-)
9. pollen allergy(+) 24. nickel allergy (+)
10. spring allergy(+) 25. meat allergy (+)
11. latex allergy (+) 26. bee allergy (+)
12. dairy allergy (+) 27. alcohol allergy (+)
13. dust allergy (+) 28. seafood allergy (+)
14. egg allergy (+) 29. mite allergy (+)
15. wheat allergy(+) 30. penicillin allergy (+)

Table 3.5. Most prevalent food allergies. The rank of the most prevalent
food allergies extracted from Twitter data is very similar to that obtained
from actual allergy patients’ data.
Ground Truth Twitter Data
Rank Most prevalent food allergies (Age>10) Rank Most mentioned food allergies
1. peanut allergy 1. food allergy
2. wheat allergy (gluten allergy) 2. peanut allergy
3. soybean allergy 3. gluten allergy
4. milk allergy 4. nut allergy
5. egg allergy 5. milk allergy
6. dairy allergy
7. egg allergy
8. wheat allergy

3.3. Experimental Results

3.3.1. Text Analysis

Allergy Types. Instead of using a pre-defined keyword list, we automatically identified

allergy types mentioned in our dataset by using natural language processing methods.

For the ground truth data, we created a list of allergy types by combining data from

multiple online resources7. Table 3.4 lists the top 30 most frequently mentioned allergy

types extracted from our allergy corpus by applying methods described in section 3.2.2.3.

The numbers indicate the rank of frequency (1 means the highest frequency, 30 means the

lowest frequency). The signs in the parenthesis indicate whether the extracted allergy type

is positive (an actual allergy type) or negative (not an actual allergy type). Out of top 30

allergy types, 26 were true positives and only 4 were false positives, leading to precision

of 86.7%. Two of the four false positive cases (claritin, mucinex) were allergy medicines,

and the other two cases were allergy-related disease (asthma) and term (prescription).

The traditional method that uses a pre-defined keyword list often fails to identify new

types of diseases, and new keywords (i.e., new disease types) have to be manually added.

However, with our proposed method that automatically identifies disease types, we would

not need the step where new disease types are manually added.

Most Prevalent Food Allergies. We further evaluated our Twitter data analysis

results by comparing it to the real-world allergy patients’ data. Table 3.5 shows the

ground truth value of the most prevalent food allergies in allergy patients in the first

column and the list of most mentioned food-related allergy types from table 3.4. We
7http://www.foodallergy.org/allergens, http://www.webmd.com/allergies/guide/
allergy-symptoms-types, http://acaai.org/allergies/types, http://www.healthline.com/
health/allergies/alcohol

used the data for patients older than age ten because most Twitter users fell into this

age group. The allergy types in the two columns appear in a very similar order of ranking.

Note that gluten and wheat allergy can be considered the same, and milk and dairy allergy

can also be considered the same. This shows not only that the extracted allergy types are

precise in identifying actual allergy types, but also that the ranking of prevalent allergy

types has a very strong relationship to the real-world allergy patients’ data.

Figure 3.1. Time-series graph of daily allergy levels detected in tweets (Feb-
ruary 2013 - April 2015). Only those allergy-related tweets labeled as posi-
tive are used to create the graph. The graph illustrates the general allergy
level trend over time. The allergy level is the highest in mid–May, goes
down in June and July, starts rising again in August, and reaches its local
maximum point in mid–September. Similar seasonal patterns are observed
in both 2013 and 2014.

Figure 3.2. Monthly average data for allergy tweet count (blue), daily high-
est temperature (green), and pollen level (red) for Washington state (March
2013 – April 2015). Pollen level is highly correlated with ∆temperature
(correlation of 0.776) and ∆tweet count (correlation of 0.706). Tweet count
is strongly correlated with temperature (correlation of 0.688).

3.3.2. Spatio-Temporal Analysis

In the temporal model, we tracked activities of allergy, various allergy types, symptoms and

medications over time using tweet timestamps. Figure 3.1 shows the allergy-related tweet

volume changes over a two-year period from February 2013 through April 2015. The

allergy level reaches its annual global maximum in mid-May and a local maximum in mid-

September and this seasonal pattern is observed in both 2013 and 2014. The increased

number of people chatting about their allergies in May and in September indicates that a

Figure 3.3. Monthly distribution of mentions of peanut and pollen allergies


(March 2013–April 2015). A huge seasonal variation is observed in monthly
pollen allergy (a seasonal allergy) level compared to that of peanut allergy
(a food allergy).

very large population suffers from spring allergies such as tree pollen allergies and there

is also a quite large population that has allergy symptoms in the fall.

To validate our experimental results, we compared our Twitter data against the actual

pollen levels and the weather data. Because pollen levels and temperatures vary depending

on location, we partitioned allergy-related Twitter data into a finer space granularity (U.S.

state level). Figure 3.2 compares three trend-lines: allergy tweet timeline (blue), monthly

average pollen level (red), and monthly mean max temperature (green) for Washington

state. We show the data for Washington state, not just because a large volume of allergy-

related tweets were generated in WA but also because the ground truth temperature

data for WA was available for all dates from March 2013 through April 2015. It is clear

from the graph that all three trend lines illustrate seasonality. An interesting pattern is

that there is an order in time of three trend lines reaching their maximum and minimum

points. The pollen level starts rising first and reaches its peak, followed by tweet counts

and temperature. The trend lines also decrease in the same order.

Our analysis shows that the pollen level is highly correlated with the rate of tempera-

ture change (correlation of 0.776) as well as the rate of tweet count change (correlation of

0.706). In other words, pollen level reaches its peak point when the temperature sharply

increases in spring and, at the same time, allergy-related tweet volume also sharply in-

creases. Also, tweet count has a strong correlation with daily temperature (correlation

of 0.688), meaning allergy tweet count increases as the temperature increases. The high

correlation values show how well the social media data reflects the real-world allergy

activities and hence can be a good source of health information.
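The following sketch illustrates how such correlations can be computed, assuming aligned monthly series; the input numbers are placeholders, so the printed values will not reproduce the correlations (0.776, 0.706, 0.688) reported above, which came from the real data.

# Sketch of the correlation analysis on aligned monthly series. np.diff
# computes the month-over-month change (the Δ series).
import numpy as np

tweets = np.array([120.0, 180, 340, 520, 300, 150])  # monthly tweet counts
pollen = np.array([1.0, 2.5, 7.8, 9.1, 4.0, 1.5])    # monthly pollen levels
temp   = np.array([48.0, 55, 63, 72, 78, 82])        # monthly max temperature

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]  # Pearson correlation coefficient

print(corr(pollen[1:], np.diff(temp)))    # pollen level vs. Δtemperature
print(corr(pollen[1:], np.diff(tweets)))  # pollen level vs. Δtweet count
print(corr(tweets, temp))                 # tweet count vs. temperature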

In Figure 3.3, we show how the trend of mentions of two different allergy types differ

over time. The tweet volume mentioning ‘pollen allergy’ (a seasonal allergy) rises very high

during the spring and the fall and remains very low in the summer. However, unlike pollen

allergy, the tweet volume mentioning ‘peanut allergy’ (a food allergy) stays relatively

constant throughout the year. Note that we also carried out the same experiment at the

U.S. state level and observed similar patterns in each state. This observation implies that

the seasonality observed in the overall allergy dataset in figures 3.2 and 3.3 comes from

tweets mentioning various seasonal-allergy-related terms such as spring, tree pollen, or

hay fever, rather than terms related to non-seasonal allergies such as dog, cat, milk or

egg.

Figure 3.4. Time-series graph of tweet count for various allergy symptoms
(Feb 2013–Sep 2014). The most common allergy symptom is sneezing (blue
line) throughout the year, followed by cough (green) and runny nose (sky
blue).

Figure 3.4 is a time-series graph showing tweet volume changes for different allergy

symptoms. Sneezing (blue) is the most common allergy symptom throughout the year,

followed by cough (green), runny nose (sky blue), watery eyes (red), and itchy throat

(turquoise). It is very interesting that the rank for different allergy symptoms on each day

is consistent throughout the year. Note that the percentage of Twitter users who make

their location publicly available has been steadily increasing since we started collecting

our data.

(a) Feb 2013 (b) May 2013 (c) Aug 2013 (d) Nov 2013

Figure 3.5. Distribution of allergy tweets with geolocations. The seasonal


pattern of allergy levels across the U.S. is clearly visible. The allergy level is the
highest in spring and the lowest in winter.

For 20% of the tweets in our allergy data set, we were able to identify U.S. state

names. 11.4% of those had actual geolocation (longitude and latitude) values. For the

remaining 88.6%, state names were extracted from the user profile locations.

Figure 3.5 shows monthly snapshots of tweets with geolocations that help us visualize

allergy levels across the U.S. We show quarterly seasonal maps for 2013. Each red dot

on the map represents a tweet that was posted from the location. This map shows a

general spatiotemporal trend of allergy activities. The allergy level starts increasing in

early spring and gets extremely severe in May. It remains high throughout the summer,

and goes down in the fall. Interestingly, most allergy-related tweets come from the eastern

part of the country although there are some from the west coast.

Next, using the U.S. state information we obtained from geolocations and user profile

locations, we visualized the distribution of tweets that mentioned different allergy types.

Figure 3.6 compares levels of peanut allergy (blue bar) and gluten allergy (red bar) de-

tected by social media sensors for each U.S. state. Because a greater number of tweets

were generated from states that had larger population, tweet counts were normalized by

state population and scaled to range between 0 and 100. Kansas had the highest level of

peanut allergy (94.51). South Dakota had the lowest level of both allergy types (3.85 for

Figure 3.6. Bar chart comparing monthly social-media-sensed peanut and


gluten allergy levels for each U.S. state. The tweet count is normalized by
state census population and scaled to range between 0 and 100. In most
US states, peanut allergy level is higher than gluten allergy level.

peanut allergy and 0 for gluten allergy). Most states had higher levels of peanut allergy

than gluten with a few exceptions. For example, unlike most other states, Oregon (OR),

Delaware (DE), and Montana (MT) had higher gluten allergy levels.
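A minimal sketch of this normalization follows; the tweet counts and state populations are illustrative placeholders, not the values behind Figure 3.6.

# Sketch of the per-state normalization: tweet counts divided by state census
# population, then min-max scaled to [0, 100].
raw_counts = {"KS": 310, "SD": 12, "OR": 95}               # placeholder counts
population = {"KS": 2_900_000, "SD": 850_000, "OR": 3_900_000}

rates = {s: raw_counts[s] / population[s] for s in raw_counts}
lo, hi = min(rates.values()), max(rates.values())
scaled = {s: 100 * (r - lo) / (hi - lo) for s, r in rates.items()}
print(scaled)  # each state's allergy level on a 0-100 scale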

3.4. Related Work

Before the Internet was widely used, over-the-counter pharmaceutical sales data [72]

and telephone triage data [47] were among the methods that were used for surveillance

of diseases.

Disease Surveillance using online data. In the past decade, with the dra-

matic increase of internet use, online data had been extensively used to retrieve health

information and to detect disease activities. Web search query data had been studied

to track influenza activity. Ginsberg et al. [50] used flu-related Google search query

data to estimate current flu activity near real time, 1-2 weeks in advance of the records

by the traditional flu surveillance system8. Recent research on public health and dis-

ease surveillance using online data has mostly focused on monitoring and predicting

influenza levels. Researchers had used Twitter data to monitor influenza outbreak and to

predict flu activities. Signorini et al. [95] attempted estimating current influenza activity

by tracking public sentiment and applying support vector machine algorithm on Twitter

data generated during the Influenza A H1N1 pandemic. Chew et al. [41] analyzed the

contents and sentiment of tweets generated during the 2009 H1N1 outbreak and showed

the potential and feasibility of using social media to conduct infodemiology studies for

public health. There are many others who have used Twitter data for flu outbreak detec-

tion [81, 39, 22, 96, 26, 38, 65, 93, 69]. Unlike earlier researchers who used Twitter

for flu activity detection and prediction, to the best of our knowledge, our work was the

first attempt to examine allergy activities using a large-scale Twitter stream.

Tweet Classification. Aramaki et al. [24] proposed a Twitter-based influenza epi-

demics detection method that used Natural Language Processing (NLP) to filter out

negative influenza tweets. Tuarob et al. [99] used ensemble machine learning techniques

to identify health-related messages in a heterogeneous pool of social media data. In our

work, we used a bag-of-words model and explored using four different machine learning

algorithms to find the best model to classify tweets into those that mention actual allergy

incidents and those that mention general awareness or information about allergy season.

8http://www.cdc.gov/flu/

Study of relationship between weather, pollen, and allergy. Many researchers

have studied the relationship between weather and pollen levels and how it affects severity

of allergy symptoms in patients [45, 101, 46]. In our work, the allergy levels were

extracted from social media data instead of from allergy patients, and we studied the

relationship between the trend of allergy-related tweets and the actual pollen levels and

temperatures at the U.S. state level.

In this work, we focused on examining only allergy activity using a large Twitter

stream collected over two years and showed in-depth spatiotemporal analysis results. We

also applied natural language processing techniques to automatically identify prevalent

allergy types from Twitter contents.

3.5. Summary

In this work, we proposed a system that monitored allergy levels near real-time by an-

alyzing streaming Twitter data. We first classified tweets to identify those that mentioned

actual allergy incidents using a bag-of-words model and a Naive Bayes Multinomial classifier

and then used those tweets with positive labels for text and spatiotemporal analysis.

We used text-mining techniques to automatically detect predominant allergy types.

The top thirty allergy types extracted by our algorithm had a precision of 86.7%. The

experimental results further showed that the rank of the most prevalent food allergy

types detected from tweet stream was highly correlated to the ground truth value, the

ranked list of prevalent allergies, obtained from real-world allergy patients’ data.

We demonstrated that the time-series graph of tweets mentioning seasonal-allergy-related

terms (e.g., pollen) showed clear seasonal patterns (a large volume of tweets in the spring

and a low volume of tweets in the winter) whereas those mentioning non-seasonal allergy

related terms (e.g., peanut) remained relatively constant throughout the year. By study-

ing relationships between allergy tweets and the pollen and weather data, we showed

that all three datasets had similar seasonal patterns and allergy tweet data had a very strong

relationship with the daily maximum temperature (correlation of 0.688).

We believe that our work was the first study that examined large-scale social media

data for in-depth analysis of allergy activities. Although our work had specifically focused

on studying allergy activities, the model could be generalized to track activities of other

diseases.

CHAPTER 4

Real-Time Digital Disease Surveillance using Twitter Data:

Demonstration on Flu and Cancer

4.1. Introduction

The Internet is usually the first place people turn for health information. People

search for a specific disease, symptoms, and appropriate medical treatments, and often

make decisions whether they should go see a doctor based on the search results. Healthcare

portal sites and the social media are popular online health information resources among

U.S. Internet users [?]. Disease surveillance is the monitoring of clinical syndromes such

as flu and cancer that have a significant impact on medical resource allocation and health

policy. Disease surveillance plays an important role in minimizing the harm caused by the

outbreaks by constantly observing the disease spread. The traditional approach employed

by the Centers for Disease Control and Prevention (CDC) [18] for flu surveillance includes

the collection of Influenza-like Illness (ILI) patients’ data from sentinel medical practices.

The main drawback of this method is the 1-2 weeks time lag between the time of medical

diagnosis and the time when the data becomes available. Early detection of a disease

outbreak is critical because it would allow faster communication between health agencies

and the public, and provide more time to prepare a response.



We built a novel real-time disease surveillance system that used Twitter data to track

U.S. influenza and cancer activities. Twitter1 is a popular micro-blogging service where

users can post short messages. Twitter’s popularity as a medium for real-time information

dissemination has been constantly increasing since its launch in 2006. The proposed sys-

tem continuously downloads flu and cancer related Twitter data using Twitter streaming

API [17] and applies spatial, temporal, and text models on this data to discover national

flu and cancer activities and popularity of disease-related terms. The outputs of the three

models are summarized as pie charts, time-series graphs, and U.S. disease activity maps

on our project websites [15][16] in real time. This demonstration built upon and ex-

tended our previous work [2]: text analysis of the most frequently occurring

terms was added. We further extended our real-time disease surveillance system to track

cancer activities in addition to flu. This work was published in [3].

4.2. System Description

Figure 4.1 shows the architecture of our real-time flu and cancer surveillance system.

Our dataset consisted of all recent tweets that mentioned the keywords ‘flu’ or ‘cancer’.

We collected over 6 million flu-related tweets generated by more than 3.3 million unique

users for 5.5 months since October 16, 2012, and over 3.7 million cancer-related tweets

generated by more than 1.3 million unique users for 3 months since January 7, 2013.

Such big data presents a number of challenges due to its size and complexity, relating

to its storage, retrieval, analysis, and visualization, especially when the whole process is

required to be done in real-time as in this work. Our system was designed to be a disease

surveillance system that is (almost) always available, robust, and easily scalable for big
1https://twitter.com/


Figure 4.1. Real-Time Disease Surveillance System continuously down-


loaded flu and cancer related tweets and applied geographical, temporal,
and text mining. The real-time analysis data was visually reported as
U.S. disease activity maps, timelines, and pie charts on our project web-
sites [15][16].

data. Different from many other related big data projects, which performed analytics on a

massive, static dataset, our system consisted of a cluster of several transactional databases

and high-dimensional data warehouses which were updated in real time. In our proposed

system, three types of analytics were considered - geographical/spatial, temporal, and

textual, the results of which were suitably presented pictorially, as described next.

Figure 4.2. Our Real-Time Digital Flu Surveillance Website [16]. The
‘Daily Flu Activity’ chart was an output of the temporal analysis and
showed the volume changes of tweets mentioning the word ‘flu’ over time.
The dramatic increase of flu tweet volume from Jan. 6 to Jan. 12 coin-
cided with the dates when the major U.S. newspapers reported Boston Flu
Emergency [21] and deaths of four children from the AH3N2 influenza out-
break [20]. The ‘U.S. Flu Activity Map’ was an output of the geographical
analysis and showed the weighted percentage of tweet volumes mentioning
‘flu’ by states. The level of flu activity was differentiated by different colors
for an easy comparison of U.S. regional flu epidemic.

4.2.1. Geographical Analysis

The goal of geographical analysis was to track disease spread in U.S. states by measuring

the volume of flu/cancer tweets generated in the region. For our experiments, we used

users’ home locations in their Twitter profiles. The dataset for geographic analysis was

all users who mentioned ‘flu’ or ‘cancer’ and had a valid U.S. state information (e.g.,

‘Evanston, IL’, ‘somewhere in NY’) in their home location fields. We excluded tweets

generated from outside the U.S. (i.e., tweets from foreign countries) and those with invalid

location information (e.g., ‘travelling’, ‘Wherever the wind blows me’). In our flu dataset,

there were 458,828 users with valid U.S. state information, and in our cancer dataset,

there were 193,797 users with valid U.S. state information. The U.S. Flu Activity Map

is shown in Figure 4.2. The tweet volume mentioning ‘flu’ generated in each state was

normalized by the population of the state.

4.2.2. Temporal Analysis

The goal of temporal analysis was to track the volume changes of tweets mentioning the

disease and related terms over time.

Disease Daily Activity Timeline. As shown in Figure 4.2, the Daily Flu Activity chart

shows the tweet volume changes of flu-related tweets over a three-month period from

January through March 2013. The data for the flu/cancer timeline was created by counting the

number of tweets mentioning ‘flu’ or ‘cancer’ generated daily. Our assumption was that

people would talk more about ‘flu’ when they themselves or people around them (e.g.,

family or friends) had flu symptoms and there would be more frequent news feeds when

the epidemic was widespread. Achrekar et al. [22] reported that the volume of flu-related

tweets was highly correlated with the number of reported ILI cases by the CDC. In the flu

timeline, the number of flu related tweets started increasing on January 6 and reached its

peak on January 12, which coincides with the date when The Huffington Post reported

the death of four children from the outbreak of AH3N2 influenza [20]. This showed how

our temporal analysis effectively reflected the wide spread of the epidemic.

Figure 4.3. Flu Symptoms Timeline. The timeline displays tweet volume
changes mentioning different flu symptoms from January through March
2013. ‘Cough’ (green line) and ‘fever’ (dark orange line) reach their highest
level in mid January and decrease as the actual national ILI level by CDC
decreases.

Types, Symptoms, Treatments Timelines. We not only tracked the overall flu and

cancer activities, but also monitored disease types, symptoms, and treatments over time.

Figure 4.3 shows the daily tweet volume changes for various flu symptoms. From the

timeline chart, we could easily tell the types and levels of flu symptoms in the general

population at a specific point in time. Cough and fever were the two most dominant

symptoms throughout the entire flu season, and headache and sore throat were the next two

most common flu symptoms. The actual U.S. national influenza activity level (percentage

weighted Influenza-like Illness by the CDC) was plotted as red squares for reference. Tweet

volumes mentioning flu symptoms reached their highest point around mid January and

decreased as the actual flu activity level from the CDC decreased.

Figure 4.4. Distribution of Cancer Types in Tweets.

Figure 4.5. Distribution of Cancer Symptoms in Tweets.

Figure 4.6. Distribution of Cancer Treatments in Tweets.



4.2.3. Text Analysis

In text analysis, we revealed deep health insights by examining the content of the tweets.

We were interested in investigating the popularity of terms used in three categories: (1)

disease types (2) symptoms (3) treatments, and created a keyword list for each category.

For example, the keyword list for cancer types was a list of breast cancer, lung cancer,

skin cancer, brain cancer, etc., the keyword list for cancer symptoms was a list of lump,

cough, fatigue, weight loss, etc., and the keyword list for cancer treatments was a list of

surgery, radiation, chemotherapy, Emend, Xeloda, etc. We also had similar keyword lists

for ‘flu’. For ‘flu’, we had 9 flu types, 15 symptoms, and 31 treatments. For ‘cancer’, we

had 58 cancer types, 21 symptoms, and 63 treatments. Figures 4.4, 4.5, and 4.6 show the

distribution of tweets mentioning a keyword in cancer types, symptoms, and treatments

keyword lists.

Figure 4.7. Most Frequent Words in Flu Tweets.

We were interested in investigating which words frequently co-occurred with a disease

name. After tokenizing tweet texts and removing all stop words, we counted the number of

occurrences of each unique word. Our flu dataset (6,097,406 tweets) consisted of 83,896,915

words and 4,001,445 unique words. Figure 4.7 shows the top 20 most frequent words in

our entire flu dataset.
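A minimal sketch of this word-frequency computation follows; the stop word list and tweets are illustrative placeholders.

# Sketch of the co-occurring word count: tokenize, drop stop words, count
# unique words across the flu corpus.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "is", "and", "i", "my", "to", "of", "have"}  # ...

counts = Counter()
for tweet in ["I think I have the flu", "flu shot and a fever"]:
    for token in re.findall(r"[a-z']+", tweet.lower()):
        if token not in STOP_WORDS:
            counts[token] += 1
print(counts.most_common(20))  # top-20 words, as reported in Figure 4.7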

4.3. Summary

We built a real-time disease surveillance system that used Twitter data to automati-

cally track flu and cancer activities. The experiments showed that our disease detection

system could map U.S. regional influenza and cancer activity levels near real-time, discover

and compare popularity of terms related to flu/cancer types, symptoms, and treatments.

The system could also effectively track daily flu/cancer activities and the volume changes

of tweets mentioning disease related terms over time. All of the output data was visualized

as interactive maps, pie charts, and time series graphs on our project websites [15][16].

Our system is highly scalable and can be easily extended to track other diseases. Because

the system is completely automated, it would be a very low-cost alternative to

the traditional high-cost disease surveillance systems that collect public health data from

sentinel medical practices.



CHAPTER 5

Forecasting Influenza Levels using Real-Time Social Media

Streams

5.1. Introduction

Seasonal influenza is an acute viral infection that can cause severe illnesses and com-

plications. For instance, the annual epidemics cause about 250,000 to 500,000 deaths

worldwide. Centers for Disease Control and Prevention (CDC) reported 105 pediatric

deaths due to influenza during 2012-2013 flu season1. Monitoring of disease activity en-

ables an early detection of disease outbreaks, which will facilitate faster communication

between health agencies and the public, thereby providing more time to prepare a re-

sponse. Disease surveillance helps minimize the impact of a pandemic and enables better

resource allocation. The traditional influenza surveillance system by CDC reports weekly

national and regional Influenza-Like Illness (ILI) physicians visit data collected from sen-

tinel medical practices2. This data is updated once a week and there is typically a two-

week time lag before the data is published. Furthermore, the published data is updated

for several more weeks as more clinical data is gathered.

For an early detection of influenza activity, Ginsberg et al.[50] proposed a method

that used flu-related online search engine query data to estimate the current flu activity

with one day reporting lag, 1-2 weeks ahead of CDC, and its estimation had been known
1http://www.cdc.gov/flu/spotlights/children-flu-deaths.htm
2http://www.cdc.gov/flu

to be reasonably accurate for most parts. However, in February 2013, an article titled

“When Google got flu wrong” [35] reported Google Flu Trends’ overestimation of the peak

of U.S. flu activity, which was almost double that of CDC’s observations.

During the last decade, the number of internet and social networking site users has

dramatically increased. People share ideas, events, interests and their life stories over the

internet. As of January 2017, Twitter has 100 million daily active users and 500 million

tweets are generated per day3. Experiences and opinions on various topics including

personal health concerns, symptoms and treatments are shared on Twitter. Mining such

publicly available health related data potentially provides valuable healthcare insights.

Furthermore, the increasing number of users that access social media platforms on their

mobile devices makes social media data an invaluable source of real-time information.

In this paper, we proposed a model that (1) predicted future influenza activities,

(2) provided more accurate real-time assessment than before, and (3) combined real-

time social media data streams and CDC historical datasets for predictive models to

accomplish accurate predictions. The results showed that our model using multilayer

perceptron with back propagation on a large-scale Twitter data could forecast current

and future flu activities with high accuracy. The goal of our work was to predict expected

influenza activity for the future, a week or more ahead of time so that it could be used

for planning, intervention, resource allocation and prevention. Furthermore, we aimed to

exploit social media communication for the prediction. This work was published in [5].

3https://www.omnicoreagency.com/twitter-statistics/

5.2. Related Work

For an early detection of disease outbreaks, researchers had used different statistical

and machine learning algorithms on different sources of data. Over-the-counter phar-

maceutical sales data [72] and telephone triage [47] had been used for surveillance of

ILI. Christakis et al. [43] studied whether monitoring of social friends could provide early

detection of flu outbreaks. Web search query data had been used for influenza surveil-

lance [48, 55, 84, 104, 50, 93, 83]. Ginsberg et al. [50] used flu-related Google search

queries data to estimate current flu activity and the near real-time estimation was reported

on Google Flu Trends (GFT) website4. Researchers had used GFT data to build an early

detection system for flu epidemics [83, 93]. Shaman et al. [93] used GFT data and

WHO/NREVSS collaborating laboratories data to estimate flu activity. The estimated

data was then recursively used to optimize a population-based mathematical model that

predicted flu activity. Pervaiz et al. [83] developed FluBreaks5, an early warning system

for flu epidemics using Google Flu Trends.

The use of social networking sites for public health surveillance had been steadily

increasing in the past few years [37]. Most disease surveillance works using social media

data focused on Twitter. A unique feature of Twitter is that messages propagate

in real time. Many had used Twitter data to predict various real world outcomes [89,

26, 32].

For current estimation of influenza activity, Signorini et al. [95] applied a support vector

regression algorithm to the Twitter stream generated during the influenza A H1N1 pandemic

to track public sentiment, and Achrekar et al. [22] used an auto-regression with exogenous inputs
4http://www.google.org/flutrends
5http://www.newt.itu.edu.pk/flubreaks

(ARX) model on Twitter data. In our previous work, we built a real-time disease surveil-

lance website that tracked U.S. regional and temporal flu activities including popularity

of terms related to flu types, symptoms, and treatments [2, 3]. Aramaki et al. [24] pro-

posed a Twitter-based influenza epidemics detection method that used natural language

processing (NLP) to filter out negative influenza tweets. Chew et al. [41] analyzed con-

tent and sentiment of tweets generated during the 2009 H1N1 outbreak and showed the

potential and feasibility of using social media to conduct infodemiology studies for public

health.

Paul and Dredze [81] applied the Ailment Topic Aspect Model to track illnesses over time

(syndromic surveillance), measure behavioral risk factors, localize illnesses by geographic

region, analyze symptoms and medication usage, and showed the broad applicability of

Twitter data for public health research. Li [69] proposed Flu Markov Network (Flu-MN),

a spatio-temporal unsupervised Bayesian algorithm based on a 4-phase Markov Network

for flu activity prediction. Lampos et al. [65] proposed an automated tool that tracked

ILI in the United Kingdom using a regression model and Bolasso, the bootstrapped ver-

sion of LASSO, for feature extraction from Twitter data. Lamb et al. [63] classified tweets

into different categories to distinguish those that reported infections versus those that ex-

pressed concerns about flu, tweets about authors versus tweets about others in an attempt

to improve the performance of influenza surveillance. Researchers had studied the diversity

of tweets [57] and run a real-time spatio-temporal analysis of West Nile virus using Twitter
data [61]. Sugumaran and Voss advised integrating existing epidemic systems, those
data [61]. Sugumaran and Voss advised to integrate existing epidemic systems, those

that used crowd-sourcing, news media (e.g., GPHIN, MedISys), mobile/sensor network,

and real-time social media intelligence, for an improved early disease outbreak system [98].

Chakraborty et al. [38] combined social indicators and physical indicators and used a ma-

trix factorization-based regression approach using neighborhood embedding to predict ILI

incidences in 15 Latin American countries.

Retrospective analysis and current estimates are important as they can describe the

observed trends. However, further prediction of future flu levels can represent a big leap

because such predictions provide actionable insights for public health that can be used for

planning, resource allocation, treatments and prevention. In contrast to other approaches,

we proposed a system that not only estimated current flu activity more accurately, but

also forecasted future influenza activities a week in advance beyond the current week

using aggregated ILI data by CDC and real-time Twitter data. The results showed that

our proposed model using multilayer perceptron with back-propagation algorithm could

forecast both current and future influenza activities with high accuracy.

5.3. Method

The data collection and modeling process is illustrated in Figure 5.1.

5.3.1. Dataset

We continuously downloaded publicly available tweets that mentioned ‘flu’ using Twitter

Streaming API6. The dataset used in this paper consisted of 20 million tweets generated

between December 2012 and May 2014. 71 weeks’ data (from week 1, 2013 until week 19,

2014) were used to build the model. Disambiguation of tweets was performed using text

analysis techniques to understand if a tweet was about a person talking about his/her own

flu or about someone else’s or if there were any mentions of common symptoms. Table 5.1
6https://dev.twitter.com/docs/streaming-apis


Figure 5.1. Data collection and modeling process. Disambiguation, filter-


ing and network analysis were performed on continuously downloaded flu-
related tweets. Weekly time-series flu-related tweet counts were computed
after data was smoothed out to align with CDC data. Current and 1-week
ahead flu prediction models were built.

lists examples of flu-related tweets. In the category column, user indicates that the tweet

is about the Twitter user being sick with flu, someone else indicates that the tweet is

about someone else (friends, family, etc.) being sick with flu, and symptom indicates

that the tweet describes one’s flu symptoms. Data was filtered to remove tweets that may

contain product advertisements (or links to websites) and, using network analysis, repeated

tweets by the same persons were filtered out.

5.3.2. Data Preprocessing

The following data preprocessing steps were taken on Twitter data.



Table 5.1. Examples of flu-related tweets.


Tweet Category
I’ve got the worst flu ever... already D: user
After a week sick in bed with the flu, look what I just woke up to! user
trying to get over this flu... I had completely forgot how much harder user
it is to deal with it during pregnancy.. feeling like death :”c
This flu and cough is killing me T.T user, symptom
Coding OAuth2 filters with a flu and fever... I look better with a user, symptom
mask on!
@friend feel better! The flu is nooo fun! Huggs!! someone else
My roommate has the flu and I get sick really fast I am packing my someone else
stuff and won’t be returning
please pray for my mom she’s caught the flu and is extremely ill at someone else
this moment
Sore throat, fever, flu, headache, cough. Uhuk uhuk symptom
sick with flu, sore throat, and slight fever. symptom

• Smoothing: We took a 7-day moving average of daily tweet volume to identify the

long-term flu activity trend by smoothing out the fluctuations and noise in the

short-term data. Moving average is a popular technique for analyzing time-series

data that is often used in financial data analysis such as stock prices.

• Weekly counts and alignment: Weekly Twitter data was then computed

by summing smoothed daily tweet volumes from Sunday through Saturday. The

dates for weekly Twitter data were aligned with dates in CDC weekly surveillance

reports so that analysis and predictions could be validated with CDC reports.

• Normalization: Weekly data was normalized by dividing each weekly data point

by the maximum of the 72 weekly data points. (A code sketch of these three
preprocessing steps follows this list.)
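The following is a hedged sketch of the three preprocessing steps on a daily tweet-count series, assuming pandas as one possible implementation; 'W-SAT' makes each resampled week end on Saturday so that weekly sums align with CDC's Sunday-through-Saturday reporting weeks, and the daily counts are placeholders.

# Hedged sketch of the preprocessing pipeline (smoothing, weekly alignment,
# normalization) on placeholder daily counts.
import pandas as pd

daily = pd.Series(
    [100, 120, 90, 150, 130, 140, 110, 105, 95, 160, 170, 180, 120, 115],
    index=pd.date_range("2013-01-06", periods=14, freq="D"),  # starts on a Sunday
)

smoothed = daily.rolling(window=7).mean()  # 1) 7-day moving average
weekly = smoothed.resample("W-SAT").sum()  # 2) Sun-Sat weeks, aligned with CDC
normalized = weekly / weekly.max()         # 3) divide by the series maximum
print(normalized)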

5.3.3. Feature Selection

In order to perform predictive modeling, features from the data were defined and extracted

as described below. Figure 5.2 depicts the data available at the end of week t. Wt denotes

!"#$% % %!"#& %%%%%%%%%!"#' %%%%%%%%%!"#( %%%%%%%!" %%%%%!")( %%%%%!")' %%%!")&%%%%

*+*%+,",%,-,./,0/1%234/%!"#'%

56.718%+,",%,-,./,0/1%234/%!"%

Figure 5.2. Data available at current week t. At the end of week t, all flu-
related Twitter data collected during current week t and prior are available.
At time t, the past two weeks’ (Wt−1 and Wt ) CDC data is not available as
CDC’s collection, retrospective analysis and reports take two weeks.

Table 5.2. CDC and Twitter features used in flu prediction model.

Notation Description
CDC-4-3-2 CDC ILI Data for Wt−4 , Wt−3 , Wt−2
CDC-3-2 CDC ILI Data for Wt−3 , Wt−2
CDC-2 CDC ILI Data for Wt−2
Twitter-4-3-2-1-0 Twitter Data for Wt−4 , Wt−3 , Wt−2 , Wt−1 , Wt
Twitter-3-2-1-0 Twitter Data for Wt−3 , Wt−2 , Wt−1 , Wt
Twitter-2-1-0 Twitter Data for Wt−2 , Wt−1 , Wt
Twitter-1-0 Twitter Data for Wt−1 , Wt
Twitter-0 Twitter Data for Wt

the current week and any time window beyond this represents the future. Wt−n denotes n

week(s) prior to current week, and Wt+n denotes n week(s) after current week. Each week

starts on Sunday and ends on Saturday to align with CDC weekly data. CDC data for

current week, Wt , and the week before, Wt−1 , is not available due to the time it takes to

collect patients data from the sentinel practices. The latest available CDC data is weekly

data for Wt−2 .

Since we were able to download publicly available tweets in real time, we had all

Twitter data generated during Wt . We used the most recent 5 weeks’ data for both CDC

and Twitter in our experiments. We experimented with different combinations of CDC

and Twitter data shown in table 5.2 as features of our predictive model to find the best

Table 5.3. Twitter data improves prediction performance.


Current Forecast
Feature Correlation Coefficient Improvement
CDC-4-3-2 Twitter-4-3-2-1-0 0.9525 +2.93%
CDC-4-3-2 0.9232
1-Week Ahead Forecast
Feature Correlation Coefficient Improvement
CDC-3-2 Twitter-4-3-2-1-0 0.9268 +6.37%
CDC-3-2 0.8631

Table 5.4. Comparison of current flu forecast model’s performance when


different learning rates and a varying number of hidden layers and hidden
units are used. The highest correlation of 0.9559 was obtained using learning
rate λ = 0.2 and one hidden layer with 4 activation units.
Number of activation units in first and second hidden layers
Learning Rate 2-0 3-0 4-0 5-0 2-2 3-2 4-2 5-2 2-3 3-3
λ = 0.1 0.9517 0.9496 0.9501 0.946 0.7359 0.8843 0.8976 0.9008 0.8973 0.9143
λ = 0.2 0.9548 0.954 0.9559 0.9527 0.9482 0.9481 0.9469 0.946 0.9498 0.9485
λ = 0.3 0.953 0.9548 0.9532 0.9499 0.9509 0.9511 0.95 0.9495 0.9518 0.9512
Number of activation units in first and second hidden layers
Learning Rate 4-3 5-3 2-4 3-4 4-4 5-4 2-5 3-5 4-5 5-5
λ = 0.1 0.9038 0.9115 0.915 0.9117 0.9182 0.9134 0.9168 0.9176 0.9256 0.9224
λ = 0.2 0.9465 0.9457 0.9501 0.948 0.9472 0.9455 0.9502 0.9483 0.9472 0.9466
λ = 0.3 0.9495 0.9492 0.9521 0.9506 0.9504 0.9491 0.9523 0.951 0.9504 0.9496

Table 5.5. Comparison of 1-week ahead flu forecast model’s performance


when different learning rates and a varying number of hidden layers and
hidden units are used. The highest correlation of 0.929 was obtained using
learning rate λ = 0.2 and one hidden layer with 4 activation units.
Number of activation units in first and second hidden layers
Learning Rate 2-0 3-0 4-0 5-0 2-2 3-2 4-2 5-2 2-3 3-3
λ = 0.1 0.9115 0.9176 0.9064 0.9018 0.8919 0.894 0.8907 0.8908 0.8984 0.8947
λ = 0.2 0.8996 0.904 0.929 0.9268 0.88 0.8843 0.8792 0.8768 0.8917 0.883
λ = 0.3 0.8491 0.8845 0.9268 0.8944 0.8831 0.878 0.8788 0.8775 0.887 0.8799
Number of activation units in first and second hidden layers
Learning Rate 4-3 5-3 2-4 3-4 4-4 5-4 2-5 3-5 4-5 5-5
λ = 0.1 0.8937 0.8931 0.8958 0.8981 0.8961 0.895 0.8957 0.8979 0.8981 0.8969
λ = 0.2 0.8806 0.8804 0.8948 0.8957 0.8877 0.8833 0.8965 0.8939 0.8916 0.8869
λ = 0.3 0.8759 0.8775 0.8893 0.8846 0.9023 0.8767 0.8902 0.9055 0.881 0.8824

features for influenza prediction. The model was trained and validated using 10-fold cross

validation on 71 weeks data. As shown in table 5.3, the best feature for the current

flu level forecast model was feature CDC-4-3-2 Twitter-4-3-2-1-0 (latest 3 weeks’ CDC

plus latest 5 weeks’ Twitter data) with a correlation coefficient of 0.9525, a +2.93%

performance improvement over feature CDC-4-3-2 (latest 3 weeks’ CDC data). The best

feature for 1-week ahead prediction model was CDC-3-2 Twitter-4-3-2-1-0, which resulted

in correlation coefficient of 0.9268, with +6.37% improvement over CDC-3-2. This clearly

showed that adding Twitter data significantly improved the performance of both current

and future flu level forecasts compared to that using only past CDC data.

5.3.4. Predictive Modeling

The proposed model had two parts. The first estimated current flu activity in terms of

percentage of ILI-related physicians visit (2 weeks ahead of CDC data). The second part

was forecasting future influenza activity a week into the future (3 weeks ahead of CDC

data). We used multilayer perceptrons (MLP) with back propagation as it had the best

performance among many learning and predictive modeling algorithms we experimented

with in forecasting both current and future influenza activities. In our experiments, we

used 3-layer MLP with 4 activation units in the hidden layer. The network structure for

our current flu activity forecast model is shown in figure 5.3.
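As an illustration of the model configuration described above, the following sketch uses scikit-learn's MLPRegressor as a stand-in for the 3-layer perceptron with back-propagation (one hidden layer with 4 activation units, learning rate 0.2); the feature matrix and targets are random placeholders for the 71 weekly CDC-4-3-2 Twitter-4-3-2-1-0 feature vectors and %weighted ILI values, not the data used in this work.

# Hedged sketch of the forecast model; X and y are random placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((71, 8))  # columns: CDC W_{t-4..t-2} and Twitter W_{t-4..t}
y = rng.random(71)       # CDC percentage weighted ILI for week t

model = MLPRegressor(hidden_layer_sizes=(4,), solver="sgd",
                     learning_rate_init=0.2, max_iter=2000, random_state=0)
model.fit(X, y)
print(model.predict(X[:1]))  # estimate of current-week flu activity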

5.4. Results

Table 5.4 and 5.5 show how the performance of current and 1-week ahead forecast

model changed with different values of learning rate and a varying number of hidden lay-

ers and units in each hidden layer, respectively. In the notation “A-B”, A indicates the number

of activation units in first hidden layer (layer 2) and B indicates the number of activation

units in second hidden layer (layer 3). Both the current and the 1-week ahead forecast

models achieved the best performance using learning rate λ = 0.2 and 3-layer multilayer

perceptron structure (input layer, 1 hidden layer, output layer) with 4 activation units in

the hidden layer as shown in Figure 5.3.

Figure 5.3. Structure of the multilayer perceptron used in our influenza
activity forecast model. The inputs are CDC_Wt-4, CDC_Wt-3, CDC_Wt-2,
Tweets_Wt-4, Tweets_Wt-3, Tweets_Wt-2, Tweets_Wt-1, and Tweets_Wt;
the output is %WEIGHTED_ILI_Wt.

Current Influenza Activity Estimation

Our current flu forecast model used CDC-4-3-2 Twitter-4-3-2-1-0 (i.e., all currently

available CDC and Twitter data generated in recent 5 weeks) as features because it gave

the highest correlation of 0.9525 when the model was trained and validated using 10-fold

cross validation on 71 weeks data. Although our Twitter dataset had been collected for

1.5 years, each week’s data made only one data point for the weekly flu activity forecast

model. To best utilize the number of available data points, we built the initial model

using the first one year data (52 data points for year 2013) with 10-fold cross validation.

Then, each week, we incrementally built a new model with all available data points. For

example, a new model was trained using 52 data points (week 1, 2013 – week 52, 2013) to
make current flu level prediction for week 1, 2014. Then a newer model was built again

using 53 data points (week 1, 2013 – week 1, 2014) to make the current prediction for

week 2, 2014. As we continued to collect more Twitter data, the model would be trained

on a larger data set and therefore be more robust.

Figure 5.4. Comparison of our current and 1-week ahead U.S. influenza
activity forecast results against CDC and Google Flu Trends data. For the
current week prediction, a correlation coefficient of 0.9522 over the 52 training
data points and a correlation coefficient of 0.929 over the 19 held-out test
data points were obtained. For the 1-week ahead forecast, a correlation
coefficient of 0.895 over the 52 training data points and a correlation coefficient
of 0.71 over the 19 previously unseen test data points were obtained. (a)
Current U.S. influenza activity. (b) 1-week ahead U.S. influenza activity.

Figure 5.4 is a time-series graph that compares our flu activity prediction (red line)

against the actual CDC %ILI (blue line) and Google Flu Trends (GFT) data [50] (green

line). The earliest prediction by our model was for the first week of 2013 because we

started collecting flu-related Twitter data
in late 2012. Both our prediction (Fig. 5.4(a)) and GFT data were available two weeks

earlier than the official CDC ILI report. Our model was fitted on 52 weeks data (week 1,

2013 – week 52, 2013) with a correlation of 0.9522 and a mean absolute error (MAE) of

0.2383, and was further validated on 19 previously unseen weekly data (week 1, 2014 –

week 19, 2014) with a correlation of 0.929 and MAE of 0.493. Our prediction did as well

or better than the GFT data at most data points, and aligned very well with the CDC

ILI data. Furthermore, our prediction performed significantly better than GFT during

January 2013 when GFT’s algorithm significantly overestimated peak flu levels [35].

Future Influenza Activity Forecast

Our 1-week ahead flu forecast model used CDC-3-2 Twitter-4-3-2-1-0 as features. This

feature set provided the highest correlation of 0.9268 on the model trained and validated

using 10-fold cross validation on 71 weeks data, which was higher than the correlation

of 0.8952 obtained by using only CDC-3-2. Here also adding Twitter data improved

the model performance. An initial model was built using the first one-year data and a

newer model was incrementally rebuilt in the following weeks (in a similar manner to how

our current flu forecast model was built). Our 1-week ahead forecast data (Fig. 5.4(b)) was

available 3 weeks ahead of the official CDC ILI report and 1 week ahead of GFT data. The

model was fitted using 52 data points (week 1, 2013 - week 52, 2013) and incrementally

rebuilt using all available data (including the new weekly data collected during the current

week) thereafter. The final model was validated by measuring a correlation between the

CDC weekly percentage weighted ILI and that predicted by our model on 19 additional

previously unseen weekly data points (week 1, 2014 through week 19, 2014). A correlation

of 0.895 and MAE of 0.3846 were obtained on the training data and a correlation of 0.71

and MAE of 0.662 were obtained on the previously unseen test data. These results were

very good considering our forecast data was available 3 weeks faster than the official CDC

data.

5.5. Summary

We presented a model that predicted weekly percentage of U.S. population with

Influenza-Like Illness using multilayer perceptron with back propagation algorithm on

a large-scale social media stream. Adding recent flu-related Twitter data as features

improved the model’s performance for both current and future forecasts. Our proposed

model could predict current and future influenza activities with high accuracy 2-3 weeks

faster than the traditional flu surveillance system could. The performance for the cur-

rent prediction was comparable to or better (in January 2013) than GFT. We expect the

model’s performance to improve as we continuously collect more Twitter data. We believe

these results present a very important step in not only accurately forecasting future flu

activity for planning, resource allocation and prevention, but also in demonstrating a

technique that can combine social media, an unstructured communication stream, with

observational data for prediction.

CHAPTER 6

Medical Concept Normalization

6.1. Introduction

On social media and online health communities, people often share their experiences

and opinions on various health topics including personal health issues and symptoms.

Especially, on medical forums, consumers ask health related questions, write reviews

on medications and describe negative side effects they experience while taking a drug.

Moreover, patients and their families can get emotional support by sharing their stories

of overcoming illnesses.

Medical concept normalization for user-generated texts aims at mapping a health

condition described in colloquial language to a medical concept in standard ontologies such

as Unified Medical Language System (UMLS) [71] via concept unique identifiers (CUIs).

This task has many applications for improving patient care such as: 1) understanding

questions and providing answers to patients/families seeking medical knowledge, 2) early

detection of patients who need immediate attention and medical support (e.g., people with

suicidal ideation), 3) digital disease surveillance (e.g., monitoring of pandemics), and 4)

clinical paraphrasing to improve patient engagement by helping patients understand their

clinical reports.

While consumers describe their health conditions in colloquial language, clinical knowl-

edge sources such as biomedical literature present medical terms in scientific language.

Table 6.1. Medical concepts in UMLS and example social media phrases
that describe the medical concept

Medical Concept Social Media Phrases


loss of hair hair falling out, hair loss, hair losss, losing my hair, thinning hair, hair has
started falling out, hair is getting very thin, hair was falling out
memory impairment memory problem, memory failure, memory deficits, poor memory, trouble re-
membering, memory weakened, couldn’t remember, foggy brain
ankle pain ankle hurt, ankles started aching, pain in ankles, ankles seized up, sore ankles,
sore and stiff ankles, terrible pain in my ankles, ankles ache so bad
diarrhoea direar, diaharrea, diahhrea, diahrea, diarrehea, dioreah, dioreaha,
bathroom with the runs
difficulty sleeping can not sleep, difficult to sleep, hard time sleeping, inability to sleep well, lousy
sleeping at night, poor sleep, problems sleeping, trouble sleeping

This gap in language use between patients/consumers and clinicians requires mapping from one to the other. In order to generate solutions to a given medical problem (e.g., to answer questions posted on an online health community), health conditions in user-generated texts need to be normalized to medical concepts in standard ontologies. Once a solution is generated, it needs to be translated back into colloquial language for users to easily understand.

Table 6.1 shows examples of user-generated texts from social media that describe medical concepts. The entries in the left column are medical concepts from standard medical ontologies, and the phrases in the same row are example phrases from social media that describe the concept. The examples illustrate well the colloquial language and non-standard terms used to describe medical conditions on social media. As can be seen in the table, the challenges for medical concept normalization include: 1) alternative descriptions of health conditions in colloquial language (e.g., 'sore and stiff ankles', 'terrible pain in my ankles', 'ankles ache so bad' → ankle pain; 'trouble sleeping', 'cannot sleep', 'hard time sleeping' → difficulty sleeping), and 2) no overlap of terms between a colloquial phrase and the scientific/medical term describing the same health condition (e.g., 'couldn't remember' → memory impairment; 'sight loss' → visual impairment; 'trouble remembering', 'foggy brain' → memory impairment). In the latter case, basic string matching approaches that do not capture the semantics of the text will perform poorly on a medical concept normalization task. Other challenges include misspellings and typos, as shown for the concept 'diarrhoea'.

In this work [6], we aimed to address the aforementioned challenges using deep learning-based architectures, and we studied how different types of input data used to build neural embeddings affect medical concept normalization performance.

Our key contributions are:

• We investigated the use of various domain-specific text data to build neural em-

beddings to learn semantic features of medical concepts for normalization.

• We demonstrated that two deep learning models (CNN and RNN) could better

predict the medical concepts when we used neural embeddings trained on domain-

specific clinical texts compared to those trained on a larger general domain text

corpus.

• Our best results established a new state of the art on two benchmark datasets, outperforming the accuracy of a strong normalization model by up to +21.17% on the Twitter data set and up to +21.28% on the AskAPatient data set.

This chapter is organized as follows. In Section 6.2, we present related work on deep neural network models, social media for healthcare, and medical concept normalization. In Section 6.3, we describe the CNN and RNN models we used for concept normalization. In Section 6.4, we describe how we re-created the social media datasets and present the details of the text data from various clinical knowledge sources used to build neural embeddings. In Section 6.5, we present our experimental results, followed by the conclusion in Section 6.6.

6.2. Related Work

6.2.1. Social Media for Healthcare

Social media has been widely used as a new medium for real-time information transmission in various domains, including health, to track the volume of mentions of diseases, drugs, and symptoms [3, 4], predict influenza activity, and detect adverse drug events (ADEs) earlier than traditional influenza or ADE surveillance systems, which have significant time delays in data processing [40, 68]. For automatic extraction of medical concepts from social media, researchers have used machine learning approaches such as CRFs (Conditional Random Fields) and HMMs (Hidden Markov Models) to extract phrases that describe medical concepts (e.g., diseases, drugs, symptoms) [79, 90], identify relationships between two medical concepts (e.g., duration, frequency, dosage, and route for a drug; indication; side effects), and classify texts into different categories (e.g., health vs. non-health, ADE vs. non-ADE) [99, 92, 68].

6.2.2. Deep Neural Network Models

Recurrent neural network (RNN) models have been shown to be very effective in many natural

language processing (NLP) tasks. Unlike traditional neural network models, RNNs use

sequential information. Hence they are well-suited for tasks such as machine transla-

tion, speech recognition, language modeling and image caption generation. Traditionally,

convolutional neural network (CNN) models have been widely used in image processing
86

tasks (e.g., automatic recognition of hand-written numbers, object detection) because of

their ability to learn task-relevant features. However, with the recently proposed word

embedding models (word2vec) by Mikolov et al. [76, 77], deep neural network models for

NLP tasks have gained popularity. Kim [59] showed that a simple one-layer CNN model trained on top of pre-trained word vectors outperforms several state-of-the-art models for

text classification such as sentiment analysis and question classification. Lee et al. [68]

explored semi-supervised CNN models to detect adverse drug events in tweets and demon-

strated that neural word embeddings trained on a smaller domain-specific dataset helped

more than the one trained on a larger random dataset for ADE classification. Deep learning models have also been shown to be highly effective in other healthcare tasks such as clinical diagnostic inferencing [86] and neural clinical paraphrase generation [54, 85].

6.2.3. Concept Normalization

Traditional approaches used for medical concept normalization include lexicon-based

string matching, heuristic string matching, and rule-based text mapping to a set of pre-

defined variants of terms [88, 25, 74]. DNorm [67] is a state-of-the-art concept (disease name) normalization system based on pairwise learning to rank, which learns similarities between mentions and concept names. Limsopatham et al. used a machine translation

approach in which a social media phrase is translated into a formal medical concept. More

recently, Limsopatham et al. [70] showed that simple deep learning models, convolutional

neural network (CNN) and recurrent neural network (RNN), with pre-trained word em-

beddings induced from a large collection of Google News (GNews) and BioMed Central

(BMC) articles improved the performance over previous state-of-the-art concept normal-

ization models and reported that GNews was more effective than BMC for both CNN and

RNN across all datasets.

Our work significantly improved on the results of Limsopatham et al. [70] by refining their original datasets and leveraging neural embeddings of various health-related texts to better learn the semantic characteristics of medical concepts, providing a new state-of-the-art accuracy for medical concept normalization.

6.3. Model Description

In this section, we describe the two deep learning models, a convolutional neural network (CNN) and a recurrent neural network (RNN), that we used for medical concept normalization.
Figure 6.1. Generic convolutional neural network architecture.

6.3.1. Convolutional Neural Network (CNN)

CNN is a feed-forward neural network model that learns task-relevant semantic features for text classification. Figure 6.1 depicts a simple CNN with an input layer, followed by a convolutional layer with multiple filters, a pooling layer, and a final softmax classifier. The input to the CNN is a phrase or sentence represented as a matrix; each row of the matrix is a low-dimensional vector (word embedding) representing a token or word.

Formally, given an input phrase x of length j, where x = x_i, x_{i+1}, ..., x_{i+j} denotes a sequence of words and x_i denotes a k-dimensional word vector, a filter w ∈ R^{hk} is applied to a window of h words to produce a new feature in the convolution layer. For example, a feature c_i is generated as follows:

(6.1)  c_i = f(w · x_{i:i+h-1} + b)

from a window of words x_{i:i+h-1}, where b is a bias term and f is a nonlinear activation function. Each filter is applied across the input matrix to produce a feature map. The pooled features are then passed to a fully connected softmax layer to output the most probable label [59]. For example, for the eight-word phrase 'my feet feel like I have stone bruises' with a 300-dimensional embedding, the input to the CNN would be an 8 × 300 matrix and the output would be the CUI representing the medical concept 'foot pain'.
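To make this concrete, the following is a minimal sketch of such a single-layer text CNN in PyTorch. It is a generic Kim-style model rather than our exact implementation; names such as TextCNN, vocab_size, and num_concepts are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextCNN(nn.Module):
        """Single-layer CNN over word embeddings (Kim-style sketch)."""
        def __init__(self, vocab_size, num_concepts, embed_dim=300,
                     num_filters=100, window_sizes=(3, 4, 5), dropout=0.5):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            # One convolution per window size h; each filter spans h words.
            self.convs = nn.ModuleList(
                [nn.Conv1d(embed_dim, num_filters, h) for h in window_sizes])
            self.dropout = nn.Dropout(dropout)
            self.fc = nn.Linear(num_filters * len(window_sizes), num_concepts)

        def forward(self, tokens):              # tokens: (batch, seq_len)
            x = self.embedding(tokens)          # (batch, seq_len, embed_dim)
            x = x.transpose(1, 2)               # Conv1d expects (batch, dim, seq)
            # c_i = f(w · x_{i:i+h-1} + b), then max-pool each feature map.
            feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
            out = self.dropout(torch.cat(feats, dim=1))
            return self.fc(out)                 # softmax is applied in the loss

Each Conv1d filter implements Equation 6.1 across all windows of the input matrix, and max-pooling keeps the strongest response per filter before the final fully connected layer.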

6.3.2. Recurrent Neural Network (RNN)

RNN is a family of artificial neural networks that uses internal memory to process variable-length sequential data. Figure 6.2 shows an unrolled RNN architecture, where x_t, y_t, and h_t are the input, output, and hidden states at time step t, and W, U, and V are model parameters corresponding to the input, hidden, and output layer weights shared across all time steps [54].




Figure 6.2. Generic recurrent neural network architecture.

The hidden state h_t can be formulated as follows:

(6.2)  h_t = f(W x_t + U h_{t-1}),

where h_{t-1} is the previous hidden state, x_t is the current input, and f is an element-wise nonlinear activation function.

Although the RNN is a powerful model for encoding sequences, it suffers from the vanishing gradient problem when trying to learn long-range dependencies [28]. We used a gated recurrent unit (GRU) [42], which is known to be a successful remedy for the vanishing gradient problem. The hidden state h_t of a GRU can be formulated as follows:

       z_t = σ(W^z x_t + U^z h_{t-1})
       r_t = σ(W^r x_t + U^r h_{t-1})
       k_t = tanh(W^k x_t + U^k (r_t ⊙ h_{t-1}))
(6.3)  h_t = (1 - z_t) ⊙ k_t + z_t ⊙ h_{t-1},

where ⊙ denotes element-wise multiplication. The GRU cell has two gates, an update gate z_t and a reset gate r_t; k_t is the candidate hidden state. z_t and r_t are computed using different weight parameters: z_t determines how much of the old memory to keep, while r_t determines how to combine the new input with the previous memory. Finally, k_t is computed using r_t, and h_t carries the information to be transmitted to the following layers.
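As an illustration, a minimal NumPy sketch of a single GRU step, directly transcribing Equation 6.3, is shown below; the weight shapes and the small usage example are illustrative assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wk, Uk):
        """One GRU time step following Equation 6.3."""
        z_t = sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate
        r_t = sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate
        k_t = np.tanh(Wk @ x_t + Uk @ (r_t * h_prev))  # candidate state
        return (1.0 - z_t) * k_t + z_t * h_prev        # new hidden state

    # Toy usage with a 4-dimensional input and a 3-dimensional hidden state.
    rng = np.random.default_rng(0)
    d_in, d_h = 4, 3
    x, h = rng.normal(size=d_in), np.zeros(d_h)
    Wz, Wr, Wk = (rng.normal(size=(d_h, d_in)) for _ in range(3))
    Uz, Ur, Uk = (rng.normal(size=(d_h, d_h)) for _ in range(3))
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wk, Uk)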

6.4. Experimental Setup

6.4.1. Data

We used two data sets, TwADR-L (from Twitter) and AskAPatient, which were used by Limsopatham et al. [70] for medical concept normalization1. TwADR-L was created by the authors of [70], and the AskAPatient dataset was created by Karimi et al. [58] for ADR (adverse drug reaction) research, from which the authors extracted the gold-standard mappings of phrases to medical concepts.
Table 6.2. Data statistics after removing duplicates from the combined training, validation, and test data

                                 TwADR-L   AskAPatient
# unique phrases                   2,944         4,469
# unique labels                    2,220         1,036
# unique phrase-label pairs        3,157         4,496
# phrases with multiple labels       173            26
Min # examples per label               1             1
Max # examples per label              36           141
Avg # examples per label            1.42          4.35

In the original dataset, TwADR-L had 48,057 training, 1,256 validation, and 1,427

test examples. The test set (all test samples from 10 folds combined) consisted of 765

unique phrases and 273 unique classes (or medical concepts). The AskAPatient dataset

contained 156,652 training, 7,926 validation, and 8,662 test examples. The entire test

set (all test samples from 10 folds combined) consisted of 3,749 unique phrases and 1,035
1Available at https://zenodo.org/record/55013#.WKXwdxIrLde

Table 6.3. Examples of phrases with multiple labels

Social Media Phrase   Multi-Labels (Medical Concepts)

shaking               shivering, trembling, tremor
mad                   anger, rage
have no emotion       emotional disorder, indifferent mood
mood swings           bipolar disorder, disturbance in mood
sore                  pain, myalgia
high blood pressure   increased venous pressure, hypertension, findings of increased blood pressure

unique classes (medical concepts). The authors randomly split each dataset into ten equal folds, ran 10-fold cross-validation, and reported the accuracy averaged across the ten folds.

We found that, in the original data set, many phrase-label pairs appeared multiple

times within the same training data file and also across the training and test data sets in

the same fold. In the AskAPatient data set, on average 35.82% of the test data overlapped

with training data in the same fold. In the Twitter (TwADR-L) dataset, on average

8.62% of the test set had an overlap with the training data in the same fold. Having

a large overlap between the training and the test data could potentially introduce bias

in the model and contribute to artificially high accuracy. Therefore, to remove this bias, we further cleaned and re-created the training, validation, and test sets such that each phrase-label pair appeared only once in the entire dataset (in either the training, validation, or test set).

First, we combined all examples in training, validation and test data from the original

data set and then removed all duplicate phrase-label pairs (examples that had the same

phrase and label pair and appeared more than once in training/validation/test datasets).

Table 6.2 shows statistics of the new dataset after removing duplicates. The Twitter data

set had 3,157 unique phrase-label pairs and 2,220 unique labels (medical concepts) while

173 phrases had multiple labels (i.e., they were assigned to more than one label). Many

concepts had only one example, and the concept with the most examples had 36 phrases. On average, each concept had 1.42 examples. The AskAPatient data set

had 4,496 unique phrase-label pairs and 1,036 unique labels while 26 phrases had multiple

labels. Table 6.3 shows examples of phrases that had multiple labels. For example, ‘mad’

could be mapped to ‘anger’ or ‘rage’, and ‘sore’ could be mapped to ‘pain’ or ‘myalgia’.

Second, we removed all concepts that had fewer than five examples. The statistics of the final data are shown in Table 6.4. Third, we divided all examples without multiple labels into ten random folds such that each unique phrase-label pair appeared in exactly one of the ten test sets. We added the pairs with multiple labels to the training data. This final 10-fold dataset was used in all our experiments; the procedure is sketched after Table 6.4.

Table 6.4. Data statistics after removing concepts that had fewer than five examples

                                 TwADR-L   AskAPatient
# unique phrases                     543         2,494
# unique labels                       65           228
# unique phrase-label pairs          617         1,427
# phrases with multiple labels       173            26
Min # examples per label               5             5
Max # examples per label              36            78
Avg # examples per label             9.5            11
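The sketch below illustrates this re-creation procedure in pandas, as referenced above; the DataFrame layout and the column names phrase and cui are illustrative assumptions, not the exact code used in this work.

    import pandas as pd

    def recreate_folds(df: pd.DataFrame, n_folds: int = 10, min_examples: int = 5):
        """Deduplicate phrase-label pairs, drop rare concepts, and assign folds."""
        # Step 1: each phrase-label pair appears exactly once in the dataset.
        df = df.drop_duplicates(subset=["phrase", "cui"])
        # Step 2: remove concepts with fewer than `min_examples` examples.
        df = df.groupby("cui").filter(lambda g: len(g) >= min_examples)
        # Step 3: phrases mapped to multiple CUIs always go into training;
        # the remaining pairs are spread evenly over `n_folds` test folds.
        multi = df.groupby("phrase")["cui"].transform("nunique") > 1
        single = df[~multi].sample(frac=1, random_state=0).reset_index(drop=True)
        single["fold"] = single.index % n_folds
        return single, df[multi]

    # Toy usage: 'mad' maps to two CUIs and therefore stays in training.
    demo = pd.DataFrame({"phrase": ["mad", "mad", "sore"] * 4,
                         "cui": ["anger", "rage", "pain"] * 4})
    folds, always_train = recreate_folds(demo, n_folds=2, min_examples=1)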

6.4.2. Data Sources for Word Embedding

In this section, we describe different types of unlabeled text data we used for building

neural embeddings.

Figure 6.3. Definition, example sentence, synonyms, related words, near antonyms, and antonyms for the word 'sore' obtained from the Merriam-Webster Thesaurus.

6.4.2.1. Thesaurus (TH). For each word in the TwADR-L and AskAPatient datasets (both phrases and labels), we obtained the following six types of information from the Merriam-Webster thesaurus2: definition, example sentence, synonyms, related words, near antonyms, and antonyms. Figure 6.3 illustrates the information obtained for the word 'sore', the second-to-last example in Table 6.3. The definition of 'sore' includes the label 'pain', and the list of synonyms includes 'painful' (the adjective form of the label 'pain'). Therefore, word embeddings built with the thesaurus should help the model learn the semantics and predict the label 'pain'.

Figure 6.4. Medical definition of the term ‘myalgia’ obtained from Merriam-
Webster Medical Dictionary.

2https://www.merriam-webster.com/thesaurus

6.4.2.2. Medical Dictionary (MD). We collected definitions from the Merriam-Webster Medical Dictionary3, which contains 60,000 words and phrases used by healthcare professionals. It is also used on the National Library of Medicine's consumer health website to help consumers with the spelling of medical words and the understanding of medical notes written by physicians4. For each unique word in the TwADR-L and AskAPatient datasets, we obtained a medical definition (if present) using the Merriam-Webster medical dictionary API5. The dictionary contains clinical terms that may not be found in the thesaurus. We found that while the definitions of some terms were the same in both the thesaurus and the medical dictionary, other terms either were defined with slightly different words/phrases or lacked a definition in one or both sources. For example, the word 'myalgia' was in the medical dictionary but not in the thesaurus. As shown in Figure 6.4, we were able to collect the definition of 'myalgia', a medical term that was not found in the thesaurus.

6.4.2.3. Clinical Texts (CT). Clinical Texts is a collection of sentences from the fol-

lowing sources in the medical domain.

Adverse Drug Reaction Classification System (ADReCS)6: a comprehensive ADR ontology database that provides both standardization and hierarchical classification of ADR terms [36]. The database integrates ADR and drug information collected from

3 https://www.merriam-webster.com/medical
4 https://www.nlm.nih.gov/news/mplusdictionary03.html
5 https://www.dictionaryapi.com/products/api-medical-dictionary.htm
6 http://bioinf.xmu.edu.cn/ADReCS/

various public medical repositories such as DailyMed7, MedDRA [34], SIDER2 [62], DrugBank8, PubChem9, and UMLS. It contains 6.7K unique ADR terms, 1,698 drug names, and 154K drug-ADR pairs. For each term in the ADReCS database, we collected its definition and synonyms. For example, the definition of the word 'myalgia' is 'painful sensation in the muscles', and its synonyms include 'myalga', 'myaigia', 'soreness', 'muscle pain', 'muscle ache', etc.

Biomedical Literature: We collected 301,790 sentences from all Wikipedia pages under the category of clinical medicine10. We also collected 4,271 sentences from PubMed articles in the adverse drug events benchmark corpus [52].

Medical Concept to Lay Term Dictionaries: We used two medical-to-lay-term dictionaries to create a collection of sentences11,12. These dictionaries contain professional medical terms and their definitions described in lay language. For example, the medical term 'anesthesia' is defined in lay language as 'loss of sensation or feeling', the term 'cephalalgia' as 'headache', and the term 'dyspnea' as 'hard to breathe' or 'short of breath'. From these dictionaries, we generated sentences (e.g., 'Anesthesia refers to loss of sensation or feeling', 'Cephalalgia means headache') by combining a term and its definition with a connecting phrase randomly chosen from a small preselected set (e.g., 'stands for', 'refers to', 'indicates', 'means'), as sketched below. We created a total of 1,556 sentences from these sources.
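A minimal sketch of this sentence-generation step, assuming the dictionaries are available as a term-to-definition mapping, might look as follows:

    import random

    # Hypothetical term-to-lay-definition mapping extracted from the dictionaries.
    lay_definitions = {
        "anesthesia": "loss of sensation or feeling",
        "cephalalgia": "headache",
        "dyspnea": "hard to breathe",
    }
    connectors = ["stands for", "refers to", "indicates", "means"]

    random.seed(0)
    sentences = [f"{term} {random.choice(connectors)} {definition}"
                 for term, definition in lay_definitions.items()]
    # e.g., 'cephalalgia means headache'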

7 https://dailymed.nlm.nih.gov/dailymed/
8 https://www.drugbank.ca/
9 https://pubchem.ncbi.nlm.nih.gov/
10 https://en.wikipedia.org/wiki/Category:Clinical_medicine
11 http://gsr.lau.edu.lb/irb/forms/medical_lay_terms.pdf
12 https://depts.washington.edu/respcare/public/info/Plain_Language_Thesaurus_for_Health_Communications.pdf

UMLS Medical Concept Definitions: We extracted a total of 167,550 sentences

that defined medical terms in the UMLS Metathesaurus [31], a large biomedical thesaurus

consisting of millions of medical concepts and used by professionals for patient care and

public health.

Table 6.5. Medical concepts and similar words based on cosine similarity, obtained from word embeddings built with different health-related text corpora.

Medical Concept   Clinical Texts (CT)   Medical Dictionary (MD)   Thesaurus (TH)   Health-related Tweets (HT)

depression dysthymia arthritic recession boredom
anxiety mood disorder weightgain
schizophrenia-like diminution collapse obesityWHO
benzodiazepine-induced exertion lassitude irritability
hopelessness fatigue lethargy anxiety

insomnia apnea sleeplessness depressionchronic
derealization wakefulness migraines
sleep – restlessness weightgain
dysthymic hyperexcitability
awakening stressrelated

dizzy lightheaded verge woozy lightheaded
faint restless fainting nauseous
nauseated light-headed whirling headache
swaying lamely faint lethargic
shaky paranoia feeble sleepfeeling

myalgia backache arthralgia
arthralgia athralgia
asthenia – – muscleampjoint
aches odynophagia
fatigability bodymuscle

hypertension dyslipidemia arterial diseaseheart
renovascular hypotension diabetes
nephrosclerosis narrowing – dyslipidemia
beta-antagonists weakness pressurehigh
Gestosis diallation arteriosclerosis

6.4.2.4. Health-related Tweets (HT). We collected 100 million publicly available health-related tweets that mentioned 116 common diseases and symptoms (e.g., flu, depression, insomnia, diabetes, obesity, heart disease, anxiety disorder) using the Twitter streaming API13, which provides approximately 1% of all publicly available tweets. As preprocessing steps, we removed non-English tweets, tokenized the text, normalized it to lowercase, and replaced hyperlinks, numerics, and Twitter screen names with the special tokens 'URL', 'NUMBER', and 'USER'.

Table 6.5 shows medical concepts and examples of the top 20 similar words by cosine similarity, based on the word embeddings built with each individual data source.
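As an illustration of how such embeddings and similarity lists can be produced, a small sketch using gensim's word2vec implementation is shown below; gensim is a plausible tool choice and the toy corpus is an assumption, not the exact pipeline used in this work.

    from gensim.models import Word2Vec

    # `corpus` is assumed to be an iterable of preprocessed token lists.
    corpus = [["my", "insomnia", "is", "terrible"],
              ["could", "not", "sleep", "last", "night"]]

    # 300-dimensional skip-gram embeddings, matching the experiments below.
    model = Word2Vec(sentences=corpus, vector_size=300, sg=1,
                     window=5, min_count=1, workers=4)

    # Nearest neighbors by cosine similarity (cf. Table 6.5).
    print(model.wv.most_similar("insomnia", topn=20))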

6.5. Results

Table 6.6. Classification accuracy (%) using 10-fold cross-validation (TH = thesaurus, MD = medical dictionary, CT = clinical texts, HT = health-related tweets; batch size = 50, number of epochs = 100, vector dimension = 300)

Word Embeddings        TwADR-L CNN   TwADR-L RNN   AskAPatient CNN   AskAPatient RNN
Rand 16.06 22.05 40.95 58.54
GNews 15.57 23.17 45.73 64.41
TH 14.43 20.43 32.66 57.17
MD 15.73 19.62 41.90 58.26
CT 14.77 22.21 45.49 61.81
HT 16.69 24.63 45.46 64.08
TH + MD + CT + HT 19.46 25.30 55.46 65.04

Table 6.6 shows the accuracy of the classification models using 10-fold cross-validation, averaged over the ten folds. The first two rows are our baseline models14 [70], in which the CNN and

13https://dev.twitter.com/streaming/public
14Code available at https://github.com/nutli/concept_normalisation

RNN models use randomly generated embeddings (Rand) and publicly available pre-trained word embeddings generated from 100 billion words of Google News (GNews)

using word2vec [77] as inputs. The next four rows (rows 3-6) present the performance of the same CNN and RNN architectures as the baseline models but using the word embeddings we built on the various clinical texts described in Section 6.4.2. The last row presents the performance when the models use word embeddings built from the combination of all four data sources as input. All experiments, including the baseline models, were trained and evaluated on the cleaned, newly created datasets (described in Section 6.4.1).

Among the individual datasets (TH, MD, CT, HT), the health-related tweets (HT) had the most significant impact on classification performance. Both the CNN and RNN models performed comparably to (on the AskAPatient dataset) or better than (on the TwADR-L dataset) the best baseline models. When we combined all individual datasets, the classification accuracy improved substantially over all baseline models and all our individual-source models. Compared to the best baseline accuracy, the improvement was +21.17% for TwADR-L CNN, +9.19% for TwADR-L RNN, +21.28% for AskAPatient CNN, and +0.98% for AskAPatient RNN; the improvement was especially substantial for the CNN. For all models, we used the following hyperparameters: batch size = 50, number of epochs = 100, vector dimension = 300, number of neurons in the hidden layer = 100, dropout rate = 0.5, the rectifier (ReLU) as the nonlinear activation function, and max-pooling for the CNN.
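For concreteness, a hedged sketch of wiring these hyperparameters into a training loop for the TextCNN sketched in Section 6.3.1 is shown below; the toy data, the sizes, and the choice of the Adam optimizer are illustrative assumptions rather than our exact setup.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy data standing in for the tokenized phrases; sizes are illustrative.
    tokens = torch.randint(0, 20000, (500, 8))   # 500 phrases, 8 tokens each
    labels = torch.randint(0, 65, (500,))        # 65 concepts (cf. Table 6.4)
    train_loader = DataLoader(TensorDataset(tokens, labels), batch_size=50)

    model = TextCNN(vocab_size=20000, num_concepts=65)  # sketch from Section 6.3.1
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()                   # applies log-softmax internally

    for epoch in range(100):                            # number of epochs = 100
        for batch_tokens, batch_labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(batch_tokens), batch_labels)
            loss.backward()
            optimizer.step()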

6.5.1. Ablation Study

Next, we conducted experiments to study the effect of removing each dataset from training.

Table 6.7 presents the performance loss when each dataset is removed from the set of all

Table 6.7. Ablation study. Comparison of the models' accuracy (%) when a feature is removed from the set of all features (TH = thesaurus, MD = medical dictionary, CT = clinical texts, HT = health-related tweets). The numbers in parentheses indicate the performance drop when the feature is removed.

Word Embeddings   TwADR-L CNN   TwADR-L RNN   AskAPatient CNN   AskAPatient RNN
All - HT 18.80 (-0.66) 22.54 (-2.76) 46.37 (-9.09) 62.97 (-2.07)
All - TH 16.38 (-3.08) 25.44 (+0.14) 45.29 (-10.17) 62.96 (-2.08)
All - CT 15.58 (-3.88) 24.96 (-0.34) 45.61 (-9.85) 64.09 (-0.95)
All - MD 17.69 (-1.77) 26.60 (+1.3) 44.50 (-10.96) 63.93 (-1.11)
All 19.46 25.30 55.46 65.04

possible resources (TH + MD + CT + HT). Interestingly, a different data source appeared to be the most important for each deep learning model and dataset. The performance dropped by 3.88% (from 19.46% to 15.58%) when the clinical texts (CT) were removed, indicating that CT is the most important of the four individual features for the TwADR-L CNN. For the TwADR-L RNN, the health-related tweets (HT) were the most helpful feature, as indicated by the 2.76% performance drop when they were removed.

While the definitions from the medical dictionary (MD) contributed the most to the AskAPatient CNN model (a 10.96% performance drop when removed), the definitions, synonyms, and antonyms from the thesaurus (TH) were the most significant feature for the AskAPatient RNN model (a 2.08% performance drop when removed). These results indicate that text data from each healthcare domain helps the deep learning models learn clinical semantics for normalization. Word embeddings built with the larger dataset that combined texts from multiple healthcare domains contributed significantly to improving the models' performance across both the Twitter and AskAPatient datasets, compared to embeddings built from a larger general-domain corpus such as Google News.

Table 6.8. TwADR-L examples that should have multiple labels

Social Media Phrase                Gold CUI   Gold Concept        Predicted CUI   Predicted Concept

feel like crap                     C0011570   mental depression   C0344315        depressed mood
not being able to eat              C1971624   loss of appetite    C0232462        decrease in appetite
feeling weird                      C1443060   feeling abnormal    C0278061        abnormal mental state
depressive emotions and thoughts   C0011570   mental depression   C0086132        depressive symptoms
wide awake                         C0455769   energy increased    C0043012        wakefulness

6.5.2. Qualitative Analysis

Table 6.8 shows examples that our best model predicted incorrectly. The first column shows example phrases from social media posts that describe medical conditions; the second and third columns show the annotated CUIs (concept unique identifiers) and the corresponding medical concept descriptions; and the fourth and fifth columns show the CUIs and corresponding concept descriptions predicted by our best model (TH + MD + CT + HT). These examples count as errors against the ground-truth labels (i.e., the predicted CUIs do not match the labeled CUIs). However, we can observe that, although the CUIs are different, the social media phrases can actually be mapped to both the predicted and the labeled concepts. For example, the predicted concept 'decrease in appetite' and the label 'loss of appetite' have similar meanings, so predicting the phrase 'not being able to eat' as the concept 'decrease in appetite' should arguably be considered correct. While some phrases in the dataset have multiple labels, there are many more that should have multiple labels (such as those shown in Table 6.8).

This suggests several future directions for designing a normalization system. First, it would be useful to have sets of CUIs that represent similar medical concepts so that, when a normalization system predicts a CUI, the mapping can automatically be associated with the other CUIs in the same set. Second, the normalization task should be cast as a multi-class, multi-label classification problem, since each phrase can be mapped to multiple concepts (as shown in Tables 6.3 and 6.8) and each concept can have many social media phrases (as shown in Table 6.1).
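One plausible way to realize the second direction, sketched under the assumption that gold labels are available as multi-hot vectors, is to replace the single softmax output with independent per-concept sigmoids trained with binary cross-entropy:

    import torch
    import torch.nn as nn

    num_concepts = 65                        # illustrative, cf. Table 6.4
    logits = torch.randn(50, num_concepts)   # scores from a CNN/RNN encoder
    targets = torch.zeros(50, num_concepts)  # multi-hot gold labels
    targets[0, [3, 7]] = 1.0                 # e.g., 'mad' -> {anger, rage}

    # One independent sigmoid per concept instead of a single softmax.
    loss = nn.BCEWithLogitsLoss()(logits, targets)

    # At prediction time, every concept whose probability exceeds a threshold
    # is returned, allowing multiple labels per phrase.
    predicted = torch.sigmoid(logits) > 0.5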

6.6. Summary

In this work, we explored building neural word embeddings from unlabeled text data drawn from various clinical knowledge sources for medical concept normalization of user-generated social media texts. We showed that the two deep learning models (CNN and RNN) could better predict medical concepts when we used clinical domain-specific neural embeddings than when we used embeddings trained on a larger general-domain text corpus. Our experiments showed that the proposed models, with neural embeddings trained on the combined clinical data sources, improved accuracy by up to 21.17% on the Twitter data set and up to 21.28% on the AskAPatient data set.

CHAPTER 7

Conclusion and Future Research Work

Social media is an invaluable resource for mining healthcare insights. In this thesis,

we presented intelligent systems we built using Twitter data for retrieving health-related

information, monitoring and predicting disease activities, and normalizing medical concepts. We proposed a multi-class classification model that classified trending topics or Twitter posts into 18 general categories. Although both of our approaches, bag-of-words and network-based, classified topics with high accuracy, the network-based model using categories of similar topics achieved superior classification performance. This model could help with searching for information in a specific domain such as health. We also discussed our contributions toward building a real-time disease surveillance system using spatial, temporal, and text mining on Twitter data. The proposed system could effectively track daily disease activities and map U.S. regional disease levels in near real time. Although our work focused on tracking three diseases (allergy, cancer, flu), the model could easily be adapted to track other diseases. We further built a neural network model that predicted current and future influenza activities with high accuracy by combining large-scale real-time social media data with observed CDC data. Finally, we investigated normalizing health conditions described in colloquial language to standard medical terminologies in the Unified Medical Language System (UMLS). By training two deep learning models, a convolutional neural network (CNN) and a recurrent neural network (RNN), on various clinical knowledge sources, we were able to achieve significantly better results than the baseline techniques.

Although some pioneering work has been done, many challenges remain in mining social media for healthcare insights. Medical concept extraction - identifying phrases that describe health conditions - is a challenging task because of the many different ways of describing the same condition and the colloquial language used on social media. Developing novel models for medical concept extraction using advanced natural language processing (NLP) techniques and deep learning would be an interesting research area for future work. Such models would help automatic systems understand users' health issues or clinical questions accurately so that they can provide more relevant information to users. We are also interested in automatically detecting mentions of adverse drug events (ADEs), negative side effects that occur as a result of medical interventions, in social media.

We proposed a number of techniques that could be useful for collecting, analyzing, and predicting health-related information from real-time social media. However, many challenging problems remain in mining user-generated content for healthcare insights. We hope that the techniques proposed in this thesis can serve as a stepping stone toward addressing some of those research questions.



References

[1] K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, and A. Choud-

hary. Twitter trending topic classification. In Data Mining Workshops (ICDMW),

2011 IEEE 11th International Conference on, pages 251–258. IEEE, 2011.

[2] K. Lee, A. Agrawal, and A. Choudhary. Real-time digital flu surveillance using

twitter data. In The 2nd Workshop on Data Mining for Medicine and Healthcare,

2013.

[3] K. Lee, A. Agrawal, and A. Choudhary. Real-time disease surveillance using twitter

data: Demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining, KDD ’13,

pages 1474–1477, New York, NY, USA, 2013. ACM.

[4] K. Lee, A. Agrawal, and A. Choudhary. Mining social media streams to improve

public health allergy surveillance. In 2015 IEEE/ACM International Conference on

Advances in Social Networks Analysis and Mining (ASONAM), pages 815–822, Aug

2015.

[5] K. Lee, A. Agrawal, and A. Choudhary. Forecasting influenza levels using real-

time social media streams. In 2017 IEEE International Conference on Healthcare

Informatics (ICHI), pages 409–414, Aug 2017.



[6] K. Lee, S. A. Hasan, O. Farri, A. Choudhary, and A. Agrawal. Medical concept nor-

malization for online user-generated texts. In 2017 IEEE International Conference

on Healthcare Informatics (ICHI), pages 462–469, Aug 2017.

[7] T. Zhu, H. Gao, Y. Yang, K. Bu, Y. Chen, D. Downey, K. Lee, and A. N. Choudhary.

Beating the artificial chaos: Fighting osn spam using its own templates. IEEE/ACM

Transactions on Networking, 24(6):3856–3869, December 2016.

[8] H. Gao, Y. Yang, K. Bu, Y. Chen, D. Downey, K. Lee, and A. Choudhary. Spam

ain’t as diverse as it seems: Throttling osn spam with templates underneath. In

Proceedings of the 30th Annual Computer Security Applications Conference, ACSAC

’14, pages 76–85, New York, NY, USA, 2014. ACM.

[9] D. Palsetia, M. M. A. Patwary, K. Zhang, K. Lee, C. Moran, Y. Xie, D. Honbo, A. Agrawal, W.-K. Liao, and A. Choudhary. User-interest based community extraction in social networks. 2012.

[10] A. Choudhary, W. Hendrix, K. Lee, D. Palsetia, and W.-K. Liao. Social media

evolution of the egyptian revolution. Commun. ACM, 55(5):74–80, May 2012.

[11] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. Choudhary. Towards online spam fil-

tering in social networks. In Proceedings of the 19th Annual Network and Distributed

System Security Symposium, 2012.

[12] K. Zhang, Y. Cheng, Y. Xie, D. Honbo, A. Agrawal, D. Palsetia, K. Lee, W.-K. Liao, and A. Choudhary. SES: Sentiment elicitation system for social media data. In 2011

IEEE 11th International Conference on Data Mining Workshops, pages 129–136,

Dec 2011.

[13] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. Choudhary. Poster: Online spam

filtering in social networks. In Proceedings of the 18th ACM Conference on Computer

and Communications Security, CCS ’11, pages 769–772, New York, NY, USA, 2011.

ACM.

[14] Real-time digital allergy surveillance. http://pulse.eecs.northwestern.edu/~kml649/allergy/.

[15] Real-time digital cancer surveillance. http://pulse.eecs.northwestern.edu/~kml649/cancer/.

[16] Real-time digital flu surveillance. http://pulse.eecs.northwestern.edu/~kml649/flu/.

[17] Twitter streaming API. https://dev.twitter.com/docs/streaming-apis.

[18] Centers for Disease Control and Prevention, seasonal influenza (flu). http://www.

cdc.gov/flu, 2012.

[19] World of DTC Marketing.com. Web first place people go for health information, but you knew that already, didn't you. http://worldofdtcmarketing.com, 2012.



[20] The Huffington Post. Michigan flu season 2013: Four children die in influenza outbreak of A(H3N2). http://www.huffingtonpost.com/2013/01/12/michigan-flu-season-2013-ah3n2_n_2458916.html, 2013.

[21] USA Today. 700 cases of flu prompt Boston to declare emergency. http://www.usatoday.com/story/news/nation/2013/01/09/boston-declares-flu-emergency/1820975, 2013.

[22] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu. Predicting flu trends us-

ing twitter data. In Computer Communications Workshops (INFOCOM WKSHPS),

2011 IEEE Conference on, 2011.

[23] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms. Machine

learning, 6(1):37–66, 1991.

[24] E. Aramaki, S. Maskawa, and M. Morita. Twitter Catches the Flu: Detecting In-

fluenza Epidemics Using Twitter. In Proceedings of the Conference on Empirical

Methods in Natural Language Processing, pages 1568–1576, 2011.

[25] A. R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus:

the MetaMap program. In Proceedings of the AMIA Annual Symposium, pages 17–21, 2001.

[26] S. Asur and B. A. Huberman. Predicting the Future with Social Media. In Proceed-

ings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence

and Intelligent Agent Technology - Volume 01, pages 492–499, 2010.



[27] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event

identification on twitter. In Proceedings of AAAI, 2011.

[28] Y. Bengio, P. Simard, and P. Frasconi. Learning Long-Term Dependencies with

Gradient Descent is Difficult. IEEE Transactions on Neural Networks, 5(2):157–

166, 1994.

[29] D. L. Blackwell, J. W. Lucas, and T. C. Clarke. Summary health statistics for U.S. adults: National Health Interview Survey, 2012. http://www.cdc.gov/nchs/data/series/sr_10/sr10_260.pdf, 2013.

[30] B. Bloom, L. I. Jones, and G. Freeman. Summary health statistics for U.S. children: National Health Interview Survey, 2012. http://www.cdc.gov/nchs/data/series/sr_10/sr10_258.pdf, 2012.

[31] O. Bodenreider. The unified medical language system (umls): integrating biomedical

terminology. Nucleic acids research, 32:D267–D270, 2004.

[32] J. Bollen and H. Mao. Twitter mood as a stock market predictor. Computer,

44(10):91–94, 2011.

[33] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal

of Computational Science, 2(1):1 – 8, 2011.

[34] E. G. Brown, L. Wood, and S. Wood. The Medical Dictionary for Regulatory Activities (MedDRA). Drug Safety, 20(2):109–117, 2012.



[35] D. Butler. When Google got flu wrong. Nature, 494(7436):155–156, Feb. 2013.

[36] M. Cai, Q. Xu, Y. Pan, W. Pan, N. Ji, Y. Li, H. Jin, K. Liu, and Z. Ji. ADReCS:

an ontology database for aiding standardization and hierarchical classification of

adverse drug reaction terms. Nucleic Acids Research, 43(Database-Issue):907–913,

2015.

[37] D. Capurro, K. Cole, I. M. Echavarrı́a, J. Joe, T. Neogi, and M. A. Turner. The Use

of Social Networking Sites for Public Health Practice and Research: A Systematic

Review. J Med Internet Res, 16(3):e79, Mar 2014.

[38] P. Chakraborty, P. Khadivi, B. Lewis, A. Mahendiran, and J. Chen. Forecasting a Moving Target:

Ensemble Models for ILI Case Count Predictions. In SDM, 2014.

[39] L. Chen, H. Achrekar, B. Liu, and R. Lazarus. Vision: Towards real time epidemic

vigilance through online social networks: Introducing SNEFT – social network enabled flu trends. In Proceedings of the 1st ACM Workshop on Mobile Cloud Computing & Services: Social Networks and Beyond, MCS '10, pages 4:1–4:5, New York,

NY, USA, 2010. ACM.

[40] L. Chen, K. S. M. T. Hossain, P. Butler, N. Ramakrishnan, and B. A. Prakash. Flu

gone viral: Syndromic surveillance of flu on twitter using temporal topic models.

In 2014 IEEE International Conference on Data Mining, ICDM 2014, Shenzhen,

China, December 14-17, 2014, pages 755–760, 2014.



[41] C. Chew and G. Eysenbach. Pandemics in the Age of Twitter: Content Analysis of

Tweets during the 2009 H1N1 Outbreak. PLoS ONE, 5(11):e14118, 11 2010.

[42] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the Properties of Neu-

ral Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8,

Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation,

pages 103–111, 2014.

[43] N. A. Christakis and J. H. Fowler. Social network sensors for early detection of

contagious outbreaks. PloS one, 5(9):e12948, 2010.

[44] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[45] L. de Weger, T. Beerthuizen, P. Hiemstra, and J. Sont. Development and validation

of a 5-day-ahead hay fever forecast for patients with grass-pollen-induced allergic

rhinitis. International Journal of Biometeorology, 58(6):1047–1055, 2014.

[46] J. Emberlin, J. Mullins, J. Corden, W. Millington, M. Brooke, M. Savage, and

S. Jones. The trend to earlier birch pollen seasons in the UK: a biotic response to

changes in weather conditions? Grana, 36(1):29–33, 1997.

[47] J. U. Espino, W. R. Hogan, and M. M. Wagner. Telephone triage: a timely data

source for surveillance of influenza-like diseases. In AMIA Annual Symposium Pro-

ceedings, page 215, 2003.



[48] G. Eysenbach. Infodemiology: tracking flu-related searches on the web for syndromic

surveillance. In AMIA Annual Symposium Proceedings, page 244, 2006.

[49] Y. Genc, Y. Sakamoto, and J. V. Nickerson. Discovering context: Classifying tweets

through a semantic transform based on wikipedia. In Proceedings of HCI Interna-

tional, 2011.

[50] J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, and L. Brilliant.

Detecting influenza epidemics using search engine query data. Nature, 457:1012–

1014, 2009.

[51] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant

supervision, 2009.

[52] H. Gurulingappa, A. M. Rajput, A. Roberts, J. Fluck, M. Hofmann-Apitius, and

L. Toldo. Development of a benchmark corpus to support the automatic extraction

of drug-related adverse effects from medical case reports. Journal of Biomedical

Informatics, pages 885 – 892, 2012.

[53] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The

WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl., 11(1):10–18,

Nov. 2009.

[54] S. A. Hasan, B. Liu, J. Liu, A. Qadir, K. Lee, V. Datla, A. Prakash, and O. Farri.

Neural clinical paraphrase generation with attention. ClinicalNLP 2016, page 42,

2016.

[55] A. Hulth, G. Rydevik, and A. Linde. Web queries as a source for syndromic surveil-

lance. PloS one, 4(2):e4378, 2009.

[56] IBM SPSS Modeler. http://www-01.ibm.com/software/analytics/spss/

products/modeler/.

[57] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time

of Outbreaks. In Proceedings of the 22nd International Conference on World Wide

Web Companion, pages 1335–1342, 2013.

[58] S. Karimi, A. Metke-Jimenez, M. Kemp, and C. Wang. Cadec: A corpus of adverse

drug event annotations. Journal of Biomedical Informatics, 55:73 – 81, 2015.

[59] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings

of the 2014 Conference on Empirical Methods in Natural Language Processing

(EMNLP), Doha, Qatar, 2014.

[60] S. Kinsella, A. Passant, and J. G. Breslin. Topic classification in social media using

metadata from hyperlinked objects. In Proceedings of the 33rd European conference

on Advances in information retrieval, pages 201–206, 2011.

[61] P. Kostkova. A Roadmap to Integrated Digital Public Health Surveillance: The

Vision and the Challenges. In Proceedings of the 22nd International Conference on

World Wide Web Companion, pages 687–694, 2013.

[62] M. Kuhn, I. Letunic, L. J. Jensen, and P. Bork. The SIDER database of drugs and

side effects. Nucleic Acids Research, 44(Database-Issue):1075–1079, 2016.



[63] A. Lamb, M. J. Paul, and M. Dredze. Separating Fact from Fear: Tracking Flu

Infections on Twitter. In HLT-NAACL, pages 789–795, 2013.

[64] C. E. Lamb, P. H. Ratner, C. E. Johnson, A. J. Ambegaonkar, A. V. Joshi, D. Day,

N. Sampson, and B. Eng. Economic impact of workplace productivity losses due to

allergic rhinitis compared with select medical conditions in the united states from

an employer perspective. Current Medical Research and Opinion, 22(6):1203–1210,

2006. PMID: 16846553.

[65] V. Lampos, T. De Bie, and N. Cristianini. Flu detector-tracking epidemics on Twit-

ter. In Machine Learning and Knowledge Discovery in Databases, pages 599–602.

2010.

[66] S. Le Cessie and J. Van Houwelingen. Ridge estimators in logistic regression. Applied

Statistics, pages 191–201, 1992.

[67] R. Leaman, R. I. Dogan, and Z. Lu. DNorm: disease name normalization with

pairwise learning to rank. Bioinformatics, 29(22):2909–2917, 2013.

[68] K. Lee, A. Qadir, S. A. Hasan, V. Datla, A. Prakash, J. Liu, and O. Farri. Adverse drug event detection in tweets with semi-supervised convolutional neural networks. In Proceedings of the 26th International World Wide Web Conference (WWW 2017), Perth, Australia, 2017.

[69] J. Li and C. Cardie. Early Stage Influenza Detection from Twitter. arXiv preprint

arXiv:1309.7340, 2013.

[70] N. Limsopatham and N. Collier. Normalising medical concepts in social media texts

by learning semantic representation. In Proceedings of the 54th Annual Meeting

of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016,

Berlin, Germany, Volume 1: Long Papers, 2016.

[71] D. Lindberg, B. Humphreys, and A. McCray. The Unified Medical Language System.

Methods of Information in Medicine, 32(4):281–291, 1993.

[72] S. Magruder. Evaluation of over-the-counter pharmaceutical sales as a possible early

warning indicator of human disease. Johns Hopkins APL technical digest, 24(4):349–

53, 2003.

[73] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval.

Cambridge University Press, New York, NY, USA, 2008.

[74] A. McCallum, K. Bellare, and F. C. N. Pereira. A conditional random field

for discriminatively-trained finite-state string edit distance. CoRR, abs/1207.1406,

2012.

[75] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press, 1998.

[76] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word repre-

sentations in vector space. CoRR, abs/1301.3781, 2013.



[77] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed represen-

tations of words and phrases and their compositionality. In Proceedings of the 27th

Annual Conference on Neural Information Processing Systems NIPS 2013, 2013.

[78] R. Narayanan. Mining Text for Relationship Extraction and Sentiment Analysis.

PhD thesis, 2010.

[79] A. Nikfarjam, A. Sarker, K. O’Connor, R. Ginn, and G. Gonzalez. Pharmacovig-

ilance from social media: mining adverse drug reaction mentions using sequence

labeling with word embedding cluster features. Journal of the American Medical

Informatics Association, 22:671–681, 2015.

[80] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using

machine learning techniques. In Proceedings of the ACL-02 conference on Empirical

methods in natural language processing-Volume 10, pages 79–86. Association for

Computational Linguistics, 2002.

[81] M. J. Paul and M. Dredze. You Are What You Tweet: Analyzing Twitter for Public

Health. In ICWSM, 2011.

[82] R. Pawankar, G. W. Canonica, S. T. Holgate, and R. F. Lockey.

WAO White Book on Allergy. http://www.worldallergy.org/UserFiles/file/WAO-White-Book-on-Allergy_web.pdf, 2011.

[83] F. Pervaiz, M. Pervaiz, N. Abdur Rehman, and U. Saif. FluBreaks: Early Epidemic

Detection from Google Flu Trends. J Med Internet Res, 14(5):e125, Oct 2012.

[84] P. M. Polgreen, Y. Chen, D. M. Pennock, F. D. Nelson, and R. A. Weinstein. Using

internet searches for influenza surveillance. Clinical infectious diseases, 47(11):1443–

1448, 2008.

[85] A. Prakash, S. A. Hasan, K. Lee, V. V. Datla, A. Qadir, J. Liu, and O. Farri.

Neural paraphrase generation with stacked residual LSTM networks. In COLING

2016, 26th International Conference on Computational Linguistics, Proceedings of

the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages

2923–2934, 2016.

[86] A. Prakash, S. Zhao, S. A. Hasan, V. Datla, K. Lee, A. Qadir, J. Liu, and O. Farri.

Condensed memory networks for clinical diagnostic inferencing. In The 31st AAAI

Conference on Artificial Intelligence (AAAI 2017), 2017.

[87] J. Quinlan. Improved use of continuous attributes in C4.5. arXiv preprint cs/9603103, 1996.

[88] E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 20(5):522–532, May 1998.

[89] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time

event detection by social sensors. In Proceedings of the 19th international conference

on World wide web, pages 851–860, 2010.



[90] H. Sampathkumar, X. Chen, and B. Luo. Mining adverse drug reactions from on-

line healthcare forums using hidden markov model. BMC Medical Informatics and

Decision Making, 14(1):1–18, 2014.

[91] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling.

Twitterstand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL Inter-

national Conference on Advances in Geographic Information Systems, pages 42–51,

2009.

[92] A. Sarker and G. Gonzalez. Portable automatic text classification for adverse drug

reaction detection via multi-corpus training. Journal of Biomedical Informatics,

53:196 – 207, 2015.

[93] J. Shaman, A. Karspeck, W. Yang, J. Tamerius, and M. Lipsitch. Real-time in-

fluenza forecasts during the 2012–2013 season. Nature Communications, 4, Dec.

2013.

[94] R. L. Siegel, K. D. Miller, and A. Jemal. Cancer statistics, 2017. CA: A Cancer

Journal for Clinicians, 67(1):7–30, 2017.

[95] A. Signorini, A. M. Segre, and P. M. Polgreen. The Use of Twitter to Track Levels

of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1

Pandemic. PLoS ONE, 6(5):e19467, 05 2011.

[96] M. Sofean and M. Smith. A real-time architecture for detection of diseases using

social networks: Design, implementation and evaluation. In Proceedings of the 23rd



ACM Conference on Hypertext and Social Media, HT ’12, pages 309–310, New York,

NY, USA, 2012. ACM.

[97] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text

classification in Twitter to improve information filtering. In Proceedings of the 33rd

international ACM SIGIR conference on Research and development in information

retrieval, pages 841–842, 2010.

[98] R. Sugumaran and J. Voss. Real-time Spatio-temporal Analysis of West Nile Virus

Using Twitter Data. In Proceedings of the 3rd International Conference on Com-

puting for Geospatial Research and Applications, pages 39:1–39:2, 2012.

[99] S. Tuarob, C. S. Tucker, M. Salathe, and N. Ram. Discovering health-related knowl-

edge in social media using ensembles of heterogeneous features. In Proceedings of

the 22nd ACM International Conference on Information & Knowledge Manage-

ment, CIKM ’13, pages 1685–1690, New York, NY, USA, 2013. ACM.

[100] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/.

[101] M. Wilson, S. Villalba, H. Avila, J. Hahn, and A. Cepeda. Correlation between

atmospheric tree pollen levels with three weather variables during 2002-2004 in a

tropical urban area. Journal of Allergy and Clinical Immunology, 127(2):AB170,

2011.

[102] W. Xing and A. Ghorbani. Weighted PageRank algorithm. 2004.



[103] S. R. Yerva, Z. Miklós, and K. Aberer. What have fruits to do with technology?: the

case of orange, blackberry and apple. In Proceedings of the International Conference

on Web Intelligence, Mining and Semantics, 2011.

[104] Q. Yuan, E. O. Nsoesie, B. Lv, G. Peng, R. Chunara, and J. S. Brownstein. Mon-

itoring Influenza Epidemics in China with Search Query from Baidu. PLoS ONE,

8:e64323, 05 2013.
