
Identifying and Categorizing Disaster-Related Tweets

Kevin Stowe, Michael Paul, Martha Palmer, Leysia Palen, Ken Anderson
University of Colorado, Boulder, CO 80309
[kest1439, mpaul, mpalmer, palen, kena]@colorado.edu

Abstract

This paper presents a system for classifying disaster-related tweets. The focus is on Twitter data generated before, during, and after Hurricane Sandy, which impacted New York in the fall of 2012. We propose an annotation schema for identifying relevant tweets as well as the more fine-grained categories they represent, and develop feature-rich classifiers for relevance and fine-grained categorization.

1 Introduction

Social media provides a powerful lens for identifying people's behavior, decision-making, and information sources before, during, and after wide-scope events, such as natural disasters (Becker et al., 2010; Imran et al., 2014). This information is important for identifying what information is propagated through which channels, and what actions and decisions people pursue. However, so much information is generated from social media services like Twitter that filtering of noise becomes necessary.

Focusing on the 2012 Hurricane Sandy event, this paper presents classification methods for (i) filtering tweets relevant to the disaster, and (ii) categorizing relevant tweets into fine-grained categories such as preparation and evacuation. This type of automatic tweet categorization can be useful both during and after disaster events. During events, tweets can help crisis managers, first responders, and others take effective action. After the event, analysts can use social media information to understand people's behavior during the event. This type of understanding is of critical importance for improving risk communication and protective decision-making leading up to and during disasters, and thus for reducing harm (Demuth et al., 2012).

Our experiments show that such tweets can be classified accurately, and that combining a variety of linguistic and contextual features can substantially improve classifier performance.

2 Related Work

2.1 Analyzing Disasters with Social Media

A number of researchers have used social media as a data source to understand various disasters (Yin et al., 2012; Kogan et al., 2015), with applications such as situational awareness (Vieweg et al., 2010; Bennett et al., 2013) and understanding public sentiment (Doan et al., 2012). For a survey of social media analysis for disasters, see Imran et al. (2014).

Closely related to this work is that of Verma et al. (2011), who constructed classifiers to identify tweets that demonstrate situational awareness in four datasets (Red River floods of 2009 and 2010, the Haiti earthquake of 2010, and Oklahoma fires of 2009). Situational awareness is important for those analyzing social media data, but it does not encompass the entirety of people's reactions. A primary goal of our work is to capture tweets that relate to a hazard event, regardless of situational awareness.

2.2 Tweet Classification

Identifying relevant information in social media is challenging due to the low signal-to-noise ratio. A number of researchers have used NLP to address this challenge.
There is significant work in the medical domain related to identifying health crises and events in social media data. Multiple studies have been done to analyze flu-related tweets (Culotta, 2010; Aramaki et al., 2011). Most closely related to our work (but in a different domain) is the flu classification system of Lamb et al. (2013), which first classifies tweets for relevance and then applies finer-grained classifiers.

Similar systems have been developed to categorize tweets in more general domains, for example by identifying tweets related to news, events, and opinions (Sankaranarayanan et al., 2009; Sriram et al., 2010). Similar classifiers have been developed for sentiment analysis (Pang and Lee, 2008) to identify and categorize sentiment-expressing tweets (Go et al., 2009; Kouloumpis et al., 2011).

3 Data

3.1 Collection

In late October 2012, Hurricane Sandy generated a massive, dispersed reaction in social media channels, with many users expressing their thoughts and actions taken before, during, and after the storm. We performed a keyword collection for this event, capturing all tweets using the following keywords from October 23, 2012 to April 5, 2013:

DSNY, cleanup, debris, frankenstorm, garbage, hurricane, hurricanesandy, lbi, occupysandy, perfectstorm, sandy, sandycam, stormporn, superstorm

22.2M unique tweets were collected from 8M unique Twitter users. We then identified 100K users with a geo-located tweet in the time leading up to the landfall of the hurricane, and gathered all tweets generated by those users, creating a dataset of 205M tweets produced by 92.2K users. We randomly selected 100 users from approximately 8,000 users who: (i) tweeted at least 50 times during the data collection period, and (ii) posted at least 3 geo-tagged tweets from within the mandatory evacuation zones in New York City. It is critical to filter the dataset to focus on users that were at high risk, and this first pass allowed us to lower the percentage of users that were not in the area and thus not affected by the event. Our dataset includes all tweets from these users, not just tweets containing the keywords. Seven users were removed for having predominantly non-English tweets. The final dataset contained 7,490 tweets from 93 users, covering a 17-day time period starting one week before landfall (October 23rd to November 10th). Most tweets were irrelevant: Halloween, as well as the upcoming presidential election, yielded a large number of tweets not related to the storm, despite the collection bias toward Twitter users from affected areas.

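To make these steps concrete, the following is a minimal sketch of the keyword filter and the user-selection pass, not the authors' actual pipeline; the tweet fields (text, user_id, geotag) and the thresholds are illustrative assumptions, and the geotag check stands in for the evacuation-zone test.

```python
# Minimal sketch of the two-stage collection filter described above, not the
# authors' actual pipeline. The tweet fields (text, user_id, geotag) and the
# thresholds are illustrative assumptions; the geotag check stands in for the
# evacuation-zone test.
import re

KEYWORDS = {"dsny", "cleanup", "debris", "frankenstorm", "garbage", "hurricane",
            "hurricanesandy", "lbi", "occupysandy", "perfectstorm", "sandy",
            "sandycam", "stormporn", "superstorm"}

def matches_keywords(text):
    """True if the tweet text contains any of the collection keywords."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return bool(tokens & KEYWORDS)

def select_users(tweets, min_tweets=50, min_geotagged=3):
    """Keep users with enough overall activity and enough geotagged tweets."""
    by_user = {}
    for tweet in tweets:
        by_user.setdefault(tweet["user_id"], []).append(tweet)
    selected = []
    for user, user_tweets in by_user.items():
        n_geo = sum(1 for t in user_tweets if t.get("geotag") is not None)
        if len(user_tweets) >= min_tweets and n_geo >= min_geotagged:
            selected.append(user)
    return selected
```
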
3.2 Annotation Schema

Tweets were annotated with a fine-grained, multi-label schema developed in an iterative process with domain experts, social scientists, and linguists who are members of our larger project team. The schema was designed to annotate tweets that reflect the attitudes, information sources, and protective decision-making behavior of those tweeting. This schema is not exhaustive (anything deemed relevant that did not fall into an annotation category was marked as Other), but it is much richer than previous work. Tweets that were not labeled with any category were considered irrelevant (and as such, considered negative examples for relevance classification). Two additional categories, reporting on family members and referring to previous hurricane events, were seen as important to the event, but were very rare in the data (34 of 7,490 total tweets). Tweets could be labeled with any of the following categories:

Sentiment: Tweets that express emotions or personal reactions towards the event, such as humor, excitement, frustration, worry, condolences, etc.

Action: Tweets that describe physical actions taken to prepare for the event, such as powering phones, acquiring generators or alternative power sources, and buying other supplies.

Preparation: Tweets that describe making plans in preparation for the storm, including those involving altering plans.

Reporting: Tweets that report first-hand information available to the tweeter, including reporting on the weather and the environment around them, as well as the observed social situations.

Information: Tweets that share or seek information from others (including public officials). This category is distinct from Reporting in that it only includes information received or requested from outside sources, and not information perceived first-hand.

Movement: Tweets that mention evacuation or sheltering behavior, including mentions of leaving, staying in place, or returning from another location. Tweets about movement are rare, but especially important in determining a user's response to the event.

3.3 Annotation Results

Two annotators were trained by domain experts using 726 tweets collected for ten Twitter users. Annotation involved a two-step process: first, tweets were labeled for relevance, and then relevant tweets were labeled with the fine-grained categories described above. The annotators were instructed to use the linguistic information, including context of previous and following tweets, as well as the information present in links and images, to determine the appropriate category. A third annotator provided a deciding vote to resolve disagreements.

Table 1 shows the label proportions and annotator agreement for the different tasks. Because each tweet could belong to multiple categories, κ scores were calculated based on agreement per category: if a tweet was marked by both annotators as a particular category, it was marked as agreement for that category. Agreement was only moderate for relevance (κ = .569). Many tweets did not contain enough information to easily distinguish them, for example: “tryin to cure this cabin fever!” and “Thanks to my kids for cleaning up the yard” (edited to preserve privacy). Without context, it is difficult to determine whether these tweeters were dealing with hurricane-related issues.

Agreement was higher for fine-grained tagging (κ = .814). The hardest categories were the rarest (Preparation and Movement), with most confusions between Preparation, Reporting, and Sentiment.¹

¹ Dataset available at https://github.com/kevincstowe/chime-annotation

Category                   Count   % tweets   Agreement
Relevance
  Relevance                 1757    23.5%     48.6% (κ=.569)
Fine-Grained Annotations
  Reporting                 1369    77.9%     80.2% (κ=.833)
  Sentiment                  786    44.7%     71.8% (κ=.798)
  Information                600    34.1%     89.8% (κ=.934)
  Action                     295    16.8%     72.5% (κ=.827)
  Preparation                188    10.7%     41.1% (κ=.565)
  Movement                    53     3.0%     43.3% (κ=.600)

Table 1: The number and percentage of tweets for each label, along with annotator agreement.

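The per-category agreement described above can be computed as in the sketch below; the input format (one set of labels per tweet per annotator) and the use of scikit-learn's cohen_kappa_score are our assumptions, not the authors' tooling.

```python
# Sketch of the per-category agreement computation: because the schema is
# multi-label, kappa is computed independently for each category over binary
# "was this label assigned" decisions. Input format is an assumption.
from sklearn.metrics import cohen_kappa_score

def per_category_kappa(annotator_a, annotator_b, categories):
    """annotator_a, annotator_b: lists of label sets, one set per tweet."""
    scores = {}
    for category in categories:
        a = [1 if category in labels else 0 for labels in annotator_a]
        b = [1 if category in labels else 0 for labels in annotator_b]
        scores[category] = cohen_kappa_score(a, b)
    return scores
```
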
4 Classification

We trained binary classifiers for each of the categories in Table 1, using independent classifiers for each of the fine-grained categories (for which a tweet may have none, or multiple).

4.1 Model Selection

Our baseline features are the counts of unigrams in tweets, after preprocessing to remove capitalization, punctuation, and stopwords. We initially experimented with different classification models and feature selection methods using unigrams for relevance classification. We then used the best-performing approach for the rest of our experiments. 10% of the data was held out as a development set to use for these initial experiments, including parameter optimization (e.g., SVM regularization).

We assessed three classification models that have been successful in similar work (Verma et al., 2011; Go et al., 2009): support vector machines (SVMs), maximum entropy (MaxEnt) models, and Naive Bayes. We experimented with both the full feature set of unigrams and a truncated set produced by standard feature selection techniques: removing rare words (frequency below 3) and selecting the n words with the highest pointwise mutual information between the word counts and document labels.

Each option was evaluated on the development data. Feature selection was substantially better than using all unigrams, with the SVM yielding the best F1 performance. For the remaining experiments, SVM with feature selection was used.

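A sketch of this setup under stated assumptions is given below: unigram counts with rare words removed, PMI-based selection of the top-n terms, and a linear SVM. The tokenizer, the value of n, and the scikit-learn classes are illustrative choices, not the paper's exact configuration.

```python
# Sketch of the selected setup: unigram counts with rare words removed,
# PMI-based selection of the top-n terms, and a linear SVM. Parameter values
# and library classes are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def pmi_select(X, y, vocab, n=1000):
    """Rank vocabulary terms by PMI between term presence and the positive label."""
    X = (X > 0).astype(float)                      # binary term presence
    y = np.asarray(y, dtype=float)
    p_term = X.mean(axis=0).A1                     # P(term)
    p_pos = y.mean()                               # P(label = 1)
    p_joint = X[y == 1].mean(axis=0).A1 * p_pos    # P(term, label = 1)
    pmi = np.log((p_joint + 1e-12) / (p_term * p_pos + 1e-12))
    return [vocab[i] for i in np.argsort(-pmi)[:n]]

def train_relevance_classifier(texts, labels, n_terms=1000):
    """Fit unigram counts, keep the top-n PMI terms, and train a linear SVM."""
    vec = CountVectorizer(lowercase=True, stop_words="english", min_df=3)
    X = vec.fit_transform(texts)
    keep = pmi_select(X, labels, vec.get_feature_names_out(), n=n_terms)
    vec = CountVectorizer(vocabulary=keep)
    clf = LinearSVC(C=1.0).fit(vec.fit_transform(texts), labels)
    return vec, clf
```

The same selection-plus-SVM pipeline can then be reused for each fine-grained category, since those are trained as independent binary classifiers.
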
4.2 Features

In addition to unigrams, bigram counts were added (using the feature selection described above), as well as:

• The time of the tweet is particularly relevant to the classification, as tweets during and after the event are more likely to be relevant than those before. The day/hour of the tweet is represented as a one-hot feature vector.

• We indicate whether a tweet is a retweet (RT), which is indicative of information-sharing rather than first-hand experience.

• Each URL found within a tweet was stripped to its base domain and added as a lexical feature.

• The annotators noted that context was important in classification. The unigrams from the previous tweet and previous two tweets were considered as features.

• We included n-grams augmented with their part-of-speech tags, as well as named entities, using the Twitter-based tagger of Ritter et al. (2011).

• Word embeddings have been used extensively in recent NLP work, with promising results (Goldberg, 2015). A Word2Vec model (Mikolov et al., 2013) was trained on the 22.2M tweets collected from the Hurricane Sandy dataset with the Gensim package (Řehůřek and Sojka, 2010), using the C-BOW algorithm with negative sampling (n=5), a window of 5, and 200 dimensions per word. For each tweet, the mean embedding of all words was used to create 200 features (see the sketch after this list).

• The work of Verma et al. (2011) found that formal, objective, and impersonal tweets were useful indicators of situational awareness, and as such developed classifiers to tag tweets with four different categories: formal vs. informal, subjective vs. objective, personal vs. impersonal, and situational awareness vs. not. We used these four Verma classifiers to tag our Hurricane Sandy dataset and included these tags as features.

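The embedding features above can be sketched as follows; Gensim parameter names follow the current API (vector_size, sg=0 for CBOW), and the whitespace tokenizer is a stand-in for whatever preprocessing was actually used, so this is not the authors' exact setup.

```python
# Sketch of the embedding features: train a CBOW Word2Vec model on the tweet
# corpus and represent each tweet as the mean of its word vectors. Gensim
# parameter names follow the current API; the tokenizer is an assumption.
import numpy as np
from gensim.models import Word2Vec

def tokenize(text):
    return text.lower().split()

def train_embeddings(tweet_texts):
    sentences = [tokenize(t) for t in tweet_texts]
    return Word2Vec(sentences, vector_size=200, window=5, sg=0,  # sg=0 -> CBOW
                    negative=5, min_count=1, workers=4)

def mean_embedding(text, model):
    """200-dimensional mean of the word vectors; zeros if no word is in the vocabulary."""
    vectors = [model.wv[w] for w in tokenize(text) if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
```

In the full system, these 200 mean-embedding values are combined with the lexical and contextual features described in the other bullets.
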

4.3 Classification Results

Classification performance was measured using five-fold cross-validation. We conducted an ablation study (Figure 1), removing individual features to determine which contributed to performance. Table 2 shows the cross-validation results using the baseline feature set (selected unigrams only), all features, and the best feature set (features which had a significant effect in the ablation study). In all categories except for Movement, the best features improved over the baseline with p < .05.

Figure 1: Negated difference in F1 for each feature removed from the full set (positive indicates improvement).

             Baseline          All Features      Best Features
             F1   P    R       F1   P    R       F1   P    R
Relevance    .66  .80  .56     .71  .81  .64     .72  .79  .66
Actions      .26  .44  .19     .39  .46  .35     .41  .42  .40
Information  .33  .57  .24     .48  .57  .41     .49  .50  .49
Movement     .04  .04  .04     .07  .10  .07     .08  .10  .07
Preparation  .30  .44  .23     .36  .41  .32     .36  .38  .35
Reporting    .52  .76  .40     .73  .71  .75     .75  .71  .80
Sentiment    .37  .64  .26     .53  .58  .49     .52  .52  .52

Table 2: Results for relevance and fine-grained classification.

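A sketch of how such an evaluation might be run is shown below: five-fold cross-validated F1 for the full feature set and with each feature group removed in turn. build_features and FEATURE_GROUPS are hypothetical placeholders for the feature extraction of Section 4.2, not code from the paper.

```python
# Sketch of the evaluation: five-fold cross-validated F1 for the full feature
# set and with each feature group removed in turn (the ablation summarized in
# Figure 1). build_features and FEATURE_GROUPS are hypothetical placeholders.
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

FEATURE_GROUPS = ["ngrams", "time", "retweet", "url", "context",
                  "pos", "embeddings", "verma"]

def ablation_f1(tweets, labels, build_features):
    """Return mean F1 with all features, and with each group left out."""
    clf = LinearSVC(C=1.0)
    results = {"all": cross_val_score(clf, build_features(tweets, FEATURE_GROUPS),
                                      labels, cv=5, scoring="f1").mean()}
    for group in FEATURE_GROUPS:
        kept = [g for g in FEATURE_GROUPS if g != group]
        X = build_features(tweets, kept)
        results["without_" + group] = cross_val_score(clf, X, labels,
                                                      cv=5, scoring="f1").mean()
    return results
```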

4.4 Performance Analysis

Time, context, and word embedding features help relevance classification. Timing information is helpful for distinguishing certain categories (e.g., Preparation happens before the storm while Movement can happen before or after). Context was also helpful, consistent with annotator observations. A larger context window would theoretically be more useful, as we noted that distant tweets influenced annotation choices, but with this relatively small dataset, increasing the context window also prohibitively increased the sparsity of the feature.

Retweets and URLs were not generally useful, likely because the information was already captured by the lexical features. Part-of-speech tags yielded minimal improvements, perhaps because the lexical features critical to the task are unambiguous (e.g., “hurricane” is always a noun). The features from Verma et al. (2011) also did not help, perhaps because these classifiers had only moderate performance to begin with and were being extended to a new domain.

Fine-grained classification was much harder. Lexical features (bigrams and key terms) were useful for most categories, with other features providing minor benefits. Word embeddings greatly improved performance across all categories, while most other features had mixed results. This is consistent with our expectations of latent semantics: tweets within the same category tend to contain similar lexical items, and word embeddings allow this similarity to be captured despite the limited size of the dataset.

The categories that were most confused were Information and Reporting, and the categories with the worst performance were Movement, Actions, and Preparation. Movement simply lacks data, with only 53 labeled instances. Actions and Preparation contain wide varieties of tweets, and thus patterns to distinguish them are sparse. More training data would help fine-grained classification, particularly for Actions, Preparation, and Movement.

Classification for Reporting performs much better than for the other categories. This is likely because these tweets tend to fall into regular patterns: they often use weather- and environment-related lexical items like “wind” and “trees”, and frequently contain links to images. They are also relatively frequent, making their patterns easier to identify.

4.5 Performance in Other Domains

To see how well our methods work on other datasets, we compared our model to the situational awareness classification in the Verma et al. (2011) datasets described above. We replicated the original Verma et al. (2011) model with similar results, and then adjusted the model to incorporate features that performed positively in our experiments, creating an "extended" model. This entailed adding the mean word embeddings for each tweet, as well as adjusting the unigram model to incorporate only key terms selected by PMI. Verma et al. (2011) report only accuracy, which our system improves marginally, while these modifications greatly improved F1, as shown in Table 3.

      Verma Acc   Ext. Acc   Verma F1   Ext. F1
SA    .845        .856       .423       .551

Table 3: Comparison of the replicated Verma et al. (2011) situational awareness (SA) classifier and our extended model.

5 Conclusion

Compared to the most closely related work of Verma et al. (2011), our proposed classifiers are both more general (identifying all relevant tweets, not just situational awareness) and richer (with fine-grained categorizations). Our experimental results show that it is possible to identify relevant tweets with high precision while maintaining fairly high recall. Fine-grained classification proved much more difficult, and additional work will be necessary to define appropriate features and models to detect more specific categories of language use. Data sparsity also causes difficulty, as many classes lack the positive examples necessary to classify them reliably, and we continue to work on further annotation to alleviate this issue.

Our primary research aim is to leverage both relevance classification and fine-grained classification to assist crisis managers and first responders. The preliminary results show that relevant information can be extracted automatically via batch processing after events, and we aim to continue exploring possibilities to extend this approach to real-time processing. To make this research more applicable, we aim to produce a real-time processing system that can provide accurate classification during an event rather than after, and to apply the current results to other events and domains.

References

Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita. 2011. Twitter catches the flu: Detecting influenza epidemics using Twitter. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1568–1576.

Hila Becker, Mor Naaman, and Luis Gravano. 2010. Learning similarity metrics for event identification in social media. In ACM International Conference on Web Search and Data Mining (WSDM), pages 291–300.

K. J. Bennett, J. M. Olsen, S. Harris, S. Mekaru, A. A. Livinski, and J. S. Brownstein. 2013. The perfect storm of information: Combining traditional and non-traditional data sources for public health situational awareness during hurricane response. PLoS Currents, 5.

Aron Culotta. 2010. Towards detecting influenza epidemics by analyzing Twitter messages. In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, pages 115–122, New York, NY, USA. ACM.

Julie L. Demuth, Rebecca E. Morss, Betty Hearn Morrow, and Jeffrey K. Lazo. 2012. Creation and communication of hurricane risk information. Bulletin of the American Meteorological Society, 93(8):1133–1145.

Son Doan, Bao Khanh Ho Vo, and Nigel Collier. 2012. An analysis of Twitter messages in the 2011 Tohoku earthquake. In Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, volume 91 LNICST, pages 58–66.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1:12.

Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.

Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2014. Processing social media messages in mass emergency: A survey. arXiv preprint arXiv:1407.7071.

Marina Kogan, Leysia Palen, and Kenneth M. Anderson. 2015. Think local, retweet global: Retweeting by the geographically-vulnerable during Hurricane Sandy. In ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW).

Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the omg! ICWSM, 11:538–541.

Alex Lamb, Michael J. Paul, and Mark Dredze. 2013. Separating fact from fear: Tracking flu infections on Twitter. In HLT-NAACL, pages 789–795.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534.

Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, and Jon Sperling. 2009. TwitterStand: News in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 42–51. ACM.

Bharath Sriram, Dave Fuhry, Engin Demir, Hakan Ferhatosmanoglu, and Murat Demirbas. 2010. Short text classification in Twitter to improve information filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 841–842. ACM.

Sudha Verma, Sarah Vieweg, William J. Corvey, Leysia Palen, James H. Martin, Martha Palmer, Aaron Schram, and Kenneth Mark Anderson. 2011. Natural language processing to the rescue? Extracting “situational awareness” tweets during mass emergency. In ICWSM.

Sarah Vieweg, Amanda L. Hughes, Kate Starbird, and Leysia Palen. 2010. Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In CHI.

Jie Yin, Andrew Lampert, Mark Cameron, Bella Robinson, and Robert Power. 2012. Using social media to enhance emergency situation awareness. IEEE Intelligent Systems, 27(6):52–59.
