You are on page 1of 8

The Workshops of the Tenth International AAAI Conference on Web and Social Media

Social Media in the Newroom: Technical Report WS-16-19

Just the Facts with PALOMAR:


Detecting Protest Events in Media Outlets and Twitter
Konstantina Papanikolaou1, Haris Papageorgiou1, Nikos Papasarantopoulos1,
Theoni Stathopoulou2, George Papastefanatos3
1 Institute for Language and Speech Processing/Athena RC, Athens, Greece
2 National Center of Social Research, Athens, Greece
3
Athena RC, Athens, Greece
{konspap, xaris, npapasa}@ilsp.gr, theosta@ekke.gr, gpapas@imis.athena-innovation.gr

Abstract Formerly, the main source for events discovery were the
The volume and velocity of available online sources have various news agencies. With the advent of Social Media
changed journalistic research in terms of cost and effort re- (SM), the landscape has drastically changed reformulating
quired for discovering stories. However, the heterogeneity the way people share information and communicate
and veracity of data sources pose further obstacles in
worldwide. SM data constitutes an intriguing source for
knowledge extraction making it a hard task to handle. The
purpose of this study is threefold. Firstly, we present a plat- journalists, as it can help them discover interesting topics
form for automated data processing in the context of Com- for articles, check the validity and the veridicality through
putational Journalism. We then propose a general method- different sources, measure news stories ideological lean-
ology for event extraction from different data sources. Fi- ings, detect controversy, analyze event spread and impact
nally, we conducted a pilot implementation of our method-
or forecast civil unrest. However, the processing of this
ology for protest events extraction from news and Twitter
data. Evaluation showed promising results, indicating the data is not easy to handle. Since natural language is in-
feasibility of our approach. volved, special tools are needed for event discovery. Fur-
thermore, social media content is noisy, in terms of a great
percentage of non-specific or trivial information posts.
1 Introduction The contribution of this paper is threefold: (a) to present
Event extraction has been a very critical yet challenging the design and the big data architecture of PALOMAR, an
task for Information Extraction systems. It responds to automated Computational Journalism platform for scalable
needs that trigger interdisciplinary research with the goal processing of data streams of news sources and social me-
of automating the collection and curation of knowledge dia, (b) to describe an innovative, semi-supervised meth-
that is related to certain domains. Nowadays, it is being odology for the detection of certain event types together
extensively used by social and political sciences (Schrodt with their structural components. Here, we propose a data-
and Van Brackle 2013), as well as in the biomedical field driven yet linguistically based framework grounded on
(Ananiadou et al. 2010); at the same time events are the social and political sciences incorporating a human-in-the-
principal focus of journalists in their daily work. Many loop. (c) To report on a pilot implementation of the meth-
approaches have been reported, spanning machine learn- odology concerning a longitudinal study of protest events
ing, statistical, and knowledge-driven approaches which enabling the connection and analysis of protests in news
are mainly pattern-based. Despite the proliferation of re- and social media. The outcome of this analysis is subse-
search efforts, it remains a non-trivial task considering its quently visualized and linked in an intuitive way. Moreo-
possible applications and analysis possibilities, given the ver, an internal and external evaluation of the pilot is also
emergence of data sources requiring novel ways of curat- documented.
ing. Thus, journalists can use PALOMAR to exploit hetero-
geneous sources and discover interesting topics for articles,
Copyright © 2016, Association for the Advancement of Artificial Intelli- examining possible bursts emerging in SM that are not
gence (www.aaai.org). All rights reserved. recorded by news data or vice versa. This is facilitated by

135
the multiple visualizations that the journalists themselves PALOMAR is a service-oriented infrastructure offer-
can produce in an interactive tool. ing an operational platform to enable the execution of text
The paper is structured as follows: the system architec- mining services in a seamless, interoperable way, through
ture is described in section 2, and the event extraction the use of existing cloud based infrastructure. On the Data
methodology in section 3. In section 4, details of the pilot Repository side, the PALOMAR infrastructure is powered
implementation are provided. Finally, Related Work (sec- by the ELK stack1 (Elastic, Logstash, Kibana), a platform
tion 5) and Conclusions (section 6) are discussed. for indexing, searching and analyzing data 2. In addition,
extracted metadata and generated analysis results are also
archived and curated. The Data Repository is initialized
2 System Architecture with private and public datasets. The Data Analytics Lay-
The objectives as described in the introductory section are er integrates Language Technology tools and workflows
traditionally approached via costly and non-reproducible for processing the datasets curated in the Data Repository
coding of documents. However, this type of coding does and generating a wealth of valuable metadata from these
not scale when confronted with the vast amount of textual sources, including the extraction of specific events, actors,
material that is available in news archives or generated topics, quotations and opinions, according to coding
daily in the social media. The requirements for journalists frameworks like the PROMAP (Stathopoulou forthcoming)
who would code materials of such magnitude is prohibitive described in section 4.1. PALOMAR s third layer, the
in terms of cost and size. We therefore develop PALO- Modeling Dashboard is the place where content-based
MAR in order to support these objectives by employing metadata and network analysis are synthesized in order to
innovative technologies that will help journalists extract provide journalists with diverse ways of observing evolv-
historical evidence from large textual collections and equip ing patterns and trends. Finally, evaluation of the analysis
them with robust tools for providing explanations to ques- is offered to journalists in the form of intuitive, exportable
tions of interest. visualizations providing insights on the correlation be-
The PALOMAR Data Analysis and Modeling Platform tween diverse variables of interest, the dynamics of events
is an innovative, automated Computational Journalism on social media streams, etc. The visualization compo-
platform encompassing various scientific instruments, a nent3 includes timelines of configurable temporal win-
wealth of datasets enriched with metadata automatically dows, information maps, the ability to zoom in-out to spe-
produced by reliable analytics workflows and key insights cific textual collections or to filter results using sub-queries
revealed by modeling and visualization tools. Figure 1 etc., which help journalists understand the dynamics of
provides a conceptual overview of the architecture of the events at different dimensions.
platform together with its key systems and components.
3 Event Extraction
As mentioned above, the Data Analytics Layer activates
scalable text analytics workflows in order to extract
knowledge graphs including domain-specific events among
others. The event extraction methodology we propose con-
sists of the following phases:
At first, the coding framework covering a wide spectrum
of event types is designed by news experts. The codebook
we used for the pilot formalizes the topology of certain
event types of interest and it is grounded on related work in
the political and social sciences (cf. section 5). In parallel,
the Data Repository caters for the collection and the cura-
tion of the data sources including data cleansing, pre-
processing and preparation for further analysis. At the next
phase, journalists interact with the pre-processed data
through the Data Repository stack in order to explore the
instantiation of the event types and their structural compo-
nents, filter them according to the scope of their research,

1
Figure 1: The PALOMAR architecture http://bit.ly/20VWsvi
2
http://palomar.ilsp.gr/palomar
3
http://bit.ly/1Q1WUVU

136
generate event-oriented data clusters, while at the same The PROMAP Codebook was primarily based on the
time enriching and improving the codebook in an iterative codebook used for Europub project5 on the transformation
fashion. Based on the improved codebook and the event- of political mobilization and communication in European
oriented data clusters, a scalable, semi-supervised text public sphere (Koopmans 2002). According to the code-
workflow leverages linguistic structure in order to auto- book, the analysis unit for mapping protest events, is the
matically extract event instances and create the event data- Claim which consists of six distinct elements: Form, Ac-
base. The PALOMAR modeling dashboard with its visual- tor, Addressee, Issue, Location and Time. From an event
ization component facilitates journalists in their explora- extraction point of view, the Claim is an event tuple com-
tion of the database, providing events in context and mak- prising six different information types. The essential ele-
ing them understandable and interpretable for their re- ment of each Claim is the Form, which represents a form
search. Finally, the automatically generated event database of action, i.e. an event type. 19 Forms were examined, in
is further statistically analyzed and evaluated against a particular: Policing, Petitioning/Signature collection,
small testing set. Strike, Half-day strike, Work-to-rule, Abstention from du-
A pilot implementation of the above described method- ties, Withholding of labor, March/Demonstration/Public
ology was conducted concerning protest events and aiming assembly, Motorized march, Sit-down, Suspension of ser-
to evaluate the results so as to further extend it to other vice, Boycott, Disturbance of meetings, Blockade, Occupa-
event types. This implementation took place in two stages. tion/Sit-in, Hunger strike, Prison riot, Symbolic violence
Initially, in collaboration with social scientists, protest and Arson, bomb attacks, destruction of property.
events were extracted from news data using a linguistically As far as the rest of the elements are concerned, Actor is
based approach, which demonstrated significant evaluation classified into categories according to its role or status at
results. This gave us the impetus to experiment with a the time the Claim is formulated, e.g. Government, Enter-
completely different data source, namely Twitter, for the prises, Tertiary Trade Unions. Addressee, is categorized
same purpose, extracting protest events. Our event extrac- the same way as Actor. The Issue, which is the subject
tion workflow consists of the following processes: (i) at a matter of protest, is recorded as found in the text, Location,
first stage, the codebook was developed, containing the refers to the city, or the exact place where an event took
protest events taxonomy and their complementary ele- place, with the limitation of that being in Greece. Lastly,
ments. (ii) Data were collected from different sources and Time, is the event s date.
ETL transformations turned them into a human readable All of the above information types comprised the event
corpus enabling researchers to efficiently and effectively template, namely a tuple of the following type:
investigate it. (iii) Event Extraction includes pre- <Form, Actor, Addressee, Issue, Time, Location>
processing and a cascade of transducers implemented as However, not all of the elements were needed to record
Gate JAPE4 patterns which identify events and their con- an event. The crucial element, as mentioned, was the Form
stituents. (iv) Finally, results were visualized in the Palo- and at least one of the {Actor, Addressee, Issue}. Time was
mar dashboard. This workflow is elaborated in the next by default the article s issue date and Location was option-
section. al information, recorded only when found in the text, oth-
erwise Greece was used as the default value.

4 Pilot Implementation 4.2 Data Collection


The above proposed methodology, was implemented in the 4.2.1 News Data
context of a project whose main goal was the mapping of The news dataset consists of articles published in two
protest events in Greece for the period 1996-2014. First, a Greek newspapers, Kathimerini6 and Avgi7; specifically, all
Codebook (section 4.1) for this category of events was articles of the Wednesday and Friday issues, from 1996 to
designed. Then, data were collected from newspapers and 2014. All the articles are in Greek and, for each one, along
Twitter (section 4.2), filtered (section 4.3), and analyzed with the text, section labels, headlines and the names of the
(section 4.4) accordingly. Finally, the generated Event da- authors were collected. The dataset comprises 540.989
tabase was internally and externally evaluated (section articles in total, 314.527 from Kathimerini and 226.462
4.6). from Avgi.
4.2.2 Twitter Data
4.1 Coding Schema The Twitter dataset consists of tweets from 2013 and 2014.
A coding framework for protest events entitled PROMAP All tweets posted in the Greek language were downloaded
was designed by the National Center of Social Research.
5
http://europub.wzb.eu/
6
http://www.kathimerini.gr/
4
https://gate.ac.uk/sale/tao/splitch8.html#chap:jape 7
http://www.avgi.gr/

137
making use of the twitter API and they were stored and a basic natural language processing workflow was applied
indexed together with their respective metadata (user in- producing a set of annotations. After that, the Event Analy-
formation, retweet information etc.). The total number of sis workflow executes two tasks in sequence. First, the
tweets processed was 166.100.543. structural components of the event template are detected,
and then linguistic rules that exploit all the previous anno-
4.3. Explorative Analysis tations attempt to link the right components to create the
Data exploration is an integral part of the methodology, event tuples with which the final Event Database is con-
since our approach is data-driven and incorporates human- structed. The system comprises a set of FST rules (Finite
in-the-loop. Consequently, expert users explored the bulk State Transducers) which exploit multiple linguistic anno-
of data gathered in order to determine the lexicalizations of tation streams already produced by previous tools. The
the various event types and their components. This part FST rules are ordered, forming a cascade, so that the out-
was crucial for filtering the data and clustering them into put of a transducer is given as input to the next. The work-
targeted collections. This was an interactive, iterative pro- flow of the event detection system is illustrated with the
cedure, since the exploration was driven by the Codebook following indicative example:
(see Section 4.1), and the latter was modified according to Employees at the movement of goods at the premises of
the findings from the data exploration. Cosco, in Piraeus, are on a strike since this morning, pro-
testing about working conditions.
4.3.1 News Data
As already mentioned, in the first stage, raw text is the
The main goal of the explorative analysis step was to better
input to the NLP pre-processing tools (Prokopidis, Geor-
understand and obtain a broader view of the whole dataset.
gantopoulos and Papageorgiou 2011) which produce anno-
To achieve this, a full text search application was devel-
tations for POS tags (e.g. noun), Lemmas, Chunks (e.g.
oped.
prepositional phrases), Dependency relations (e.g. Subject)
Sections of the newspapers articles that were considered
and Named Entities classified into four categories: Person,
irrelevant to the research scope were filtered out: the scope
Organization, Location and Facility. For example, in the
of the research was Greece, so international news were
above excerpt, NERC would recognize Cosco and label it
excluded. Moreover, sports or culture sections were con-
as Organization. The next module, Nominals Phase, ex-
sidered out of scope as well. This procedure limited the
ploits the POS tags, Lemmas and Chunks to annotate enti-
number of articles from both newspapers to 362.884.
ties that are not named, yet their nominal expression is
After that, analysts were able to make full-text queries,
present in the text. In the example, employees at the move-
select articles, examine them and save their search. Later
ment of goods would be annotated as Nominal. These enti-
on, in the data analysis phase, the queries and the docu-
ties (Persons, Organizations and Nominals), are assigned
ments that were marked as relevant were retrieved to con-
the label Candidate as they all are entities that can poten-
struct document clusters referring to one event type.
tially be Actor or Addressee. Time expressions and Issue
4.3.2 Twitter Data are located next and annotated as such. It is noteworthy
The exploration step for the twitter dataset was similar to that Issue, namely the subject matter of the protest, is heav-
that of the news data. The PALOMAR platform was used ily dependent on semantics. Consequently, patterns con-
in order to index the tweets and make the dataset available taining trigger words along with their syntactic comple-
to the analysts. Again, the ability to save queries which ments were used for its detection. In such a pattern, a trig-
were later used to make clusters of tweets on event types is ger word is protest and its syntactic complement a prep-
the core functionality of the interface. In addition, it pro- ositional phrase starting with about . Thus, in the exam-
vides enhanced functionalities for storage, processing and ple, working conditions would be recognized as an Issue.
visualization (see Section 2). Next, the element Form of action is detected. It is im-
portant to note that we address both verbal and nominal
4.4. Event Extraction event extraction, so all possible lexicalizations are record-
The overall protest events extraction framework is data- ed. Furthermore, every such annotation is attributed with
driven. For the news data, the approach is mainly based on certain features, in terms of the semantic information it
linguistic rules, while for Twitter data the special feature of conveys. In our example are on a strike is annotated as
hashtags is exploited. Both methods are described in the Form and given the attribute Strike so that it can be distin-
following sections. guished and later used in linguistic patterns. Finally, a set
of linguistic rules of shallow syntactic relations patterns are
4.4.1 News Data
implemented. These rules utilize the semantic information
In general, the rationale for extracting events from news
of events and their structural components combined with
data is bootstrapping, in terms of building over previously
syntactic information so that Candidate entities can be
produced annotations. In brief, at the pre-processing stage,
assigned an Actor or an Addressee label and link all the

138
components to create a tuple at the sentence level. In the ring to the specific event type, i.e. Strike. The next step
above example, the event tuple would be as follows: was to detect the Candidate entities and Location, based on
<Actor: Employees at the movement of goods, Form: predefined word lists as well. Finally, Candidates were
are on a strike, Addressee: Cosco, Issue: working condi- assigned the label Actor or Addressee with the same re-
tions, Time: 07/18/2014, Location: Piraeus> strictions used for the news data (see Section 4.4.1). All the
It is important to note that a small, random sample of extracted components populated the event tuples recorded
every dataset created in the phase of Data Exploration, was to the Twitter Event Database. For the Time element, the
used as a development corpus, in terms of testing and im- date the tweet was posted was used. This decision was
proving the linguistic patterns and rules against it. Moreo- made for two reasons: (a) due to the size limit of tweets,
ver, this corpus was annotated by humans three social time expressions are not very often, and (b) tweets are
scientists and used to evaluate and enhance the system s messages commenting current events.
performance before applying it to the whole news corpus. The seed terms for the detection of event instances con-
Post-processing: At the post-processing phase, the pro- sisted of the lexicalizations of strikes as derived from the
cessing window is expanded to the sentence where a Form news data analysis. In particular, the query terms from the
was detected 1. If a Form annotation has not been suc- Data Exploration, which were also an integral part of the
cessfully linked to an Actor and/or Addressee, this module Events Grammar, were converted into hashtags in three
attempts to find a Candidate entity and assign one of these different ways: a) in Greek, as found in newspapers, e.g.
labels to it, using rules and restrictions related to the fea- aperjía ( strike) b) translated, namely using
tures of the Candidate entity. For example, an entity that is the English word #strike, c) transliterated, that is convers-
categorized as Government, corresponds to the government ing the text from the Greek script to Latin, i.e. #apergia.
and its representatives and cannot be the Actor of a Strike Moreover, wildcards were used in order to discover com-
event type. pound hashtags containing terms in question, such as
4.4.2 Twitter Data #general_strike. In this way, the pool of tweets was filtered
Twitter as a dataset exhibits some completely different and only the ones referring to a strike were gathered into a
features compared to news. One of the most important is collection consisting of 4002 tweets. Nevertheless, this was
the limit of 140 characters, which enforces users to con- not sufficient for recording an event since according to
dense the message they want to communicate. In order to the Codebook apart from the Form at least one of the
do this, they use special elements such as hashtags, user other constituents (Actor, Addressee) is required for a
mentions or URLs. Additionally, words that do not influ- Claim to be recorded.
ence the general meaning (e.g. function words) are often In the next phase, the system detected the entities men-
omitted, making it hard for the standard NLP tools to pro- tioned in the tweets. To do this, the NERC s Organization
cess. Given that Twitter-specific NLP tools have not yet and Person Gazetteers were enriched with nominal terms
been fully developed for the Greek language, other ways of (see Section 4.4.1) that were used as search terms. Corre-
extracting information need to be used. Specifically, we spondingly to the Form annotation type, these lists were
followed a quite simple - yet effective- workflow of event also translated and transliterated, so as to cover all the pos-
extraction from Twitter that exploits a distinctive and sible variations. Eventually, by using simple rules, the
commonly used feature, namely the hashtags. events that were linked to at least one entity, viz. the Actor
The event type of Strike was the one we chose to exper- or the Addressee, were recorded to the event database
iment with in the Twitter dataset, due to its frequency in along with their structural constituents.
the news data. The method used for extracting such events The Twitter workflow is illustrated by the next example
from Twitter was semi-supervised. Our concern was to tweet.
make news and Twitter results comparable, therefore we #24-hour_strike #adedu tomorrow Tuesday 14 May
decided to extract similar event tuples, containing the same 2013 #athens http://t.co/zov0cl1ka0.
information types. Nonetheless, the Issue type was not The extracted event tuple would be:
addressed at this stage, because of the semantic and syntac- <Form: 24-hour_strike, Actor: adedu, Time: Tuesday
tic processing it requires. In any case, some first observa- 14 May 2013, Location: athens>
tions and hypotheses already made will be further explored
in future work. 4.5 Data Visualization
Similarly, to the news data event extraction, the goal The last step in data processing concerned the actual data
here is to first detect the different information types and visualization. For that, we have employed the Socioscope
then link the constituents to create an event tuple. Howev- platform8, an interactive web-based visual analytics plat-
er, the implementation is different as follows: in a first
stage, seed terms were used to filter out all the tweets refer-
8
www.socioscope.gr

139
form for social and political data that enables the analyst to A simple graph (Fig. 3) depicting the events reported in
explore and analyze social facts through a user-friendly GDELT event database and the ones recorded in the
visual interface. The Socioscope platform offers a variety PROMAP project, highlights a significant difference in
of interactive visualizations for different types of data, coverage, favoring PROMAP. This is the case through the
such as charts and histograms, pies and stacked diagrams whole timeline, with an exception of the period 2010-2012.
for numerical data, timelines for time series and choropleth
and point maps for geographical data. It is based on a mul-
tidimensional modelling approach and offers several visual
operations for data exploration and analysis, such as filter-
ing through faceted browsing, hierarchical representation
of coded lists in charts, free keyword search of literal val-
Figure 3: GDELT vs PROMAP results
ues, and capabilities for combination of different datasets
along common dimensions. Moreover, it enables the reus-
ability of knowledge by making all data available for
4.7 Results
download in various formats as well as in the form of Regarding the events extracted from the newspapers, quali-
Linked Open Data, which is a standard means for data tative analysis allowed some interesting readings. Among
sharing on the web, enabling citation and unique referenc- them, it is worth mentioning a correlation between election
ing across sites. years and number of events recorded. In particular, a nota-
ble decrease of the total number of claims is observed in
4.6 Evaluation election years. It would be interesting to verify if this is the
case for Twitter mentions as well, yet Twitter is a newly
An intrinsic and extrinsic evaluation of the events extrac-
emerged medium and there is no data before 2007.
tion system was conducted. In the intrinsic evaluation, the
One of our main goals was to correlate the events ex-
Strike event type and a specific month, viz. 2/2014 were
tracted from different sources, i.e. news and Twitter. For
selected. To measure the precision, we used the number of
this purpose, we created a timeline for both results at a
correct extracted instances. To calculate this number, we
monthly level. Figure 4 represents the results for 2013 and
counted as false positives tuples where a wrong event was
Figure 5 for 2014.
recognized (e.g. a hunger strike instead of a strike), or the
Actor/Addressee was conceptually erroneous, or for cases
where the tuple comprised only the event and an Issue
the Issue was wrong or incomplete. To measure the recall,
we manually annotated all of the articles from that period.
For Twitter, precision was estimated from the extracted
events, while for measuring the recall only the tweets con- Figure 4: News vs Twitter timeline for strikes (2013)
taining at least one of the seed terms were analyzed. The
evaluation results for both approaches are presented in the
table below.
News Twitter
Precision 90% 97.5%
Recall 93% 92% Figure 5: News vs Twitter timeline for strikes (2014)

Table 1: News and Twitter results evaluation Starting with the 2013 timelines, it was noted that event
mentions follow a similar pattern. More interesting obser-
An extrinsic evaluation was also conducted for the news vations come from the 2014 comparison: first, there is a
data. Precision was estimated by counting the correct rec- spike of strikes reported in Twitter in February, compared
ords of the whole Event base for the event category to newspapers; this holds almost throughout the year.
Strike . For recall, on the other hand, GDELT was used In an attempt to explain this inconsistency, and after tak-
as baseline. More specifically, we extracted an event data- ing a closer look to the results, we noted that more than
base from GDELT using the following criteria: the event 80% of the strike mentions in 2014 refer to one long-
code 143 (Strike and Boycott), for the period 1996-2014, lasting strike, undertaken by the employees of Coca Cola,
with the condition that at least one of the Event Location, when the latter decided to shut down the factory in Thessa-
Actor1 Country or Actor2 Country is Greece. This evalua- loniki. This strike lasted for more than a year and so do the
tion method resulted to a precision of 80%. GDELT fol- mentions of it in Twitter. What would be more interesting
lows a different coding framework from ours and matching is to look deeper into the users that post all these tweets
distinct events was out of scope for this study. and the ways they are connected to each other. A social

140
network analysis could reveal the extent to which there is and Walther and Kaisser (2013). In Weng et al. (2011),
coordination of the users towards a certain end. In sum, signals are built for each word in a tweet and then correlat-
comparing event mentions in news and Twitter data, is a ed to form a distinct event. Finally, in the context of protest
useful tool for journalists, since they can discover interest- events extraction, Zachary et al. (2015) examine mass pro-
ing aspects of events to write for. test that can lead to political changes at a country level,
using popular hashtags and measuring the extent to which
Twitter users coordination within social networks, can
5 Related Work cause collective action. The different scope and evaluation
Event extraction for political and social science has been a methodology of the above systems make it difficult to
long-standing topic, dating back to hand coding data. Work compare their performance. However, most of them report
on automatic annotation started within the KEDS/TABARI a precision ranging between 70 85%.
project (Shrodt, Shannon and Weddle 1994). Evaluations All of the above mentioned projects implement method-
have shown that hand coded and automatic events coding ologies varying from completely unsupervised, purely da-
show comparable performance (King and Lowe 2003). ta-driven approaches, to knowledge-driven methods based
Several coding schemes have been developed since, in- on domain experts. Hybrid frameworks have also been
cluding the IDEA (Bond et al. 2003) and ICEWS (O Brien used (Hogenboom et al. 2011). Our approach is a hybrid
2012). One of the most renown and influential frameworks method with human-in-the-loop.
for event extraction is CAMEO (Gerner et al. 2003), which Moreover, other research efforts have focused on devel-
is still used by the ongoing GDELT project9 (Leetaru and oping platforms for information extraction, gathering and
Shrodt 2013). All of these efforts have focused on news visualizing, addressing mainly to journalists. For instance,
data, that have traditionally been the main events source. in Marcus et al. (2011) TwitInfo is presented, viz. a system
Our codebook, PROMAP, follows the same principles with for identifying events in Twitter and visualizing them
a linguistically-driven implementation. Protest Events along with a sentiment aggregation of the corresponding
Analysis has been a central issue in the context of Political tweets. SocialSensor (Diplaris et al. 2012) also aims to
and Social sciences (Wueest, Rothenhäusler and Hutter provide users a tool for discovering trends, events and oth-
2013). Moreover, the use of event extraction for predictive er interesting information from online multimedia content.
analysis is a very challenging and far from trivial task, Diakopoulos, Choudhury and Naaman (2012) describe a
whose potential justifies its impact within the literature framework for processing tweets according to user queries
(Boschee, Natarajan and Weischedel 2013). and detecting eyewitnesses, while TweetGathering
Social media, and especially Twitter, have also been ex- (Zubiaga, Ji and Knight 2013) aspires to help journalists
tensively used for event extraction on many different topics discover news stories in Twitter.
and for various purposes such as, climate change (Olteanu
et al. 2015). The authors in (Wang, Fink and Agichtein
6 Conclusions
2015) discover social events, like concerts or conferences
and their structural components, i.e. location, time and title In this paper, an innovative platform for automated pro-
of the event, by linking the information from the tweet and cessing of data from different online sources is presented.
the embedded URL. In Becker, Naaman and Gravano The PALOMAR system allows journalists to analyze and
(2011), tweets are distinguished to event and non-event visualize various information types. Event extraction is
messages, with the first being clustered into topic catego- addressed first, for which a semi-supervised linguistically
ries. Ritter et al. (2012) extract event tuples from Twitter driven methodology is proposed. This methodology is im-
stream and classify them into topics, using an unsupervised plemented in a pilot task of extracting protest events along
approach. In Popescu, Pennacchiotti and Paranjpe (2011), with their structural components.
events concerning specific known entities are discovered As far as the pilot implementation on protest events is
and structured, using a supervised method to decide for the concerned, the evaluation showed significant results both
relevance of tweets. Li, Sun and Datta (2012) and Qin et al. for precision and recall. Also, results from the two differ-
(2013) both rely on text segments in order to detect and ent sources turned out to be comparable, in that protest
classify events in Twitter, with the first making use of events reports can be found both in newspapers as ex-
Wikipedia for the filtering of real events and the second pected and in Twitter. This could be quite helpful for
implementing feature clustering. Temporal and spatial in- journalists in verifying an event through multiple sources.
formation has also been used for identifying and categoriz- However, what is more interesting is the impact that cer-
ing events in Twitter, as in Parikh and Karlapalem (2013) tain events may have in social media. As mentioned Twit-
ter revealed a disproportionally large number of mentions
to a certain strike, which could be further examined in
9
http://gdeltproject.org/

141
terms of users network and whether there is coordination Li, C., Sun, A., Datta, A. 2012. Twevent: Segment-based event detection
and call for action. This analysis would be challenging, for from tweets. Proceedings of the 21st ACM international conference on
Information and knowledge management.
example, in cases of terrorism, attacks or social revolu-
Marcus, A., Bernstein, M. S., Badar, O., Karger, D. R., Madden, S. and
tions. Future work includes expanding the functionality of
Miller, R. C. 2011. Twitinfo: aggregating and visualizing microblogs for
the above presented platform. It is our intention to extend event exploration. In Proceedings of the SIGCHI conference on Human
our analysis to other analytics tasks, such as topic and quo- factors in computing systems.
tations analysis and thus enrich PALOMAR and equip O Brien, S. 2012. A multi-method approach for near real time conflict
journalists with an interactive, multifunctional toolkit for and crisis early warning. In Subrahmanian V. (ed) Handbook on computa-
real-time analysis of multi-source data streams. tional approaches to Counterterrorism.
Olteanu, A., Castillo, C., Diakopoulos, N., Aberer, K. 2015. Comparing
Events Coverage in Online News and Social Media: The Case of Climate
7 Acknowledgements Change. In Proceedings of the International Conference on Weblogs and
Social Media.
This work is partially funded by the projects EOX Parikh, R., Karlapalem, K., 2013. ET: Events from Tweets. In Proceed-
GR07/3712 and Research Programs for Excellence 2014- ings of International Conference on World Wide Web (WWW) (Compan-
2016 / CitySense-ATHENA R.I.C. ion Volume). pp. 613 620.
Popescu, A. M., Pennacchiotti, M., Paranjpe, D. A. 2011. Extracting
Events and Event Descriptions from Twitter. In Proceedings of Interna-
8 References tional Conference on World Wide Web.

Ananiadou, S., Pyysalo, S., Tsujii, J., Kell, D. B. 2010. Event extraction Prokopidis, P., Georgantopoulos, B. & Papageorgiou, H. 2011. A suite of
for systems biology by text mining the literature. Trends in Biotechnology NLP tools for Greek. In The 10th International Conference of Greek
Linguistics. Komotini, Greece.
28: 7, pp. 381 390.
Becker, H., Naaman, M., Gravano, L. 2011. Beyond Trending Topics: PROMAP Codebook. Forthcoming. In Stathopoulou T. (ed.) Trasfor-
Real-World Event Identification on Twitter. In Proceedings of the Inter- mations of protest in Greece. (provisional title). Papazisis publishers,
Athens.
national Conference on Weblogs and Social Media.
Bond, D., Bond, J., Oh, C., Jenkins, J., Taylor, C. 2003. Integrated Data Qin, Y., Zhang, Y., Zhang, M., Zheng, D. 2013. Feature-Rich Segment-
for Events Analysis (IDEA): An Event Typology for automated events Based News Event Detection on Twitter. In Proceedings of the Sixth
International Joint Conference on Natural Language Processing.
data development. Journal of Peace Research 40(6):733 745.
Boschee, E., Natarajan, P., Weischedel, R. 2013. Automatic Extraction of Ritter, A., Mausam, Etzioni, O., Clark, S. 2012. Open Domain Event
Events from Open Source Text for Predictive Forecasting. In Subrahma- Extraction from Twitter. In Proceedings of the international conference
on Knowledge Discovery and Data mining.
nian V. (ed) Handbook on computational approaches to Counterterror-
ism. New York: Springer. Schrodt, P. and Van Brackle, D. 2013. Automated Coding of Political
Diakopoulos, N., Choudhury, M. De, Naaman, M. 2012. Finding and Event Data. In Subrahmanian V. (ed) Handbook on computational ap-
proaches to Counterterrorism. New York: Springer.
assessing social media information sources in the context of journalism.
In Proceedings of the SIGCHI Conference on Human Factors in Compu- Shrodt, P., Shannon, D., Weddle, J. 1994. Political Science: KEDS-A
ting Systems. 2451 2460. New York, USA: ACM. Program for the Machine Coding of Event Data. In Social Science Com-
puter Review.
Diplaris, S., Papadopoulos, S., Kompatsiaris, I., Goker, A.S., MacFarlane,
A., Spangenberg, J., Hacid, H., Maknavicius, L. & Klusch, M. 2012. Walther, M., and Kaisser, M. 2013. Geo-spatial event detection in the
SocialSensor: Sensing User Generated Input for Improved Media Discov- twitter stream. In Advances in Information Retrieval. 356 367. Springer
ery and Experience. In: Proceedings of the 21st international conference Berlin Heidelberg.
companion on World Wide Web. Wang, Y., Fink, D., Agichtein, E. 2015. SEEFT: Planned Social Event
Gerner, D., Schrodt, P., Yilmaz, O., Abu-Jabr, R. 2002. Conflict and Discovery and Attribute Extraction by Fusing Twitter and Web Content.
Mediation Event Observations (CAMEO): A New Event Data Framework In Proceedings of the International Conference on Weblogs and Social
for the Analysis of Foreign Policy Interactions. Media.
Hogenboom, F., Frasincar, F., Kaymak, U., de Jong, F. 2011. An Over- Weng, J., Yao, Y., Leonardi, E., Lee, F. Event Detection in Twitter. HP
view of Event Extraction from Text. In M. van Erp, W. R. van Hage, L. Laboratories. HPL-2011-98.
Hollink, A. Jameson, and R. Troncy, (eds), Workshop on Detection, Rep- Wueest, B., Rothenhäusler, K., Hutter. S. 2013. Using Computational
resentation, and Exploitation of Events in the Semantic Web at Tenth Linguistics to Enhance Protest Event Analysis. In SSRN Electronic Jour-
International Semantic Web Conference. 779, 48 57. nal.
King, G., Lowe, W. 2003. An automated information extraction tool for Zachary, C. S., Mocanu, D., Vespignani, A., Fowler, J. 2015. Online
international conflict data with performance as good as human coders: a Social Networks and Offline Protest. EPG Data Science 4:19.
rare events evaluation design. International Organization 57(3):617-642.
Zubiaga, A., Ji, H. and Knight, K. 2013. Curating and contextualizing
Koopmans, R. 2002. Codebook for the analysis of political mobilisation twitter stories to assist with social newsgathering. In Proceedings of the
and communication in European public spheres. Available from: 2013 international conference on Intelligent user interfaces.
<http://europub.wzb.eu/codebooks.en.htm>.
Leetaru, K., Shrodt, P. 2013. GDELT: Global Data on Events, Language
and Tone, 1979 2012.

142

You might also like