You are on page 1of 64

Open Challenges for Data Stream Mining Research

Georg Krempl
Indre Zliobaite Dariusz Brzezinski

University Magdeburg, Germany Aalto University and HIIT, Finland Poznan U. of Technology, Poland
georg.krempl@iti.cs.uni-magdeburg.de indre.zliobaite@aalto.fi dariusz.brzezinski@cs.put.poznan.pl
Eyke Hullermeier
Mark Last Vincent Lemaire
University of Paderborn, Germany Ben-Gurion U. of the Negev, Israel Orange Labs, France
eyke@upb.de mlast@bgu.ac.il vincent.lemaire@orange.com
Tino Noack Ammar Shaker Sonja Sievi
TU Cottbus, Germany University of Paderborn, Germany Astrium Space Transportation, Germany
noacktin@tu-cottbus.de ammar.shaker@upb.de sonja.sievi@astrium.eads.net
Myra Spiliopoulou Jerzy Stefanowski
University Magdeburg, Germany Poznan U. of Technology, Poland
myra@iti.cs.uni-magdeburg.de jerzy.stefanowski@cs.put.poznan.pl

ABSTRACT the last decade, truly autonomous, self-maintaining, adaptive data


mining systems are rarely reported. This paper identifies real-world
Every day, huge volumes of sensory, transactional, and web data
challenges for data stream research that are important but yet un-
are continuously generated as streams, which need to be analyzed
solved. Our objective is to present to the community a position
online as they arrive. Streaming data can be considered as one
paper that could inspire and guide future research in data streams.
of the main sources of what is called big data. While predictive
This article builds upon discussions at the International Workshop
modeling for data streams and big data have received a lot of at-
on Real-World Challenges for Data Stream Mining (RealStream)1
tention over the last decade, many research approaches are typi-
in September 2013, in Prague, Czech Republic.
cally designed for well-behaved controlled problem settings, over-
looking important challenges imposed by real-world applications. Several related position papers are available. Dietterich [10] presents
This article presents a discussion on eight open challenges for data a discussion focused on predictive modeling techniques, that are
stream mining. Our goal is to identify gaps between current re- applicable to streaming and non-streaming data. Fan and Bifet [12]
search and meaningful applications, highlight open problems, and concentrate on challenges presented by large volumes of data. Zlio-
define new application-relevant research directions for data stream baite et al. [48] focus on concept drift and adaptation of systems
mining. The identified challenges cover the full cycle of knowledge during online operation. Gaber et al. [13] discuss ubiquitous data
discovery and involve such problems as: protecting data privacy, mining with attention to collaborative data stream mining. In this
dealing with legacy systems, handling incomplete and delayed in- paper, we focus on research challenges for streaming data inspired
formation, analysis of complex data, and evaluation of stream min- and required by real-world applications. In contrast to existing po-
ing algorithms. The resulting analysis is illustrated by practical sition papers, we raise issues connected not only with large vol-
applications and provides general suggestions concerning lines of umes of data and concept drift, but also such practical problems
future research in data stream mining. as privacy constraints, availability of information, and dealing with
legacy systems.
The scope of this paper is not restricted to algorithmic challenges,
1. INTRODUCTION it aims at covering the full cycle of knowledge discovery from data
The volumes of automatically generated data are constantly in- (CRISP [40]), from understanding the context of the task, to data
creasing. According to the Digital Universe Study [18], over 2.8ZB preparation, modeling, evaluation, and deployment. We discuss
of data were created and processed in 2012, with a projected in- eight challenges: making models simpler, protecting privacy and
crease of 15 times by 2020. This growth in the production of dig- confidentiality, dealing with legacy systems, stream preprocessing,
ital data results from our surrounding environment being equipped timing and availability of information, relational stream mining,
with more and more sensors. People carrying smart phones produce analyzing event data, and evaluation of stream mining algorithms.
data, database transactions are being counted and stored, streams of Figure 1 illustrates the positioning of these challenges in the CRISP
data are extracted from virtual environments in the form of logs or cycle. Some of these apply to traditional (non-streaming) data min-
user generated content. A significant part of such data is volatile, ing as well, but they are critical in streaming environments. Along
which means it needs to be analyzed in real time as it arrives. Data with further discussion of these challenges, we present our position
stream mining is a research field that studies methods and algo- where the forthcoming focus of research and development efforts
rithms for extracting knowledge from volatile streaming data [14; should be directed to address these challenges.
5; 1]. Although data streams, online learning, big data, and adapta- In the remainder of the article, section 2 gives a brief introduction to
tion to concept drift have become important research topics during data stream mining, sections 37 discuss each identified challenge,
and section 8 highlights action points for future research.

1
http://sites.google.com/site/realstream2013

SIGKDD Explorations Volume 16, Issue 1 Page 1


 
  
   singular or recurring contexts: In the former case, a model
 becomes obsolete once and for all when its context is re-
 
 placed by a novel context. In the latter case, a models con-
 

    text might reoccur at a later moment in time, for example due

 
 to a business cycle or seasonality, therefore, obsolete models

 might still regain value.
 
 
 
   systematic or unsystematic: In the former case, there are
patterns in the way the distributions change that can be ex-
  

 
 
 ploited to predict change and perform faster model adapta-
  

 
   

 
 tion. Examples are subpopulations that can be identified and
show distinct, trackable evolutionary patterns. In the latter



case, no such patterns exist and drift occurs seemingly at ran-
 


 
  dom. An example for the latter is fickle concept drift.



 !  real or virtual: While the former requires model adaptation,




the latter corresponds to observing outliers or noise, which
 

 
 

    

should not be incorporated into a model.

 " 
Stream mining approaches in general address the challenges posed

$    

  by volume, velocity and volatility of data. However, in real-world


 applications these three challenges often coincide with other, to


   #
 date insufficiently considered ones.
 

The next sections discuss eight identified challenges for data stream
mining, providing illustrations with real world application exam-
Figure 1: CRISP cycle with data stream research challenges. ples, and formulating suggestions for forthcoming research.

3. PROTECTING PRIVACY AND CONFI-


2. DATA STREAM MINING
Mining big data streams faces three principal challenges: volume,
DENTIALITY
velocity, and volatility. Volume and velocity require a high volume Data streams present new challenges and opportunities with respect
of data to be processed in limited time. Starting from the first arriv- to protecting privacy and confidentiality in data mining. Privacy
ing instance, the amount of available data constantly increases from preserving data mining has been studied for over a decade (see.
zero to potentially infinity. This requires incremental approaches e.g. [3]). The main objective is to develop such data mining tech-
that incorporate information as it becomes available, and online niques that would not uncover information or patterns which com-
processing if not all data can be kept [15]. Volatility, on the other promise confidentiality and privacy obligations. Modeling can be
hand, corresponds to a dynamic environment with ever-changing done on original or anonymized data, but when the model is re-
patterns. Here, old data is of limited use, even if it could be saved leased, it should not contain information that may violate privacy
and processed again later. This is due to change, that can affect the or confidentiality. This is typically achieved by controlled distor-
induced data mining models in multiple ways: change of the target tion of sensitive data by modifying the values or adding noise.
variable, change in the available feature information, and drift. Ensuring privacy and confidentiality is important for gaining trust
Changes of the target variable occur for example in credit scor- of the users and the society in autonomous, stream data mining
ing, when the definition of the classification target default versus systems. While in offline data mining a human analyst working
non-default changes due to business or regulatory requirements. with the data can do a sanity check before releasing the model, in
Changes in the available feature information arise when new fea- data stream mining privacy preservation needs to be done online.
tures become available, e.g. due to a new sensor or instrument. Several existing works relate to privacy preservation in publishing
Similarly, existing features might need to be excluded due to regu- streaming data (e.g. [46]), but no systematic research in relation to
latory requirements, or a feature might change in its scale, if data broader data stream challenges exists.
from a more precise instrument becomes available. Finally, drift is We identify two main challenges for privacy preservation in mining
a phenomenon that occurs when the distributions of features x and data streams. The first challenge is incompleteness of information.
target variables y change in time. The challenge posed by drift has Data arrives in portions and the model is updated online. There-
been subject to extensive research, thus we provide here solely a fore, the model is never final and it is difficult to judge privacy
brief categorization and refer to recent surveys like [17]. preservation before seeing all the data. For example, suppose GPS
In supervised learning, drift can affect the posterior P (y|x), the traces of individuals are being collected for modeling traffic situa-
conditional feature P (x|y), the feature P (x) and the class prior tion. Suppose person A at current time travels from the campus to
P (y) distribution. The distinction based on which distribution is the airport. The privacy of a person will be compromised, if there
assumed to be affected, and which is assumed to be static, serves to are no similar trips by other persons in the very near future. How-
assess the suitability of an approach for a particular task. It is worth ever, near future trips are unknown at the current time, when the
noting, that the problem of changing distributions is also present in model needs to be updated.
unsupervised learning from data streams. On the other hand, data stream mining algorithms may have some
A further categorization of drift can be made by: inherent privacy preservation properties due to the fact that they do
not need to see all the modeling data at once, and can be incremen-
smoothness of concept transition: Transitions between con- tally updated with portions of data. Investigating privacy preser-
cepts can be sudden or gradual. The former is sometimes also vation properties of existing data stream algorithms makes another
denoted in literature as shift or abrupt drift. interesting direction for future research.

SIGKDD Explorations Volume 16, Issue 1 Page 2


The second important challenge for privacy preservation is concept actions before feeding the newest data to the predictive models.
drift. As data may evolve over time, fixed privacy preservation The problem of preprocessing for data streams is challenging due to
rules may no longer hold. For example, suppose winter comes, the challenging nature of the data (continuously arriving and evolv-
snow falls, and much less people commute by bike. By knowing ing). An analyst cannot know for sure, what kind of data to expect
that a person comes to work by bike and having a set of GPS traces, in the future, and cannot deterministically enumerate possible ac-
it may not be possible to identify this person uniquely in summer, tions. Therefore, not only models, but also the procedure itself
when there are many cyclists, but possible in winter. Hence, an im- needs to be fully automated.
portant direction for future research is to develop adaptive privacy This research problem can be approached from several angles. One
preservation mechanisms, that would diagnose such a situation and way is to look at existing predictive models for data streams, and
adapt themselves to preserve privacy in the new circumstances. try to integrate them with selected data preprocessing methods (e.g.
feature selection, outlier definition and removal).
Another way is to systematically characterize the existing offline
4. STREAMED DATA MANAGEMENT data preprocessing approaches, try to find a mapping between those
Most of the data stream research concentrates on developing pre- approaches and problem settings in data streams, and extend pre-
dictive models that address a simplified scenario, in which data is processing approaches for data streams in such a way as traditional
already pre-processed, completely and immediately available for predictive models have been extended for data stream settings.
free. However, successful business implementations depend strongly In either case, developing individual methods and methodology for
on the alignment of the used machine learning algorithms with preprocessing of data streams would bridge an important gap in the
both, the business objectives, and the available data. This section practical applications of data stream mining.
discusses often omitted challenges connected with streaming data.

4.1 Streamed Preprocessing 4.2 Timing and Availability of Information


Data preprocessing is an important step in all real world data anal- Most algorithms developed for evolving data streams make simpli-
ysis applications, since data comes from complex environments, fying assumptions on the timing and availability of information. In
may be noisy, redundant, contain outliers and missing values. Many particular, they assume that information is complete, immediately
standard procedures for preprocessing offline data are available and available, and received passively and for free. These assumptions
well established, see e.g. [33]; however, the data stream setting in- often do not hold in real-world applications, e.g., patient monitor-
troduces new challenges that have not received sufficient research ing, robot vision, or marketing [43]. This section is dedicated to the
attention yet. discussion of these assumptions and the challenges resulting from
their absence. For some of these challenges, corresponding situ-
While in traditional offline analysis data preprocessing is a once-off
ations in offline, static data mining have already been addressed
procedure, usually done by a human expert prior to modeling, in the
in literature. We will briefly point out where a mapping of such
streaming scenario manual processing is not feasible, as new data
known solutions to the online, evolving stream setting is easily fea-
continuously arrives. Streaming data needs fully automated pre-
sible, for example by applying windowing techniques. However,
processing methods, that can optimize the parameters and operate
we will focus on problems for which no such simple mapping ex-
autonomously. Moreover, preprocessing models need to be able to
ists and which are therefore open challenges in stream mining.
update themselves automatically along with evolving data, in a sim-
ilar way as predictive models for streaming data do. Furthermore,
all updates of preprocessing procedures need to be synchronized 4.2.1 Handling Incomplete Information
with the subsequent predictive models, otherwise after an update in Completeness of information assumes that the true values of all
preprocessing the data representation may change and, as a result, variables, that is of features and of the target, are revealed eventu-
the previously used predictive model may become useless. ally to the mining algorithm.
Except for some studies, mainly focusing on feature construction The problem of missing values, which corresponds to incomplete-
over data streams, e.g. [49; 4], no systematic methodology for data ness of features, has been discussed extensively for the offline,
stream preprocessing is currently available. static settings. A recent survey is given in [45]. However, only few
As an illustrative example for challenges related to data preprocess- works address data streams, and in particular evolving data streams.
ing, consider predicting traffic jams based on mobile sensing data. Thus several open challenges remain, some are pointed out in the
People using navigation services on mobile devices can opt to send review by [29]: how to address the problem that the frequency in
anonymized data to the service provider. Service providers, such as which missing values occur is unpredictable, but largely affects the
Google, Yandex or Nokia, provide estimations and predictions of quality of imputations? How to (automatically) select the best im-
traffic jams based on this data. First, the data of each user is mapped putation technique? How to proceed in the trade-off between speed
to the road network, the speed of each user on each road segment and statistical accuracy?
of the trip is computed, data from multiple users is aggregated, and Another problem is that of missing values of the target variable. It
finally the current speed of the traffic is estimated. has been studied extensively in the static setting as semi-supervised
There are a lot of data preprocessing challenges associated with learning (SSL, see [11]). A requirement for applying SSL tech-
this task. First, noisiness of GPS data might vary depending on niques to streams is the availability of at least some labeled data
location and load of the telecommunication network. There may from the most recent distribution. While first attempts to this prob-
be outliers, for instance, if somebody stopped in the middle of a lem have been made, e.g. the online manifold regularization ap-
segment to wait for a passenger, or a car broke. The number of proach in [19] and the ensembles-based approach suggested by
pedestrians using mobile navigation may vary, and require adaptive [11], improvements in speed and the provision of performance guar-
instance selection. Moreover, road networks may change over time, antees remain open challenges. A special case of incomplete infor-
leading to changes in average speeds, in the number of cars and mation is censored data in Event History Analysis (EHA), which
even car types (e.g. heavy trucks might be banned, new optimal is described in section 5.2. A related problem discussed below is
routes emerge). All these issues require automated preprocessing active learning (AL, see [38]).

SIGKDD Explorations Volume 16, Issue 1 Page 3


4.2.2 Dealing with Skewed Distributions uncertainty regarding convergence: in contrast to learning
Class imbalance, where the class prior probability of the minor- in static contexts, due to drift there is no guarantee that with
ity class is small compared to that of the majority class, is a fre- additional labels the difference between model and reality
quent problem in real-world applications like fraud detection or narrows down. This leaves the formulation of suitable stop
credit scoring. This problem has been well studied in the offline criteria a challenging open issue.
setting (see e.g. [22] for a recent book on that subject), and has
necessity of perpetual validation: even if there has been
also been studied to some extent in the online, stream-based setting
convergence due to some temporary stability, the learned hy-
(see [23] for a recent survey). However, among the few existing
potheses can get invalidated at any time by subsequent drift.
stream-based approaches, most do not pay attention to drift of the
This can affect any part of the feature space and is not nec-
minority class, and as [23] pointed out, a more rigorous evaluation
essarily detectable from unlabeled data. Thus, without per-
of these algorithms on real-world data needs yet to be done.
petual validation the mining algorithm might lock itself to a
wrong hypothesis without ever noticing.
4.2.3 Handling Delayed Information
temporal budget allocation: the necessity of perpetual vali-
Latency means information becomes available with significant de-
dation raises the question of optimally allocating the labeling
lay. For example, in the case of so-called verification latency, the
budget over time.
value of the preceding instances target variable is not available be-
fore the subsequent instance has to be predicted. On evolving data performance bounds: in the case of drifting posteriors, no
streams, this is more than a mere problem of streaming data inte- theoretical work exists that provides bounds for errors and
gration between feature and target streams, as due to concept drift label requests. However, deriving such bounds will also re-
patterns show temporal locality [2]. It means that feedback on the quire assuming some type of systematic drift.
current prediction is not available to improve the subsequent pre-
dictions, but only eventually will become available for much later The task of active feature acquisition, where one has to actively
predictions. Thus, there is no recent sample of labeled data at all select among costly features, constitutes another open challenge on
that would correspond to the most-recent unlabeled data, and semi- evolving data streams: in contrast to the static, offline setting, the
supervised learning approaches are not directly applicable. value of a feature is likely to change with its drifting distribution.
A related problem in static, offline data mining is that addressed
by unsupervised transductive transfer learning (or unsupervised do- 5. MINING ENTITIES AND EVENTS
main adaptation): given labeled data from a source domain, a pre-
Conventional stream mining algorithms learn over a single stream
dictive model is sought for a related target domain in which no
of arriving entities. In subsection 5.1, we introduce the paradigm
labeled data is available. In principle, ideas from transfer learning
of entity stream mining, where the entities constituting the stream
could be used to address latency in evolving data streams, for ex-
are linked to instances (structured pieces of information) from fur-
ample by employing them in a chunk-based approach, as suggested
ther streams. Model learning in this paradigm involves the incor-
in [43]. However, adapting them for use in evolving data streams
poration of the streaming information into the stream of entities;
has not been tried yet and constitutes a non-trivial, open task, as
learning tasks include cluster evolution, migration of entities from
adaptation in streams must be fast and fully automated and thus
one state to another, classifier adaptation as entities re-appear with
cannot rely on iterated careful tuning by human experts.
another label than before.
Furthermore, consecutive chunks constitute several domains, thus
Then, in subsection 5.2, we investigate the special case where en-
the transitions between several subsequent chunks might provide
tities are associated with the occurrence of events. Model learning
exploitable patterns of systematic drift. This idea has been in-
then implies identifying the moment of occurrence of an event on
troduced in [27], and a few so-called drift-mining algorithms that
an entity. This scenario might be seen as a special case of entity
identify and exploit such patterns have been proposed since then.
stream mining, since an event can be seen as a degenerate instance
However, the existing approaches cover only a very limited set of
consisting of a single value (the events occurrence).
possible drift patterns and scenarios.
5.1 Entity Stream Mining
4.2.4 Active Selection from Costly Information Let T be a stream of entities, e.g. customers of a company or pa-
The challenge of intelligently selecting among costly pieces of in- tients of a hospital. We observe entities over time, e.g. on a com-
formation is the subject of active learning research. Active stream- panys website or at a hospital admission vicinity: an entity appears
based selective sampling [38] describes a scenario, in which in- and re-appears at discrete time points, new entities show up. At a
stances arrive one-by-one. While the instances feature vectors are time point t, an entity e T is linked with different pieces of in-
provided for free, obtaining their true target values is costly, and the formation - the purchases and ratings performed by a customer, the
definitive decision whether or not to request this target value must anamnesis, the medical tests and the diagnosis recorded for the pa-
be taken before proceeding to the next instance. This corresponds tient. Each of these information pieces ij (t) is a structured record
to a data stream, but not necessarily to an evolving one. As a result, or an unstructured text from a stream Tj , linked to e via the foreign
only a small subset of stream-based selective sampling algorithms key relation. Thus, the entities in T are in 1-to-1 or 1-to-n relation
is suited for non-stationary environments. To make things worse, with entities from further streams T1 , . . . , Tm (stream of purchases,
many contributions do not state explicitly whether they were de- stream of ratings, stream of complaints etc). The schema describ-
signed for drift, neither do they provide experimental evaluations ing the streams T, T1 , . . . , Tm can be perceived as a conventional
on such evolving data streams, thus leaving the reader the ardu- relational schema, except that it describes streams instead of static
ous task to assess their suitability for evolving streams. A first, re- sets.
cent attempt to provide an overview on the existing active learning In this relational setting, the entity stream mining task corresponds
strategies for evolving data streams is given in [43]. The challenges to learning a model T over T , thereby incorporating information
for active learning posed by evolving data streams are: from the adjoint streams T1 , . . . , Tm that feed the entities in T .

SIGKDD Explorations Volume 16, Issue 1 Page 4


Albeit the members of each stream are entities, we use the term entities, thus the challenges pertinent to stream mining also apply
entity only for stream T the target of learning, while we denote here. One of these challenges, and one much discussed in the con-
the entities in the other streams as instances. In the unsupervised text of big data, is volatility. In relational stream mining, volatility
setting, entity stream clustering encompasses learning and adapting refers to the entity itself, not only to the stream of instances that
clusters over T , taking account the other streams that arrive at dif- reference the entities. Finally, an entity is ultimately big data by
ferent speeds. In the supervised setting, entity stream classification itself, since it is described by multiple streams. Hence, next to the
involves learning and adapting a classifier, notwithstanding the fact problem of dealing with new forms of learning and new aspects of
that an entitys label may change from one time point to the next, drift, the subject of efficient learning and adaption in the Big Data
as new instances referencing it arrive. context becomes paramount.

5.1.1 Challenges of Aggregation 5.2 Analyzing Event Data


The first challenge of entity stream mining task concerns informa- Events are an example for data that occurs often yet is rarely ana-
tion summarization: how to aggregate into each entity e at each lyzed in the stream setting. In static environments, events are usu-
time point t the information available on it from the other streams? ally studied through event history analysis (EHA), a statistical me-
What information should be stored for each entity? How to deal thod for modeling and analyzing the temporal distribution of events
with differences in the speeds of the individual streams? How to related to specific objects in the course of their lifetime [9]. More
learn over the streams efficiently? Answering these questions in a specifically, EHA is interested in the duration before the occurrence
seamless way would allow us to deploy conventional stream mining of an event or, in the recurrent case (where the same event can oc-
methods for entity stream mining after aggregation. cur repeatedly), the duration between two events. The notion of
The information referencing a relational entity cannot be held per- an event is completely generic and may indicate, for example, the
petually for learning, hence aggregation of the arriving streams is failure of an electrical device. The method is perhaps even better
necessary. Information aggregation over time-stamped data is tra- known as survival analysis, a term that originates from applications
ditionally practiced in document stream mining, where the objec- in medicine, in which an event is the death of a patient and survival
tive is to derive and adapt content summaries on learned topics. time is the time period between the beginning of the study and the
Content summarization on entities, which are referenced in the doc- occurrence of this event. EHA can also be considered as a special
ument stream, is studied by Kotov et al., who maintain for each case of entity stream mining described in section 5.1, because the
entity the number of times it is mentioned in the news [26]. basic statistical entities in EHA are monitored objects (or subjects),
In such studies, summarization is a task by itself. Aggregation of typically described in terms of feature vectors x Rn , together
information for subsequent learning is a bit more challenging, be- with their survival time s. Then, the goal is to model the depen-
cause summarization implies information loss - notably informa- dence of s on x. A corresponding model provides hints at possible
tion about the evolution of an entity. Hassani and Seidl monitor cause-effect relationships (e.g., what properties tend to increase a
health parameters of patients, modeling the stream of recordings patients survival time) and, moreover, can be used for predictive
on a patient as a sequence of events [21]: the learning task is then purposes (e.g., what is the expected survival time of a patient).
to predict forthcoming values. Aggregation with selective forget- Although one might be tempted to approach this modeling task as
ting of past information is proposed in [25; 42] in the classification a standard regression problem with input (regressor) x and out-
context: the former method [25] slides a window over the stream, put (response) s, it is important to notice that the survival time s
while the latter [42] forgets entities that have not appeared for a is normally not observed for all objects. Indeed, the problem of
while, and summarizes the information in frequent itemsets, which censoring plays an important role in EHA and occurs in different
are then used as new features for learning. facets. In particular, it may happen that some of the objects sur-
vived till the end of the study at time tend (also called the cut-off
5.1.2 Challenges of Learning point). They are censored or, more specifically, right censored,
Even if information aggregation over the streams T1 , . . . , Tm is since tevent has not been observed for them; instead, it is only
performed intelligently, entity stream mining still calls for more known that tevent > tend . In snapshot monitoring [28], the data
than conventional stream mining methods. The reason is that enti- stream may be sampled multiple times, resulting in a new cut-off
ties of stream T re-appear in the stream and evolve. In particular, point for each snapshot. Unlike standard regression analysis, EHA
in the unsupervised setting, an entity may be linked to conceptu- is specifically tailored for analyzing event data of that kind. It is
ally different instances at each time point, e.g. reflecting a cus- built upon the hazard function as a basic mathematical tool.
tomers change in preferences. In the supervised setting, an entity
may change its label; for example, a customers affinity to risk may 5.2.1 Survival function and hazard rate
change in response to market changes or to changes in family sta- Suppose the time of occurrence of the next event (since the start or
tus. This corresponds to entity drift, i.e. a new type of drift beyond the last event) for an object x is modeled as a real-valued random
the conventional concept drift pertaining to model T . Hence, how variable T with probability density function f ( | x). The hazard
should entity drift be traced, and how should the interplay between function or hazard rate h( | x) models the propensity of the occur-
entity drift and model drift be captured? rence of an event, that is, the marginal probability of an event to
In the unsupervised setting, Oliveira and Gama learn and monitor occur at time t, given that no event has occurred so far:
clusters as states of evolution [32], while [41] extend that work to
f (t | x) f (t | x)
learn Markov chains that mark the entities evolution. As pointed h(t | x) = = ,
out in [32], these states are not necessarily predefined they must S(t | x) 1 F (t | x)
be subject of learning. In [43], we report on further solutions to where S( | x) is the survival function and F ( | x) the cumulative
the entity evolution problem and to the problem of learning with distribution of f ( | x). Thus,
forgetting over multiple streams and over the entities referenced by
 t
them.
Conventional concept drift also occurs when learning a model over F (t | x) = P(T t) = f (u | x) du
0

SIGKDD Explorations Volume 16, Issue 1 Page 5


is the probability of an event to occur before time t. Correspond- Dealing with model changes of that kind is clearly an important
ingly, S(t | x) = 1 F (t | x) is the probability that the event did challenge for event analysis on data streams. Although the problem
not occur until time t (the survival probability). It can hence be is to some extent addressed by the works mentioned above, there
used to model the probability of the right-censoring of the time for is certainly scope for further improvement, and for using these ap-
an event to occur. proaches to derive predictive models from censored data. Besides,
A simple example is the Cox proportional hazard model [9], in there are many other directions for future work. For example, since
which the hazard rate is constant over time; thus, it does depend the detection of events is a main prerequisite for analyzing them,
on the feature vector x = (x1 , . . . , xn ) but not on time t. More the combination of EHA with methods for event detection [36] is
specifically, the hazard rate is modeled as a log-linear function of an important challenge. Indeed, this problem is often far from triv-
the features xi : ial, and in many cases, events (such as frauds, for example) can only
  be detected with a certain time delay; dealing with delayed events
h(t | x) = (x) = exp x is therefore another important topic, which was also discussed in
The model is proportional in the sense that increasing xi by one section 4.2.
unit increases the hazard rate (x) by a factor of i = exp(i ).
For this model, one easily derives the survival function S(t | x) =
1 exp((x) t) and an expected survival time of 1/(x). 6. EVALUATION OF DATA STREAM AL-
GORITHMS
5.2.2 EHA on data streams All of the aforementioned challenges are milestones on the road to
Although the temporal nature of event data naturally fits the data better algorithms for real-world data stream mining systems. To
stream model and, moreover, event data is naturally produced by verify if these challenges are met, practitioners need tools capa-
many data sources, EHA has been considered in the data stream ble of evaluating newly proposed solutions. Although in the field
scenario only very recently. In [39], the authors propose a method of static classification such tools exist, they are insufficient in data
for analyzing earthquake and Twitter data, namely an extension of stream environments due to such problems as: concept drift, lim-
the above Cox model based on a sliding window approach. The ited processing time, verification latency, multiple stream struc-
authors of [28] modify standard classification algorithms, such as tures, evolving class skew, censored data, and changing misclassi-
decision trees, so that they can be trained on a snapshot stream of fication costs. In fact, the myriad of additional complexities posed
both censored and non-censored data. by data streams makes algorithm evaluation a highly multi-criterial
Like in the case of clustering [35], where one distinguishes between task, in which optimal trade-offs may change over time.
clustering observations and clustering data sources, two different Recent developments in applied machine learning [6] emphasize
settings can be envisioned for EHA on data streams: the importance of understanding the data one is working with and
using evaluation metrics which reflect its difficulties. As men-
1. In the first setting, events are generated by multiple data sources tioned before, data streams set new requirements compared to tra-
(representing monitored objects), and the features pertain to ditional data mining and researchers are beginning to acknowl-
these sources; thus, each data source is characterized by a edge the shortcomings of existing evaluation metrics. For exam-
feature vector x and produces a stream of (recurrent) events. ple, Gama et al. [16] proposed a way of calculating classification
For example, data sources could be users in a computer net- accuracy using only the most recent stream examples, therefore al-
work, and an event occurs whenever a user sends an email. lowing for time-oriented evaluation and aiding concept drift detec-
2. In the second setting, events are produced by a single data tion. Methods which test the classifiers robustness to drifts and
source, but now the events themselves are characterized by noise on a practical, experimental level are also starting to arise
features. For example, events might be emails sent by an [34; 47]. However, all these evaluation techniques focus on sin-
email server, and each email is represented by a certain set gle criteria such as prediction accuracy or robustness to drifts, even
of properties. though data streams make evaluation a constant trade-off between
several criteria [7]. Moreover, in data stream environments there is
Statistical event models on data streams can be used in much the a need for more advanced tools for visualizing changes in algorithm
same way as in the case of static data. For example, they can serve predictions with time.
predictive purposes, i.e., to answer questions such as How much The problem of creating complex evaluation methods for stream
time will elapse before the next email arrives? or What is the mining algorithms lies mainly in the size and evolving nature of
probability to receive more than 100 emails within the next hour?. data streams. It is much more difficult to estimate and visualize,
What is specifically interesting, however, and indeed distinguishes for example, prediction accuracy if evaluation must be done on-
the data stream setting from the static case, is the fact that the model line, using limited resources, and the classification task changes
may change over time. This is a subtle aspect, because the hazard with time. In fact, the algorithms ability to adapt is another as-
model h(t | x) itself may already be time-dependent; here, how- pect which needs to be evaluated, although information needed to
ever, t is not the absolute time but the duration time, i.e., the time perform such evaluation is not always available. Concept drifts are
elapsed since the last event. A change of the model is compara- known in advance mainly when using synthetic or benchmark data,
ble to concept drift in classification, and means that the way in while in more practical scenarios occurrences and types of concepts
which the hazard rate depends on time t and on the features xi are not directly known and only the label of each arriving instance
changes over time. For example, consider the event increase of is known. Moreover, in many cases the task is more complicated, as
a stock rate and suppose that i = log(2) for the binary feature labeling information is not instantly available. Other difficulties in
xi = energy sector in the above Cox model (which, as already evaluation include processing complex relational streams and cop-
mentioned, does not depend on t). Thus, this feature doubles the ing with class imbalance when class distributions evolve with time.
hazard rate and hence halves the expected duration between two Finally, not only do we need measures for evaluating single aspects
events. Needless to say, however, this influence may change over of stream mining algorithms, but also ways of combining several of
time, depending on how well the energy sector is doing. these aspects into global evaluation models, which would take into

SIGKDD Explorations Volume 16, Issue 1 Page 6


account expert knowledge and user preferences. 7.1 Making models simpler, more reactive, and
Clearly, evaluation of data stream algorithms is a fertile ground more specialized
for novel theoretical and algorithmic solutions. In terms of pre- In this subsection, we discuss aspects like the simplicity of a model,
diction measures, data stream mining still requires evaluation tools its proper combination of offline and online components, and its
that would be immune to class imbalance and robust to noise. In customization to the requirements of the application domain. As
our opinion, solutions to this problem should involve not only met- an application example, consider the French Orange Portal2 , which
rics based on relative performance to baseline (chance) classifiers, registers millions of visits daily. Most of these visitors are only
but also graphical measures similar to PR-curves or cost curves. known through anonymous cookie IDs. For all of these visitors,
Furthermore, there is a need for integrating information about con- the portal has the ambition to provide specific and relevant contents
cept drifts in the evaluation process. As mentioned earlier, possible as well as printing ads for targeted audiences. Using information
ways of considering concept drifts will depend on the information about visits on the portal the questions are: what part of the portal
that is available. If true concepts are known, algorithms could be does each cookie visit, and when and which contents did it consult,
evaluated based on: how often they detect drift, how early they de- what advertisement was sent, when (if) was it clicked. All this in-
tect it, how they react to it, and how quickly they recover from it. formation generates hundreds of gigabytes of data each week. A
Moreover, in this scenario, evaluation of an algorithm should be user profiling system needs to have a back end part to preprocess
dependent on whether it takes place during drift or during times of the information required at the input of a front end part, which will
concept stability. A possible way of tackling this problem would be compute appetency to advertising (for example) using stream min-
the proposal of graphical methods, similar to ROC analysis, which ing techniques (in this case a supervised classifier). Since the ads
would work online and visualize concept drift measures alongside to print change regularly, based on marketing campaigns, the ex-
prediction measures. Additionally, these graphical measures could tensive parameter tuning is infeasible as one has to react quickly to
take into account the state of the stream, for example, its speed, change. Currently, these tasks are either solved using bandit meth-
number of missing values, or class distribution. Similar methods ods from game theory [8], which impairs adaptation to drift, or
could be proposed for scenarios where concepts are not known in done offline in big data systems, resulting in slow reactivity.
advance, however, in these cases measures should be based on drift
detectors or label-independent stream statistics. Above all, due to 7.1.1 Minimizing parameter dependence
the number of aspects which need to be measured, we believe that Adaptive predictive systems are intrinsically parametrized. In most
the evaluation of data stream algorithms requires a multi-criterial of the cases, setting these parameters, or tuning them is a difficult
view. This could be done by using inspirations from multiple crite- task, which in turn negatively affects the usability of these systems.
ria decision analysis, where trade-offs between criteria are achieved Therefore, it is strongly desired for the system to have as few user
using user-feedback. In particular, a user could showcase his/her adjustable parameters as possible. Unfortunately, the state of the
criteria preferences (for example, between memory consumption, art does not produce methods with trustworthy or easily adjustable
accuracy, reactivity, self-tuning, and adaptability) by deciding be- parameters. Moreover, many predictive modeling methods use a
tween alternative algorithms for a given data stream. It is worth lot of parameters, rendering them particularly impractical for data
noticing that such a multi-criterial view on evaluation is difficult to stream applications, where models are allowed to evolve over time,
encapsulate in a single number, as it is usually done in traditional and input parameters often need to evolve as well.
offline learning. This might suggest that researchers in this area
The process of predictive modeling encompasses fitting of parame-
should turn towards semi-qualitative and semi-quantitative evalua-
ters on a training dataset and subsequently selecting the best model,
tion, for which systematic methodologies should be developed.
either by heuristics or principled methods. Recently, model selec-
Finally, a separate research direction involves rethinking the way tion methods have been proposed that do not require internal cross-
we test data stream mining algorithms. The traditional train, cross- validation, but rather use the Bayesian machinery to design regu-
validate, test workflow in classification is not applicable for sequen- larizers with data dependent priors [20]. However, they are not yet
tial data, which makes, for instance, parameter tuning much more applicable in data streams, as their computational time complexity
difficult. Similarly, ground truth verification in unsupervised learn- is too high and they require all examples to be kept in memory.
ing is practically impossible in data stream environments. With
these problems in mind, it is worth stating that there is still a short- 7.1.2 Combining offline and online models
age of real and synthetic benchmark datasets. Such a situation
Online and offline learning are mostly considered as mutually ex-
might be a result of non-uniform standards for testing algorithms on
clusive, but it is their combination that might enhance the value
streaming data. As community, we should decide on such matters
of data the most. Online learning, which processes instances one-
as: What characteristics should benchmark datasets have? Should
by-one and builds models incrementally, has the virtue of being
they have prediction tasks attached? Should we move towards on-
fast, both in the processing of data and in the adaptation of mod-
line evaluation tools rather than datasets? These questions should
els. Offline (or batch) learning has the advantage of allowing the
be answered in order to solve evaluation issues in controlled envi-
use of more sophisticated mining techniques, which might be more
ronments before we create measures for real-world scenarios.
time-consuming or require a human expert. While the first allows
the processing of fast data that requires real-time processing and
adaptivity, the second allows processing of big data that requires
7. FROM ALGORITHMS TO DECISION longer processing time and larger abstraction.
Their combination can take place in many steps of the mining pro-
SUPPORT SYSTEMS cess, such as the data preparation and the preprocessing steps. For
While a lot of algorithmic methods for data streams are already example, offline learning on big data could extract fundamental and
available, their deployment in real applications with real streaming sustainable trends from data using batch processing and massive
data presents a new dimension of challenges. This section points parallelism. Online learning could then take real-time decisions
out two such challenges: making models simpler and dealing with
2
legacy systems. www.orange.fr

SIGKDD Explorations Volume 16, Issue 1 Page 7


from online events to optimize an immediate pay-off. In the online mining can make a decisive contribution to enhance and facilitate
advertisement application mentioned above, the user-click predic- the required monitoring tasks. Recently, we are planning to use the
tion is done within a context, defined for example by the currently ISS Columbus module as a technology demonstrator for integrat-
viewed page and the profile of the cookie. The decision which ing data stream processing and mining into the existing monitoring
banner to display is done online, but the context can be prepro- processes [31]. Figure 2 exemplifies the failure management sys-
cessed offline. By deriving meta-information such as the profile is tem (FMS) of the ISS Columbus module. While it is impossible to
a young male, the page is from the sport cluster, the offline com- simply redesign the FMS from scratch, we can outline the follow-
ponent can ease the online decision task. ing challenges.

7.1.3 Solving the right problem


Domain knowledge may help to solve many issues raised in this
paper, by systematically exploiting particularities of application do-
mains. However, this is seldom considered, as typical data stream
methods are created to deal with a large variety of domains. For in-
stance, in some domains the learning algorithm receives only par-
tial feedback upon its prediction, i.e. a single bit of right-or-wrong,
rather than the true label. In the user-click prediction example, if a 1. ISS Columbus module
user does not click on a banner, we do not know which one would
have been correct, but solely that the displayed one was wrong.
This is related to the issues on timing and availability of informa-
tion discussed in section 4.2.
5. Assembly, integration, 2. Ground control
However, building predictive models that systematically incorpo- centre
and test facility
rate domain knowledge or domain specific information requires
to choose the right optimization criteria. As mentioned in sec-
tion 6, the data stream setting requires optimizing multiple criteria
simultaneously, as optimizing only predictive performance is not
sufficient. We need to develop learning algorithms, which mini-
mize an objective function including intrinsically and simultane- 4. Mission archiv 3. Engineering support
ously: memory consumption, predictive performance, reactivity, centre
self monitoring and tuning, and (explainable) auto-adaptivity. Data
streams research is lacking methodologies for forming and opti- Figure 2: ISS Columbus FMS
mizing such criteria.
Therefore, models should be simple so that they do not depend on
a set of carefully tuned parameters. Additionally, they should com-
bine offline and online techniques to address challenges of big and 7.2.2 Complexity
fast data, and they should solve the right problem, which might Even though spacecraft monitoring is very challenging by itself,
consist in solving a multi-criteria optimization task. Finally, they it becomes increasingly difficult and complex due to the integra-
have to be able to learn from a small amount of data and with low tion of data stream mining into such legacy systems. However,
variance [37], to react quickly to drift. it was assumed to enhance and facilitate current monitoring pro-
cesses. Thus, appropriate mechanism are required to integrate data
7.2 Dealing with Legacy Systems stream mining into the current processes to decrease complexity.
In many application environments, such as financial services or
health care systems, business critical applications are in operation 7.2.3 Interlocking
for decades. Since these applications produce massive amounts of As depicted in Figure 2, the ISS Columbus module is connected
data, it becomes very promising to process these amounts of data to ground instances. Real-time monitoring must be applied aboard
by real-time stream mining approaches. However, it is often impos- where computational resources are restricted (e.g. processor speed
sible to change existing infrastructures in order to introduce fully and memory or power consumption). Near real-time monitoring or
fledged stream mining systems. Rather than changing existing in- long-term analysis must be applied on-ground where the downlink
frastructures, approaches are required that integrate stream mining suffers from latencies because of a long transmission distance, is
techniques into legacy systems. In general, problems concerning subject to bandwidth limitations, and continuously interrupted due
legacy systems are domain-specific and encompass both technical to loss of signal. Consequently, new data stream mining mecha-
and procedural issues. In this section, we analyze challenges posed nisms are necessary which ensure a smooth interlocking function-
by a specific real-world application with legacy issues the ISS ality of aboard and ground instances.
Columbus spacecraft module.
7.2.4 Reliability and Balance
7.2.1 ISS Columbus The reliability of spacecrafts is indispensable for astronauts health
Spacecrafts are very complex systems, exposed to very different and mission success. Accordingly, spacecrafts pass very long and
physical environments (e.g. space), and associated to ground sta- expensive planning and testing phases. Hence, potential data stream
tions. These systems are under constant and remote monitoring mining algorithms must ensure reliability and the integration of
by means of telemetry and commands. The ISS Columbus mod- such algorithms into legacy systems must not cause critical side
ule has been in operation for more than 5 years. For some time, effects. Furthermore, data stream mining is an automatic process
it is pointed out that the monitoring process is not as efficient as which neglects interactions with human experts, while spacecraft
previously expected [30]. However, we assume that data stream monitoring is a semi-automatic process and human experts (e.g.

SIGKDD Explorations Volume 16, Issue 1 Page 8


the flight control team) are responsible for decisions and conse- funded by the German Research Foundation, projects SP 572/11-1
quent actions. This problem poses the following question: How to (IMPRINT) and HU 1284/5-1, the Academy of Finland grant 118653
integrate data stream mining into legacy systems when automation (ALGODAN), and the Polish National Science Center grants
needs to be increased but the human expert needs to be maintained DEC-2011/03/N/ST6/00360 and DEC-2013/11/B/ST6/00963.
in the loop? Abstract discussions on this topic are provided by ex-
pert systems [44] and the MAPE-K reference model [24]. Expert 9. REFERENCES
systems aim to combine human expertise with artificial expertise
and the MAPE-K reference model aims to provide an autonomic [1] C. Aggarwal, editor. Data Streams: Models and Algorithms.
control loop. A balance must be struck which considers both afore- Springer, 2007.
mentioned aspects appropriately.
Overall, the Columbus study has shown that extending legacy sys- [2] C. Aggarwal and D. Turaga. Mining data streams: Systems
tems with real time data stream mining technologies is feasible and and algorithms. In Machine Learning and Knowledge Dis-
it is an important area for further stream-mining research. covery for Engineering Systems Health Management, pages
432. Chapman and Hall, 2012.
8. CONCLUDING REMARKS [3] R. Agrawal and R. Srikant. Privacy-preserving data mining.
In this paper, we discussed research challenges for data streams, SIGMOD Rec., 29(2):439450, 2000.
originating from real-world applications. We analyzed issues con-
[4] C. Anagnostopoulos, N. Adams, and D. Hand. Deciding what
cerning privacy, availability of information, relational and event
to observe next: Adaptive variable selection for regression in
streams, preprocessing, model complexity, evaluation, and legacy
multivariate data streams. In Proc. of the 2008 ACM Symp. on
systems. The discussed issues were illustrated by practical applica-
Applied Computing, SAC, pages 961965, 2008.
tions including GPS systems, Twitter analysis, earthquake predic-
tions, customer profiling, and spacecraft monitoring. The study of [5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom.
real-world problems highlighted shortcomings of existing method- Models and issues in data stream systems. In Proc. of the 21st
ologies and showcased previously unaddressed research issues. ACM SIGACT-SIGMOD-SIGART Symposium on Principles
Consequently, we call the data stream mining community to con- of Database Systems, PODS, pages 116, 2002.
sider the following action points for data stream research:
[6] C. Brodley, U. Rebbapragada, K. Small, and B. Wallace.
developing methods for ensuring privacy with incomplete Challenges and opportunities in applied machine learning. AI
information as data arrives, while taking into account the Magazine, 33(1):1124, 2012.
evolving nature of data;
[7] D. Brzezinski and J. Stefanowski. Reacting to different types
considering the availability of information by developing mod- of concept drift: The accuracy updated ensemble algorithm.
els that handle incomplete, delayed and/or costly feedback; IEEE Trans. on Neural Networks and Learning Systems.,
25:8194, 2014.
taking advantage of relations between streaming entities;
[8] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal
developing event detection methods and predictive models multi-armed bandits. In Proc. of the 22nd Conf. on Neural
for censored data; Information Processing Systems, NIPS, pages 273280, 2008.
developing a systematic methodology for streamed prepro- [9] D. Cox and D. Oakes. Analysis of Survival Data. Chapman &
cessing; Hall, London, 1984.
creating simpler models through multi-objective optimiza- [10] T. Dietterich. Machine-learning research. AI Magazine,
tion criteria, which consider not only accuracy, but also com- 18(4):97136, 1997.
putational resources, diagnostics, reactivity, interpretability;
[11] G. Ditzler and R. Polikar. Semi-supervised learning in non-
establishing a multi-criteria view towards evaluation, dealing stationary environments. In Proc. of the 2011 Int. Joint Conf.
with absence of the ground truth about how data changes; on Neural Networks, IJCNN, pages 2741 2748, 2011.
developing online monitoring systems, ensuring reliability of [12] W. Fan and A. Bifet. Mining big data: current status, and fore-
any updates, and balancing the distribution of resources. cast to the future. SIGKDD Explorations, 14(2):15, 2012.
As our study shows, there are challenges in every step of the CRISP [13] M. Gaber, J. Gama, S. Krishnaswamy, J. Gomes, and F. Stahl.
data mining process. To date, modeling over data streams has Data stream mining in ubiquitous environments: state-of-the-
been viewed and approached as an extension of traditional meth- art and current directions. Wiley Interdisciplinary Reviews:
ods. However, our discussion and application examples show that Data Mining and Knowledge Discovery, 4(2):116 138,
in many cases it would be beneficial to step aside from building 2014.
upon existing offline approaches, and start blank considering what
[14] M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data
is required in the stream setting.
streams: A review. SIGMOD Rec., 34(2):1826, 2005.

Acknowledgments [15] J. Gama. Knowledge Discovery from Data Streams. Chapman


& Hall/CRC, 2010.
We would like to thank the participants of the RealStream2013
workshop at ECMLPKDD2013 in Prague, and in particular Bern- [16] J. Gama, R. Sebastiao, and P. Rodrigues. On evaluating
hard Pfahringer and George Forman, for suggestions and discus- stream learning algorithms. Machine Learning, 90(3):317
sions on the challenges in stream mining. Part of this work was 346, 2013.

SIGKDD Explorations Volume 16, Issue 1 Page 9


[17] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and [33] D. Pyle. Data Preparation for Data Mining. Morgan Kauf-
A. Bouchachia. A survey on concept-drift adaptation. ACM mann Publishers Inc., 1999.
Computing Surveys, 46(4), 2014.
[34] T. Raeder and N. Chawla. Model monitor (m2 ): Evaluat-
[18] J. Gantz and D. Reinsel. The digital universe in 2020: Big ing, comparing, and monitoring models. Journal of Machine
data, bigger digital shadows, and biggest growth in the far Learning Research, 10:13871390, 2009.
east, December 2012.
[35] P. Rodrigues and J. Gama. Distributed clustering of ubiqui-
[19] A. Goldberg, M. Li, and X. Zhu. Online manifold regular- tous data streams. WIREs Data Mining and Knowledge Dis-
ization: A new learning setting and empirical study. In Proc. covery, pages 3854, 2013.
of the European Conf. on Machine Learning and Principles [36] T. Sakaki, M. Okazaki, and Y. Matsuo. Tweet analysis for
of Knowledge Discovery in Databases, ECMLPKDD, pages real-time event detection and earthquake reporting system de-
393407, 2008. velopment. IEEE Trans. on Knowledge and Data Engineer-
[20] I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selec- ing, 25(4):919931, 2013.
tion: Beyond the bayesian/frequentist divide. Journal of Ma- [37] C. Salperwyck and V. Lemaire. Learning with few examples:
chine Learning Research, 11:6187, 2010. An empirical study on leading classifiers. In Proc. of the 2011
[21] M. Hassani and T. Seidl. Towards a mobile health context Int. Joint Conf. on Neural Networks, IJCNN, pages 1010
prediction: Sequential pattern mining in multiple streams. 1019, 2011.
In Proc. of , IEEE Int. Conf. on Mobile Data Management, [38] B. Settles. Active Learning. Synthesis Lectures on Artificial
MDM, pages 5557, 2011. Intelligence and Machine Learning. Morgan and Claypool
Publishers, 2012.
[22] H. He and Y. Ma, editors. Imbalanced Learning: Founda-
tions, Algorithms, and Applications. IEEE, 2013. [39] A. Shaker and E. Hullermeier. Survival analysis on data
streams: Analyzing temporal events in dynamically chang-
[23] T. Hoens, R. Polikar, and N. Chawla. Learning from stream- ing environments. Int. Journal of Applied Mathematics and
ing data with concept drift and imbalance: an overview. Computer Science, 24(1):199212, 2014.
Progress in Artificial Intelligence, 1(1):89101, 2012.
[40] C. Shearer. The CRISP-DM model: the new blueprint for data
[24] IBM. An architectural blueprint for autonomic computing. mining. J Data Warehousing, 2000.
Technical report, IBM, 2003.
[41] Z. Siddiqui, M. Oliveira, J. Gama, and M. Spiliopoulou.
[25] E. Ikonomovska, K. Driessens, S. Dzeroski, and J. Gama. Where are we going? predicting the evolution of individu-
Adaptive windowing for online learning from multiple inter- als. In Proc. of the 11th Int. Conf. on Advances in Intelligent
related data streams. In Proc. of the 11th IEEE Int. Conf. on Data Analysis, IDA, pages 357368, 2012.
Data Mining Workshops, ICDMW, pages 697704, 2011.
[42] Z. Siddiqui and M. Spiliopoulou. Classification rule mining
[26] A. Kotov, C. Zhai, and R. Sproat. Mining named entities for a stream of perennial objects. In Proc. of the 5th Int. Conf.
with temporally correlated bursts from multilingual web news on Rule-based Reasoning, Programming, and Applications,
streams. In Proc. of the 4th ACM Int. Conf. on Web Search and RuleML, pages 281296, 2011.
Data Mining, WSDM, pages 237246, 2011.
[43] M. Spiliopoulou and G. Krempl. Tutorial mining multiple
[27] G. Krempl. The algorithm APT to classify in concurrence of threads of streaming data. In Proc. of the Pacific-Asia Conf.
latency and drift. In Proc. of the 10th Int. Conf. on Advances on Knowledge Discovery and Data Mining, PAKDD, 2013.
in Intelligent Data Analysis, IDA, pages 222233, 2011.
[44] D. Waterman. A Guide to Expert Systems. Addison-Wesley,
[28] M. Last and H. Halpert. Survival analysis meets data stream 1986.
mining. In Proc. of the 1st Worksh. on Real-World Challenges [45] W. Young, G. Weckman, and W. Holland. A survey of
for Data Stream Mining, RealStream, pages 2629, 2013. methodologies for the treatment of missing values within
[29] F. Nelwamondo and T. Marwala. Key issues on computational datasets: limitations and benefits. Theoretical Issues in Er-
intelligence techniques for missing data imputation - a review. gonomics Science, 12, January 2011.
In Proc. of World Multi Conf. on Systemics, Cybernetics and [46] B. Zhou, Y. Han, J. Pei, B. Jiang, Y. Tao, and Y. Jia. Continu-
Informatics, volume 4, pages 3540, 2008. ous privacy preserving publishing of data streams. In Proc.
of the 12th Int. Conf. on Extending Database Technology,
[30] E. Noack, W. Belau, R. Wohlgemuth, R. Muller, S. Palumberi,
EDBT, pages 648659, 2009.
P. Parodi, and F. Burzagli. Efficiency of the columbus failure
management system. In Proc. of the AIAA 40th Int. Conf. on [47] I. Zliobaite. Controlled permutations for testing adaptive
Environmental Systems, 2010. learning models. Knowledge and Information Systems, In
Press, 2014.
[31] E. Noack, A. Luedtke, I. Schmitt, T. Noack, E. Schaumloffel,
E. Hauke, J. Stamminger, and E. Frisk. The columbus module [48] I. Zliobaite, A. Bifet, M. Gaber, B. Gabrys, J. Gama,
as a technology demonstrator for innovative failure manage- L. Minku, and K. Musial. Next challenges for adaptive learn-
ment. In German Air and Space Travel Congress, 2012. ing systems. SIGKDD Explorations, 14(1):4855, 2012.
[32] M. Oliveira and J. Gama. A framework to monitor clusters [49] I. Zliobaite and B. Gabrys. Adaptive preprocessing for
evolution applied to economy and finance problems. Intelli- streaming data. IEEE Trans. on Knowledge and Data Engi-
gent Data Analysis, 16(1):93111, 2012. neering, 26(2):309321, 2014.

SIGKDD Explorations Volume 16, Issue 1 Page 10


Twitter Analytics: A Big Data Management Perspective

Oshini Goonetilleke , Timos Sellis , Xiuzhen Zhang , Saket Sathe


Twitter Analytics:
School of ComputerA
Bigand
Science Data Management
IT, RMIT

Perspective
University, Melbourne, Australia
IBM Melbourne Research Laboratory, Australia
{oshini.goonetilleke, timos.sellis, xiuzhen.zhang}@rmit.edu.au, ssathe@au.ibm.com
Oshini Goonetilleke , Timos Sellis , Xiuzhen Zhang , Saket Sathe

School of Computer Science and IT, RMIT University, Melbourne, Australia

IBM Melbourne Research Laboratory, Australia
ABSTRACT
{oshini.goonetilleke, timos.sellis, xiuzhen.zhang @rmit.edu.au,
The }systems ssathe@au.ibm.com
that perform analysis in the context of these
interactions typically involve the following major compo-
With the inception of the Twitter microblogging platform
nents: focused crawling, data management and data ana-
in 2006, a myriad of research efforts have emerged studying
lytics. Here, data management comprises of information ex-
different aspects of the Twittersphere. Each study exploits
traction, pre-processing, data modeling and query process-
ABSTRACT
its own tools and mechanisms to capture, store, query and The components.
ing systems that Figure
perform1 analysis
shows a in the diagram
block context of of these
such
analyze Twitter data. Inevitably, platforms have been de- interactions
With the a system andtypically involve the following
depicts interactions major compo-
between various
veloped to inception
replace this of ad-hoc
the Twitter microblogging
exploration with a more platform
struc- nents: Until
focused crawling, data
in 2006, a myriad of research efforts have emerged studying nents. now, there has beenmanagement and data
significant amount ana-
of prior
tured and methodological form of analysis. Another body lytics. Here, dataimproving
management
different aspects of the research around eachcomprises of information
of the components shownex- in
of literature focuses on Twittersphere.
developing languagesEach study exploits
for querying traction,
its own tools and mechanisms to capture, Figure 1, pre-processing,
but to the best data
of ourmodeling
knowledge,andthere
queryhave
process-
been
Tweets. This paper addresses issues aroundstore,
the bigquery
data andna- ing frameworks
components.that Figure 1 shows a block diagram of such
analyze Twitterand data. Inevitably, no propose a unified approach to Twitter
ture of Twitter emphasizes theplatforms
need for new havedata
been de-
man- a system and depicts
veloped data management thatinteractions
seamlessly between
integratevarious
all thesecompo-
com-
agementtoand replace
querythis ad-hoc frameworks
language exploration withthat aaddress
more struc-
limi- nents. Until now, there has been significant amount
tured and methodological ponents. Following these observations, in this paperofweprior
ex-
tations of existing systems.form We of analysis.
review Another
existing body
approaches research
of literature focuses on developing languages for querying tensivelyaround
survey improving each that
the techniques of the components
have shownfor
been proposed in
that were developed to facilitate twitter analytics followed Figure 1,each
but of
to the
thecomponents
best of our knowledge, there1,have
Tweets. This paper addressesissuesissuesandaround the big data na- realising shown in Figure and been
then
by a discussion on research technical challenges no frameworks that propose a unified approach to Twitter
ture of Twitterintegrated
and emphasizes the need for new data man- motivate the need and challenges of a unified framework for
in developing solutions. data management that seamlessly integrate all these com-
agement and query language frameworks that address limi- managing Twitter data.
ponents. Following these observations, in this paper we ex-
tations of existing systems. We review existing approaches
1. INTRODUCTION
that were developed to facilitate twitter analytics followed
tensively survey the techniques that have been proposed for
The realising each of the components shown in Figure 1, and then
by a massive
discussion growth of dataissues
on research generated from social
and technical media
challenges
sources have resulted in solutions.
a growing interest on efficient and motivate the need and challenges of a unified framework for
in developing integrated
effective means of collecting, analyzing and querying large managing Twitter data.
volumes of social data. The existing platforms exploit sev-
1. INTRODUCTION
eral characteristics of big data, including large volumes of
The massive
data, velocity growth
due to theof data generated
streaming naturefrom socialand
of data, media
va-
sources
riety duehave resulted
to the in a growing
integration of data interest
from theon webefficient
and otherand
effective means
sources. Hence,ofsocial
collecting,
network analyzing and querying
data presents large
an excellent
volumes for
testbed of social
research data. Thedata.
on big existing platforms exploit sev-
eral characteristics
In particular, onlineof social
big data, including
networking andlarge volumes of
microblogging
data, velocity
platform Twitterdue has
to the
seen streaming
exponentialnature of data,
growth anduser
in its va-
riety due toitsthe
base since integration
inception of data
in 2006 withfrom
nowthe web200
over andmillion
other
sources.
monthly activeHence,userssocial network500
producing data presents
million tweetsan (Twitter-
excellent
testbed the
sphere, for research
postings on madebig data.
to Twitter) daily1 . A wide re-
In particular,
search community online
hassocial networking since
been established and microblogging
then with the
platform
hope Twitter has seen
of understanding exponential
interactions growth in
on Twitter. Foritsexam-
user
basestudies
ple, since itshaveinception in 2006 in
been conducted with
manynowdomains
over 200 million
exploring Figure 1: An abstraction of a Twitter data management
monthly active
different users producing
perspectives 500 million
of understanding human tweets (Twitter-
behavior.
1 platform
sphere,research
Prior the postings
focusesmade to Twitter)
on a variety . A wide
dailyincluding
of topics re-
opin-
search
ion miningcommunity has event
[12, 14, 38], been established
detection [46,since
65, then with the
76], spread of In our survey of existing literature we observed ways in
hope of understanding
pandemics interactions
[26,58,68], celebrity on Twitter.
engagement [74] andForanalysis
exam-
which researchers have tried to develop general platforms
ple,political
of studies discourse
have been[28, conducted
40, 70]. in many
These domains
types exploring
of efforts have Figure 1: aAn abstraction of a Twitter data management
to provide repeatable foundation for Twitter data analyt-
different perspectivestoofunderstand
enabled researchers understanding human behavior.
interactions on Twitter platform
ics. We review the tweet analytics space primarily focusing
Prior
relatedresearch focuses
to the fields on a varietyeducation,
of journalism, of topics including
marketing,opin- dis- on the following key elements:
ion mining
aster [12, 14, 38], event detection [46, 65, 76], spread of
relief etc. In our survey of existing literature we observed ways in
pandemics [26,58,68], celebrity engagement [74] and analysis Data Collection. Researchers have several options for
1 which researchers have tried to develop general platforms
ofhttp://tnw.to/s0n9u
political discourse [28, 40, 70]. These types of efforts have collecting a suitable data set from Twitter. In Section 2
to provide a repeatable foundation for Twitter data analyt-
enabled researchers to understand interactions on Twitter we briefly describe mechanisms and tools that focus pri-
ics. We review the tweet analytics space primarily focusing
related to the fields of journalism, education, marketing, dis- marily on facilitating the initial data acquisition phase.
on the following key elements:
aster relief etc. These tools systematically capture the data using any of
Data Collection. Researchers have several options for
1
http://tnw.to/s0n9u collecting a suitable data set from Twitter. In Section 2
we briefly describe mechanisms and tools that focus pri-
marily on facilitating the initial data acquisition phase.
These tools systematically capture the data using any of

SIGKDD Explorations Volume 16, Issue 1 Page 11


the Twitters publicly accessible APIs. stream) of the public Twitter stream in real-time. Applica-
Data management frameworks. In addition to pro- tions where this rate limitation is too restrictive rely on third
viding a module for crawling tweets, these frameworks party resellers like GNIP, DataSift or Topsy 3 , who provides
provide support for pre-processing, information extrac- access to the entire collection of the tweets known as the
tion and/or visualization capabilities. Prepossessing deals Twitter FireHose. At a cost, resellers can provide unlim-
the Twitters publicly accessible APIs. stream)
ited access of the public Twitter
to archives streamdata,
of historical in real-time.
real-timeApplica-
stream-
with preparing tweets for data analysis. Information ex-
traction
Data management frameworks. tions where this rate limitation is too
ing data or both. It is mostly the corporate 3businessesrestrictive rely on third
who
aims to derive more insightInfrom addition
the to pro-
tweets,
viding party
opt forresellers like GNIP,toDataSift
such alternatives gain insightsor Topsy , whoconsumer
into their provides
which isa not module for crawling
directly reported tweets, these frameworks
by the Twitter API, e.g.
provide support access to the entire
and competitor collection of the tweets known as the
patterns.
a sentiment for afor givenpre-processing,
tweet. Several information
frameworks extrac-
pro-
tion and/or visualization capabilities. Prepossessing deals Twitter FireHose.
In order to obtain a dataset At a cost,sufficient
resellersfor cananprovide
analysisunlim-
task,
vide functionality to present the results in many output
with ited access to archives of historical data, real-time APIstream-
formspreparing tweets for
of visualizations data
while analysis.
others Information
are built exclusivelyex- it is necessary to efficiently query the respective meth-
traction aims to derive more insight from the tweets, ing
ods,data
within or the
both. It is mostly
bounds of imposedthe corporate
rate limits. businesses
The requestswho
to search over a large tweet collection. In Section 3 we
which opt for such alternatives to gain insights into their consumer
review isexisting
not directly reported by frameworks.
data management the Twitter API, e.g. to the API may have to run continuously spanning across
a sentiment for a given tweet. Several frameworks pro- and competitor
several days or weeks. patterns.Creating the users social graph for
Languages for querying tweets. A growing body In order to obtain a dataset sufficient for an analysis
vide functionality to present the results in many output a community of interest requires additional modules task,that
of literature proposes declarative query languages as a it is necessary to efficiently query thecrawls
respective
forms of visualizations while others are built exclusively crawl user accounts iteratively. Large with API
moremeth-
com-
mechanism of extracting structured information from tweets. ods,
to search over a large tweet collection. In Section 3 we pletewithin
coverage thewasbounds
madeofpossible
imposedwith ratethelimits.
use of The requests
whitelisted
Languages present end users with a set of primitives ben- to
review existing data management frameworks. accounts [21, 45] and using the computation power of across
the API may have to run continuously spanning cloud
eficial in exploring the Twittersphere along different di- several days[55]. or weeks.
Languages computing Due to Creating the users
Twitters current socialwhitelisted
policy, graph for
mensions. Infor querying
Section tweets. Adeclarative
4 we investigate growing body lan- a community of interest requires
of literature proposes declarative query accounts are discontinued and areadditional
no longer modulesan option that
as
guages and similar systems developed for languages
querying aasva- a
crawl user accounts
mechanism of extracting means of large dataiteratively.
collection.Large crawls with
Distributed more com-
systems have
riety of tweet properties.structured information from tweets. plete coverage was
Languages present end users with a set of primitives ben- been developed [16,made
45] topossible with the userunning,
make continuously of whitelisted
large
We eficial
have in identified thetheessential ingredients accounts
scale crawls feasible. There are other solutions that of
[21, 45] and using the computation power cloud
exploring Twittersphere alongfor a unified
different di- provide
Twitter computingfeatures[55]. Due to to Twittersthe current policy, whitelisted
mensions. In Section 4 we investigate declarative that
data management solution, with the intention lan- extended process incoming Twitter data as
an analyst will similar
easily be able to extend its accounts
discussed are discontinued and are no longer an option as
guages and systems developed for capabilities
querying a va- for next.
specific means of large data collection. Distributed systems have
riety types
of tweet of properties.
research. Such a solution will allow the
data analyst to focus on the use cases of the analytics task been developed [16, 45] to make continuously running, large
We conveniently
by have identified using thethe essential ingredients
functionality providedfor bya unified
an in- 3.
scale DATA MANAGEMENT
crawls feasible. There are otherFRAMEWORKS
solutions that provide
Twitter
tegrated data management
framework. solution,
In Section 5 wewith the intention
present that
our position extended features to process the incoming Twitter data as
an analyst will easily be able to extend
and emphasize the need for integrated solutions that address its capabilities for 3.1
discussedFocused
next. Crawlers
specific types
limitations of research.
of existing systems.SuchSection
a solution will allow
6 outlines the
research The focus in studies like TwitterEcho [16] and Byun et
data
issuesanalyst to focus associated
on the usewith casestheof development
the analyticsoftask al. [19] is data collection, where the primary contributions
by
and challenges
tegrated platforms. Finally we conclude in Sectionby7. an in-
conveniently using the functionality provided
in-
3. DATA MANAGEMENT FRAMEWORKS
are driven by crawling strategies for effective retrieval and
tegrated framework. In Section 5 we present our position better coverage. TwitterEcho describes an open source dis-
and emphasize the need for integrated solutions that address 3.1 Focused Crawlers
tributed crawler for Twitter. Data can be collected from
2. OPTIONS
limitations of existing FOR DATA
systems. COLLECTION
Section 6 outlines research The focus in studies like
a focused community TwitterEcho
of interest and it [16]adapts anda Byun
central-et
Researchers
issues have several
and challenges optionswith
associated when thechoosing an API
development of for
in- al. [19] is data collection, where the primary
ized distributed architecture in which multiple thin clients contributions
data collection,
tegrated platforms. i.e. Finally
the Search, Streaming
we conclude and the7.REST
in Section are deployed
are driven byto crawling
create astrategies
scalable for effective
system. retrieval and
TwitterEcho de-
API. Each API has varying capabilities with respect to the better
vises a coverage.
user expansion TwitterEcho
strategydescribes
by whichanthe open
userssource dis-
follower
type and the amount of information that can be retrieved. tributed crawler iteratively
for Twitter. Data thecan be collected
API. Thefrom
2. OPTIONS FOR DATA COLLECTION
The Search API is dedicated for running searches against an
lists are crawled
a focused community
using
of interest andfor
REST
it controlling
adapts a central-
sys-
tem also includes modules responsible the ex-
Researchers
index of recent have several
tweets. It options when choosing
takes keywords as queries an with
API the
for ized distributed
pansion strategy.architecture in whichfeature
The user selection multiple thin clients
identifies user
data collection,
possibility i.e. the
of multiple Search,
queries Streaming
combined as aand the REST
comma sepa- are deployed
accounts to be to monitored
create a scalable
by the system.
system with TwitterEcho
modules de- for
API.
rated Each
list. A API has varying
request capabilities
to the search with respect
API returns to the
a collection vises a user expansion
user profile analysis and strategy
language by which the usersThe
identification. follower
user
type
of and the
relevant amount
tweets of information
matching that can be retrieved.
a user query. lists are crawled
selection feature iteratively
is customized usingtothe REST
crawl the API. Thecom-
focused sys-
The
The Search
Streaming APIAPI is dedicated
providesfor runningtosearches
a stream against
continuously an
cap- tem alsoofincludes
munity Portuguese modules responsible
tweets but can for be controlling
adapted to the ex-
target
index of recent tweets. It takes keywords
ture the public tweets where parameters are provided to as queries with the pansion strategy. The
other communities. Byun useret selection
al. [19] infeature
their workidentifies
propose usera
possibility of multiple
filter the results of the queries
stream combined
by hashtags, as keywords,
a comma sepa- twit- accounts
rule-basedtodata be monitored
collection tool by the system with
for Twitter with modules
the focusfor of
rated
ter user ids, usernames or geographic regions. aThe
list. A request to the search API returns collection
REST user profilesentiment
analysing analysis of and language
Twitter identification.
messages. The user
It is a java-based
of relevant
API can betweets used to matching
retrievea auser query.of the most recent
fraction selection feature is customized to crawl
open source tool developed using the Drools rule engine. the 4focused com-
The
tweets Streaming
publishedAPI by aprovides
Twitter auser.streamAll to continuously
three APIs limitcap- the munity
They stressof Portuguese
the importance tweetsofbut an can be adapted
automated data to target
collector
ture the public tweets where parameters
number of requests within a time window and rate-limits are provided are to other communities.
that also Byun et al. data
filters out unnecessary [19] insuchtheir work messages.
as spam propose a
filter the
posed at results
the userof and the stream by hashtags,
the application level.keywords,
Responsetwit- ob- rule-based data collection tool for Twitter with the focus of
ter userfrom
tained ids, Twitter
usernames APIorisgeographic
generally in regions.
the JSON Theformat.
REST 3.2
analysingPre-processing
sentiment of Twitter andmessages.
Information Extrac-
It is a java-based
API can be used
Third party libraries to 2retrieve a fraction of the most recent
are available in many programming tion tool developed using the Drools4 rule engine.
open source
tweets published by a Twitter user. API.
All three
TheseAPIs limitpro-
the They stress
languages for accessing the Twitter libraries Apart from the dataimportance
collection, of an automated
several frameworks data collector
implement
number of requestsand within
providea methods
time window and rate-limitsand are that also filters out unnecessary data such as spam
vide wrappers for authentication methods to perform extensive pre-processing andmessages.
informa-
posed at the user
other functions and the application
to conveniently access thelevel.
API. Response ob-
tion extraction of the tweets. Pre-processing tasks of Trend-
tained
Publicly from Twitter
available APIsAPI doisnot
generally
guarantee in the JSONcoverage
complete format. 3.2 Pre-processing and Information Extrac-
Miner [61] take into account the challenges posed by the
Third party libraries 2
are available in
of the data for a given query as the feeds are not designed many programming tion
noisy genre of tweets. Tokenization, stemming and POS
languages
for enterprise for accessing
access. For theexample,
Twitter API. These libraries
the streaming pro-
API only Apart from data collection, several frameworks implement
vide wrappers and provide 3
provides a random sample methods for authentication
of 1% (known as the Spritzer and http://gnip.com/,
methods http://datasift.com/,
to perform extensive http:
pre-processing and informa-
other functions to conveniently access the API. //about.topsy.com/
tion extraction of the tweets. Pre-processing tasks of Trend-
2 4
https://dev.twitter.com/docs/twitter-libraries
Publicly available APIs do not guarantee complete coverage http://drools.jboss.org/
Miner [61] take into account the challenges posed by the
of the data for a given query as the feeds are not designed noisy genre of tweets. Tokenization, stemming and POS
for enterprise access. For example, the streaming API only
3
provides a random sample of 1% (known as the Spritzer http://gnip.com/, http://datasift.com/, http:
//about.topsy.com/
2 4
https://dev.twitter.com/docs/twitter-libraries http://drools.jboss.org/

SIGKDD Explorations Volume 16, Issue 1 Page 12


tagging are some of the text processing tasks that better phasis on selection of a proper data set for the definition of
prepare tweets for the analysis task. The platform pro- a campaign. The platform consists of three layers; campaign
vides separate built-in modules to extract information such crawling layer, integrated modeling layer and the data anal-
as location, language, sentiment and named entities that ysis layer. In the campaign crawling layer, a configuration
are deemed very useful in data analytics. The creation of a module follows an iterative approach to ensure the cam-
tagging
pipeline are sometools
of these of theallowstextthe processing
data analyst tasks to that
extend better
and phasis on selection
paign converges to of a proper
a proper setdata set for(keywords).
of filters the definition of
Col-
prepare tweets for the analysis
reuse each component with relative ease. task. The platform pro- a campaign. The platform consists of
lected tweets, metadata and the community data (relation- three layers; campaign
vides separate
TwitIE [17] is built-in
another modules
open-source to extract information
information extractionsuch crawling
ships among layer, integrated
Twitter users) modeling
are stored layer in and
a graph the data anal-
database.
as
NLP location,
pipelinelanguage,
customized sentiment
for microblogand namedtext. For entities that
the pur- ysis
Thislayer.
study In the campaign
should be highlighted crawling
for itslayer, a configuration
distinction to allow
are
pose of information extraction (IE), the general purposeofIE
deemed very useful in data analytics. The creation a module follows
for a flexible an iterative
querying mechanism approachon top to ensure
of a data themodel
cam-
pipeline of
pipeline these tools
ANNIE is used allows
and the data analyst
it consists to extendsuch
of components and paignon
built converges
raw data. to The
a proper
modelset of filters (keywords).
is generated in the integrated Col-
reuse each component
as sentence splitter, POS with relative
tagger and ease.
gazetteer lists (for loca- lected
modeling tweets,
layermetadata
and comprisesand thea community
representation dataof(relation-
associa-
TwitIE [17] is another open-source
tion prediction). Each step of the pipeline information extraction
addresses draw- ships between
tions among Twitter terms (e.g.users) are stored
hashtags) used in in
a graph
tweets database.
and their
NLP pipeline
backs in traditionalcustomized
NLP systemsfor microblog text. For
by addressing the the pur-
inherent This studyinshould
evolution time. be highlighted
Their approach forisits distinctionastoitallow
interesting cap-
pose of information
challenges in microblog extraction
text. As (IE), the general
a result, purpose
individual com- IE for
turesa the
flexible
oftenquerying
overlooked mechanism
temporalon top of a In
dimension. datathemodel
third
pipeline of
ponents ANNIE
ANNIE is are
usedcustomized.
and it consists of components
Language identification,such built analysis
data on raw data. layer,The model
a query is generated
language is usedintothe integrated
design a tar-
as sentence splitter,
tokenisation, POS tagger
normalization, POS and gazetteer
tagging and lists
named (forentity
loca- modeling
get view oflayer and comprises
the campaign data athat representation
corresponds of to associa-
a set of
tion prediction). Each step of the pipeline
recognition is performed with each module reporting accu- addresses draw- tions between
tweets that containterms (e.g. hashtags)
for example, the used in tweets
answer to anand their
opinion
backson
racy in tweets.
traditional NLP systems by addressing the inherent evolution in time. Their approach is interesting as it cap-
mining question.
challenges
Baldwin [11] in presents
microblog text. As
a system a result,
designed for individual
event detectioncom- tures
Whilethe often overlooked
including components temporal
for capture dimension.
and storage In the third
of tweets,
ponents of ANNIE are customized. Language
on Twitter with functionality for pre-processing. JSON re- identification, data
additional tools have been developed to search through tar-
analysis layer, a query language is used to design a the
tokenisation,
sults returnednormalization,
by the Streaming POSAPI tagging
are andparsed named
and entity
piped get view of
collected the campaign
tweets. data that
The architecture of corresponds
CoalMine [72] to presents
a set of
recognition is performed with each module
through language filtering and lexical normalisation compo- reporting accu- tweets
a socialthatnetwork containdatafor example,
mining system thedemonstrated
answer to anon opinion
Twit-
racy on tweets.
nents. Messages that do not have location information are mining
ter, question.
designed to process large amounts of streaming social
Baldwin [11] using
geo-located, presents a system designed
probabilistic models since for eventits detection
a critical While
data. Theincluding
ad-hoc components
query toolfor capturean
provides and end storage of tweets,
user with the
on
issueTwitter with functionality
in identifying where an event for pre-processing.
occurs. Information JSON ex- re- additional tools have been developed
ability to access one or more data files through a Google- to search through the
sults returned
traction modules by the Streaming
require knowledge API from
are parsed
external andsources
piped collected
like searchtweets.
interface.The Appropriate
architecture of CoalMine
support [72] presents
is provided for a
through language filtering
and are generally and lexical
more expensive tasksnormalisation
than language compo-
pro- a
setsocial network
of boolean anddata mining
logical system demonstrated
operators for ease of querying on Twit-on
nents.
cessing. Platforms that support real-time analysis [11, are
Messages that do not have location information 76] ter, of
top designed
a standard to process
Apache largeLucene amounts
index. of The streaming social
data collection
geo-located,
require using probabilistic
processing tasks to be conducted models since its a critical
on-the-fly where data. The ad-hoc
and storage component query is tool provides an
responsible for end user withcon-
establishing the
issue in identifying
the speed where analgorithms
of the underlying event occurs. Information
is a crucial consider- ex- ability to access one or more data
nections to the REST API and to store the JSON objects files through a Google-
traction modules require knowledge from external sources
ation. like search
returned ininterface.
compressed Appropriate
formats. support is provided for a
and are generally more expensive tasks than language pro- set of boolean
In building and logical
support platforms operators for ease to
it is necessary of querying
make provi- on
3.3 Generic
cessing. Platforms Platforms
that support real-time analysis [11, 76] top of a standard Apache Lucene index.
sions for practical considerations such as processing big data. The data collection
require processing tasks to be conducted on-the-fly where and storage component
TrendMiner [61] facilitates is responsible
real-time analysisfor establishing
of tweets con-and
There are several proposals in which researchers have tried
the speed of the underlying algorithms is a crucial consider- nections
takes intotoconsideration
the REST API and to and
scalability store the JSON
efficiency objects
of process-
to develop generic platforms to provide a repeatable foun-
ation. returned
ing large in compressed
volumes formats.
of data. TrendMiner makes an effort to
dation for Twitter data analytics. Twitter Zombie [15] is a
In building
unify some support platforms
of the existing textit processing
is necessarytools to make provi-
for Online
platform to unify the data gathering and analysis methods
3.3 Generic Platforms
by presenting a candidate architecture and methodological
sions
Socialfor practical considerations
Networking (OSN) data, with such as processing
emphasis big data.
on adapting
There are several proposals in which TrendMiner [61] facilitates real-time analysisbatches of tweets and
approach for examining specific parts researchers have tried
of the Twittersphere. to real-life scenarios that include processing of mil-
to develop generic platforms to provide a repeatable foun- takes
lions ofintodata.
consideration scalability
They envision the and
system efficiency
to be of process-
developed
It outlines architecture for standard capture, transforma-
dation foranalysis
Twitterofdata analytics. Twitter Zombie [15] is a ing large batch-mode
for both volumes of data. and online TrendMiner
processing. makes an effort
TwitIE to
[17] as
tion and Twitter interactions using the Twitters
platform to unify the data gathering and analysis methods unify someinofthe
discussed the existingsection
previous text processing
is anothertools for Online
open-source in-
Search API. This tool is designed to gather data from Twit-
by presenting a candidate architecture and methodological Social
formation Networking
extraction (OSN) data, with
NLP pipeline emphasisfor
customized onmicroblog
adapting
ter by executing a series of independent search jobs on a
approach to real-life scenarios that include processing batches of mil-
continual for basisexamining specific parts
and the collected tweets of and
the their
Twittersphere.
metadata text.
It outlines architecture for standard capture, transforma- lions of data. They envision the system to be developed
is kept in a relational DBMS. One of the interesting features
tion
of and analysis ofisTwitter
TwitterZombie its abilityinteractions
to capture using the Twitters
hierarchical rela- 3.4
for both Application-specific
batch-mode and online Platforms processing. TwitIE [17] as
Search API. Thisdata
toolreturned
is designed to gatherAdata from trans-
Twit- discussed in the previous section is another open-source in-
tionships in the by Twitter. network Apart from the above mentioned general purpose platforms,
ter formation extraction NLP pipeline customized for microblog
latorbymodule
executing a series
performs of independentonsearch
post-processing jobs on
the tweets anda there are many frameworks targeted at conducting specific
continual basis and the collected tweets separately
and their metadata text.
stores hashtags, mentions and retweets, from the types of analysis with Twitter data. Emergency Situation
is kepttext.
tweet in a relational
Raw tweets DBMS. One of the interesting
are transformed features
into a representa- Awareness (ESA) [76] is a platform developed to detect, as-
of TwitterZombie
tion of interactionsistoitscreate ability to capture
networks hierarchical
of retweets, rela-
mentions 3.4 Application-specific Platforms
sess, summarise and report messages of interest published
tionships
and usersinmentioning
the data returned
hashtags. by Twitter.
This feature A network
captured trans-by Apart
on from the
Twitter for above mentioned general
crisis coordination tasks.purpose platforms,
The objective of
lator module performs
TwitterZombie, which other post-processing
studies pay on theattention
little tweets and to, there are many
their work is to frameworks
convert large targeted
streamsatofconducting
social media specific
data
stores
is hashtags,
helpful mentions
in answering and retweets,
different types ofseparately from the
research questions typesuseful
into of analysis
situation with Twitterinformation
awareness data. Emergency in real-time.Situation
The
tweet text. Raw
with relative ease.tweets
Socialare transformed
graphs are created intoina therepresenta-
form of Awareness
ESA platform (ESA) [76] isofa modules
consists platformto developed to detect,con-
detect incidents, as-
tion of interactions
a retweet or mention tonetwork
create networks
and theyofdoretweets,
not crawl mentions
for the sess, summarise and report messages
dense and summarise messages, classify messages of high of interest published
and
user users
graphmentioning
with traditional hashtags.followingThisrelationships.
feature captured It alsoby on Twitter
value, identify forand crisis coordination
track issues and tasks.
finally to The objective
conduct of
foren-
TwitterZombie, which other studies pay
draws discussion on how multi-byte tweets in languages like little attention to, their work isoftohistorical
sic analysis convert largeevents. streams of socialare
The modules media data
enriched
is helpful
Arabic or in answering
Chinese can be different
storedtypes of researchtranslitera-
by performing questions into
by a useful
suite of situation awareness
visualisation information
interfaces. Baldwin in real-time. The
et al. [11] pro-
with
tion. relative ease. Social graphs are created in the form of ESA another
pose platformsupport consistsplatform
of modules to detect
focused incidents,
on detecting con-
events
a retweet
More or mention
recently, TwitHoardnetwork [69]and they do
suggests not crawl of
a framework forsup-
the dense
on and summarise
Twitter. The Twitter messages,
stream isclassify
queried messages
with a setof of high
key-
user graph
porting with traditional
processors for data following
analytics relationships.
on Twitter with It also
em- value, specified
words identify and track
by the user issues
withandthe finally
objective to conduct
of filteringforen-
the
draws discussion on how multi-byte tweets in languages like sic analysis of historical events. The modules are enriched
Arabic or Chinese can be stored by performing translitera- by a suite of visualisation interfaces. Baldwin et al. [11] pro-
tion. pose another support platform focused on detecting events
More recently, TwitHoard [69] suggests a framework of sup- on Twitter. The Twitter stream is queried with a set of key-
porting processors for data analytics on Twitter with em- words specified by the user with the objective of filtering the

SIGKDD Explorations Volume 16, Issue 1 Page 13


stream on a topic of interest. The results are piped through management frameworks concentrate on a set of challenges
text processing components and the geo-located tweets are more than others.
visualised on a map for better interaction. Clearly, platforms We aim to recognize the key ingredients of an integrated
of this nature that deal with incident exploration need to framework that takes into account shortcomings of existing
make provisions for real-time analysis of the incoming Twit- systems.
stream
ter streamon aand topic of interest.
produce The visualizations
suitable results are piped through
of detected management frameworks concentrate on a set of challenges
text processing
incidents. components and the geo-located tweets are more than others.
visualised on a map for better interaction. Clearly, platforms 4.
We aimLANGUAGES
to recognize the FOR key QUERYING
ingredients of anTWEETS integrated
3.5this Data
of natureModel that deal and withStorage
incidentMechanisms
exploration need to framework
The goal ofthat takes into
proposing account shortcomings
declarative languages and of systems
existing
make provisions for real-time analysis of the incoming Twit- systems.
for querying tweets is to put forward a set of primitives
Data models are not discussed in detail in most studies,
ter stream and produce suitable visualizations of detected or an interface for analysts to conveniently query specific
as a simple data model is sufficient to conduct basic form
incidents. interactions on Twitter exploring the user, time, space and
of analysis. When standard tweets are collected, flat files
[11, 72] is the preferred choice. Several studies that cap-
4. LANGUAGES FOR QUERYING TWEETS
topical dimensions. High level languages for querying tweets
3.5 Data Model and Storage Mechanisms
ture the social relationships [15, 19] of the Twittersphere, The goal
extend of proposing
capabilities declarative
of existing languages
languages such andas SQLsystems
and
Data models
employs are not discussed
the relational data model in detail
but doinnot most studies,
necessarily for querying
SPARQL. tweets
Queries areiseither
to put forward
executed on athe setTwitter
of primitives
stream
as a simple
store data model inis asufficient
the relationships to conductAs
graph database. basic form
a conse- or real-time
in an interface or on fora analysts to conveniently
stored collection of tweets. query specific
of analysis.
quence, manyWhen analyses standard
that can tweets are collected,
be performed flat files
conveniently interactions on Twitter exploring the user, time, space and
[11,
on a72]graph
is theare preferred
not capturedchoice. bySeveral studies that Only
these platforms. cap- 4.1 Generic
topical dimensions. Languages
High level languages for querying tweets
ture the social relationships [15, 19] of
TwitHoard [69] in their paper models co-occurrence of terms the Twittersphere, extend capabilities of existing languages such as SQL and
TweeQL [49] provides a streaming SQL-like interface to the
employs
as a graph thewithrelational
temporallydata model
evolving butproperties.
do not necessarily
Twitter SPARQL. Queries are either executed on the Twitter stream
Twitter API and provides a set of user defined functions
store
Zombie the[15]relationships
and TwitHoard in a graph database.
[69] should As a conse-
be highlighted for in real-time or on a stored collection of tweets.
(UDFs) to manipulate data. The objective is to introduce
quence, many
capturing analyses including
interactions that can be theperformed
retweets and conveniently
term as- a query language to extract structure and useful informa-
on a graphapart
sociations are from
not captured by these
the traditional platforms. social
follower/friend Only 4.1 Generic Languages
tion that is embedded in unstructured Twitter data. The
TwitHoard [69] in their paper models co-occurrence
relationships. TrendMiner [61] draws explicit discussion on of terms TweeQL
language [49] provides
exploits botharelational
streamingand SQL-like
streaminginterface to the
semantics.
as a graph
making with temporally
provisions for processing evolving
millionsproperties.
of data and Twitter
takes Twitter API and provides a set of user
UDFs allow for operations such as location identification, defined functions
Zombie
advantage [15]of and
Apache TwitHoard [69] should be
Hadoop MapReduce highlighted
framework to per-for (UDFs)processing,
string to manipulate data. prediction,
sentiment The objective is to entity
named introduceex-
capturing interactions including the retweets
form distributed processing of the tweets stored as key-value and term as- a query language
traction and eventtodetection.
extract structure
In the spirit and ofuseful informa-
streaming se-
sociations
pairs. apart from
CoalMine the traditional
[72] also has Apachefollower/friend
Hadoop at thesocial core tion thatitisprovides
mantics, embedded SQLinconstructs
unstructured Twitteraggregations
to perform data. The
relationships. TrendMiner
of their batch processing [61] drawsresponsible
component explicit discussion
for efficient on language exploits both
over the incoming stream relational and streaming
on a user-specified time semantics.
window.
making provisions
processing of large for processing
amount of data.millions of data and takes UDFs allow for operations such as location
The result of a given query can be stored in a relational identification,
advantage of Apache Hadoop MapReduce framework to per- string processing,
fashion for subsequent sentiment
querying. prediction, named entity ex-
3.6 distributed
form Supportprocessing for Visualization
of the tweetsInterfaces
stored as key-value traction
Models for andrepresenting
event detection. In the network
any social spirit of in streaming
RDF have se-
pairs. CoalMine [72] also has Apache Hadoop at the core mantics, it provides SQL constructs to perform aggregations
There are many platforms designed with integrated tools been proposed by Martin and Gutierrez [50] allowing queries
of their batch processing component responsible for efficient over the incoming stream on a user-specified
predominantly for visualization, to analyse data in spatial, in SPARQL. The work explores the feasibilitytime window.
of adoption
processing of large amount of data. The result of bya given query cantheir be stored in aanrelational
temporal and topical perspectives. One tool is tweetTracker of this model demonstrating idea with illustra-
[44], which is designed to aid monitoring of tweets for hu- fashion for subsequent
tive prototype but doesquerying.
not focus on a single social network
3.6 Support for Visualization Interfaces
manitarian and disaster relief. TweetXplorer [54] also pro- Models
like for representing
Twitter in particular. anyTwarQL
social network in RDF
[53] extracts have
content
There are many
vides useful platforms
visualization designed
tools to explorewithTwitter
integrateddata.toolsFor been
from proposed
tweets and byencodes
Martin it andin Gutierrez
RDF format [50]using
allowing queries
shared and
predominantly
a particular campaign,for visualization, to analyse
visualizations data in spatial,
in tweetXplorer help in SPARQL.
well The work explores
known vocabularies (FOAF,the feasibility
MOAT, SIOC) of adoption
enabling
temporal
analysts to andviewtopical perspectives.
the data One tool
along different is tweetTracker
dimensions; most of this model
querying by demonstrating
in SPARQL. The extraction theirfacility
idea with an illustra-
processes plain
[44], which is designed to aid monitoring
interesting days in the campaign (when), important of tweets forusers hu- tive prototype but does not focus on
tweets and expands its description by adding sentiment a single social networkan-
manitarian
and and disaster
their tweets (who/what) relief.
and TweetXplorer
important locations[54] alsoinpro- the like Twitter
notations, in particular.
DBPedia entities, TwarQL [53] extracts
hashtag definitions andcontent
URLs.
vides
dataset useful visualization
(where). Systemstools to explore[48],
like TwitInfo Twitter data. For
Twitcident [8] from tweets and encodes
The annotation of tweets it in RDFdifferent
using format using shared and
vocabularies en-
a
andparticular
Torettorcampaign,
[65] also providevisualizations
a suiteinoftweetXplorer
visualisationhelp ca- well
ables querying and analysis in different dimensionsenabling
known vocabularies (FOAF, MOAT, SIOC) such as
analysts
pabilitiestotoview explorethe tweets
data along different
in different dimensions;
dimensions most
relating querying
location, in SPARQL.
users, sentimentThe andextraction
relatedfacility
namedprocesses
entities. plain
The
interesting days in the campaign (when),
to specific applications like fighting fire and detecting earth- important users tweets and expands its description by
infrastructure of TwarQL enables subscription to a stream adding sentiment an-
and theirWeb-mashups
quakes. tweets (who/what) and important
like Trendsmap [5] andlocations
Twitalyzer in the[6] notations,
that matches DBPedia
a given entities,
query and hashtag
returnsdefinitions
streamingand URLs.
annotated
dataset
provide (where).
a web interface Systems and like TwitInfobusiness
enterprise [48], Twitcident
solutions [8] to The
data annotation
in real-time.of tweets using different vocabularies en-
and
gain Torettor
real-time [65] trend also
andprovide
insightsa ofsuite
userofgroups.
visualisation ca- ables querying
Temporal and analysis
and topical features inare
different dimensions
of paramount such as
importance
pabilities
Table to explorean
1 illustrates tweets in different
overview of relateddimensions
approaches relatingand location,
in users,microblogging
an evolving sentiment and related
stream likenamed
Twitter. entities.
In the The
lan-
to specific
features of applications
different platforms.like fighting fire and detecting
Pre-processing in Tableearth-1 in- infrastructure
guages above, time of TwarQL
and topic enables subscription
of a tweet (topic can to be
a stream
repre-
quakes.
dicates ifWeb-mashups
any form of language like Trendsmap
processing [5] and
tasksTwitalyzer
such as POS [6] that
sentedmatches
simplyaby given query and
a hashtag) arereturns
considered streaming
meta-dataannotated
of the
provide
tagging or a web interface and
normalization enterprise business
are conducted. Information solutions
extrac- to data
tweetinandreal-time.
is not treated any different from other metadata
gain real-time trend and insights of user
tion refers to the types of post processing performed to infergroups. Temporal
reported. and Topics topical features are
are regarded of paramount
as part of the tweet importance
content
Table 1 illustrates
additional information, an overview of relatedorapproaches
such as sentiment named entities and in an
or whatevolving
drivesmicroblogging
the data filtering streamtasklikefrom Twitter. In theAPI.
the Twitter lan-
features of different
(NEs). Multiple ticksplatforms. Pre-processing
() correspond to a task in that
Tableis 1car-
in- guages above,
There have beentime and topic
efforts of a tweet
to exploit features (topic
that cangobe repre-
well be-
dicates if any form of language processing
ried out extensively. In addition to collecting tweets, some tasks such as POS sented
yond asimply
simpleby a hashtag)
filter based on aretimeconsidered
and topic. meta-data of the
Plachouras
tagging also
studies or normalization
capture the users are conducted.
social graph Information
while others extrac-
pro- tweetStavrakas
and and is not[59] treated
stressany thedifferent
need forfrom other modelling
temporal metadata
tion
pose refers
the need to theto types
regardof interactions
post processing performedretweets,
of hashtags, to infer reported.
of terms inTopicsTwitter aretoregarded
effectively as part
captureof the tweet content
changing trends.
additional
mentions asinformation, such as sentiment
separate properties. Backend or datanamed
models entities
sup- or what refers
A term drives to theany
data filtering
word task phrase
or short from the of Twitter
interest API.
in a
(NEs). by
ported Multiple ticks ()
the platform shapecorrespond
the typestoofaanalysis
task that thatis car-
can There have
tweet, been efforts
including hashtags to orexploit
output features
of anthat entitygo recogni-
well be-
ried out extensively.
be conveniently doneIn onaddition to collecting
each framework. Fromtweets,
the some
sum- yond a simple Their
tion process. filter proposed
based on query time and topic. can
operators Plachouras
express
studiesinalso
mary Tablecapture
1, wethe canusers social
observe graph
that each while
study others
on datapro- and Stavrakas
complex queries[59] forstress the need
associations for temporal
between terms over modelling
vary-
pose the need to regard interactions of hashtags, retweets, of terms in Twitter to effectively capture changing trends.
mentions as separate properties. Backend data models sup- A term refers to any word or short phrase of interest in a
ported by the platform shape the types of analysis that can tweet, including hashtags or output of an entity recogni-
be conveniently done on each framework. From the sum- tion process. Their proposed query operators can express
mary in Table 1, we can observe that each study on data complex queries for associations between terms over vary-

SIGKDD Explorations Volume 16, Issue 1 Page 14


Table 1: Overview of related approaches in data management frameworks.
Prepossessing Examples of Social and/or other Data Store
extracted information interactions captured?
TwitterEcho [16]  Language Yes Not given
Byun et al. [19] Location Yes Relational
Twitter Zombie [15] Table 1: Overview  of related approaches in data management Yesframeworks. Relational
TwitHoard [69] Prepossessing
 Examples of Social and/or Yes other Data
GraphStoreDB
CoalMine [72] extracted information interactions Nocaptured? Files
TwitterEcho
TrendMiner [61] [16]  Location, Language
Sentiment, NEs Yes
No Not given
Key-value pairs
Byun et
TwitIE [17] al. [19]  Location
Language, Location, NEs Yes
No Relational
Not given
Twitter
ESA [76]Zombie [15] 
 Location, NEs Yes
No Relational
Not given
TwitHoard
Baldwin et al. [11][69]  Language, Location Yes
No Graph DB
Flat files
CoalMine [72] No Files
TrendMiner [61]  Location, Sentiment, NEs No Key-value pairs
ing time TwitIE [17] to discover context
granularities,  of collected Language, data.Location,
graphNEs database to demonstrate No Not given
the feasibility of the model.
ESA [76] 
Operators also allow retrieving a subset of tweets satisfying Location, NEs
languages that operate on the twitter Not
No given
stream like TweeQL
Baldwin
these complex et al. [11]
conditions 
on term associations. ThisLanguage,
enables Location
and TwarQL generates No the output inFlat files TweeQL
real-time;
the end user to select a good set of terms (hashtags) that [49] allows the resulting tweets to be collected in batches
drive the data collection which has a direct impact on the then stores them in a relational database, while TwarQL [53]
ing timeofgranularities,
quality to discover
the results generated fromcontext of collected data.
the analysis. graph
at the database
end of the to information
demonstrateextractionthe feasibility
phase,of the model.
annotated
Operators
Spatial featuresalso allow retrieving
are another a subsetofoftweets
property tweetsoftensatisfying
over- languages
tweets are that encodedoperate in RDF.on the twitter stream like TweeQL
these
lookedcomplex
in complex conditions
analysis. on term associations.
Previously discussed This
workenables
uses and TwarQL generates the output in real-time; TweeQL
the
the end user attribute
location to select aasgood set of terms
a mechanism to (hashtags)
filter tweets that
in 4.3 Query
[49] allows the Languages
resulting tweets fortoSocial Networks
be collected in batches
drive the data collection which has
space. To complete our discussion we briefly outline twoa direct impact on the then stores them in a relational database,
To the best of our knowledge, there is no existing work while TwarQL [53]
quality
studies of that theuseresults generated
geo-spatial from thetoanalysis.
properties perform complex at the end of the information extraction
focusing on high level languages operating on the Twit- phase, annotated
Spatial
analysisfeatures
using theare another
location propertyDoytsher
attribute. of tweetsetoftenal. [31]over-
in- tweetssocial
ters are encoded
graph. inHowever RDF. it is important to note pro-
looked
troduced in acomplex
model and analysis.
queryPreviously
language suiteddiscussed work uses
for integrated posals for declarative query languages tailored for querying
the
datalocation
connecting attribute
a socialasnetworka mechanism
of users to withfilter tweetsnet-
a spatial in 4.3 Query Languages
social networks in general [10,for Social
30, 32, 50, 51,Networks
64]. One of the
space.
work to To complete
identify placesour discussion
visited we briefly
frequently. Edges outline
named life-two To the supported
queries best of ourareknowledge, path queries there is no existing
satisfying a set of work
con-
studies
patternsthat are useusedgeo-spatial
to associate properties
the social to perform
and spatial complex
net- focusing
ditions ononthe high
path, levelandlanguages
the languages operating on the
in general takeTwit-ad-
analysis
works. using Differentthe location attribute. Doytsher
time granularities et al. [31] for
can be expressed in- ters social
vantage graph. properties
of inherent However it is important
of social networks. to Semantics
note pro-
troduced
each visited a model
location andrepresented
query language by the suited for integrated
life-pattern edge. posals
of the for declarative
languages are query
based languages
on Datalog tailored for querying
[51], SQL [32, 64]
data
Even connecting
though thea implementation
social network ofemploys users with a spatial syn-
a partially net- social networks in general
or RDF/SPARQL [10, 30, 32, 50, 51,
[50]. Implementations 64].
are One of the
conducted on
work
thetictodataset,
identifyitplaceswill be visited frequently.
interesting Edges named
to investigate how life-
the queries supported
bibliographical are path
networks [32],queries
Facebooksatisfying a set ofsocial
and evolving con-
patterns
socio-spatial are networks
used to associate
and the the social and
life-pattern edgesspatial
that net-are ditions
content on sites thelike
path, Yahoo!and the
Travel languages
[10] andinare general take ad-
not tested on
works. Differentthe
used to associate time granularities
spatial and socialcan be expressed
networks can be rep- for vantage of
Twitter inherenttaking
networks propertiesTwitterof social networks.
specific Semantics
affordances into
each
resentedvisited
in a location
real socialrepresented
network dataset by thewith life-pattern edge.
location infor- of the languages are based on Datalog [51], SQL [32, 64]
consideration.
Even
mation, though
such the implementation
as Twitter. GeoScope employs a partially
[18] finds informationsyn- or RDF/SPARQL [50]. Implementations are conducted on
thetic
trends dataset,
by detecting it will be interesting
significant to investigate
correlations how the
among trending 4.4 Information
bibliographical networks Retrieval
[32], Facebook - Tweet Searchsocial
and evolving
socio-spatial
location-topicnetworkspairs in aand the life-pattern
sliding window. This edges
givesthatriseare
to content sites like Yahoo! Travel [10] and are not tested on
used to associateofthe spatial and Another class of systems presents textual queries to effi-
the importance capturing the social
notionnetworks
of spatial caninforma-
be rep- Twitter networks taking Twitter specific affordances into
resented in ainreal social networkindataset with location infor- ciently search over a corpus of tweets. The challenges in this
tion trends social networks analysis tasks. Real-time consideration.
mation, area are similar to that of information retrieval in addition
detectionsuch as Twitter.
of crisis events from GeoScope
a location[18] in
finds information
space, exhibits
with having to deal with peculiarities of tweets. The short
trends by detecting significant correlations
the possible value of Geoscope. In one of the experiments among trending 4.4
length Information
of tweets in particular Retrieval - Tweet
creates added Search
complexity to
location-topic
Twitter is usedpairs as aincasea sliding
study window. This gives
to demonstrate rise to
its useful- Another
text-based class
searchof systems
tasks as presents
it is textual
difficult to queries relevant
identify to effi-
the
ness,importance
where a hashtag of capturing
is chosen the tonotion of spatial
represent informa-
the topic and ciently matching
tweets search over a a corpus
user query of[13,35].
tweets. Expanding
The challenges tweetin con-
this
tion trendswhich
city from in social networks
the tweet in analysis
originates chosen tasks. Real-time
to capture the area
tent are
is similar to
suggested as that
a way oftoinformation
enhance the retrieval
meaning. in The
addition
goal
detection
location. of crisis events from a location in space, exhibits with
of suchhaving
systems to deal with peculiarities
is to express of tweets. need
a users information The in short
the
the possible value of Geoscope. In one of the experiments
lengthof of
form a tweets text
simple in particular
query, much creates
like inadded
search complexity
engines, and to
4.2 Data Model for the Languages
Twitter is used as a case study to demonstrate its useful-
text-based
return a search
tweet list tasks
in as it is difficult
real-time with to identify
effective relevant
strategies for
ness, where a hashtag is chosen to represent the topic and
Relational, RDF the andtweet
Graphs are the chosen
most common choices tweets matching
ranking and relevance a user measurements
query [13,35]. Expanding
[33, 34, 71]. tweet con-
Indexing
city from which originates to capture the
of data representation. There is a close affiliation in these tent is suggested
mechanisms are as a way to
discussed in enhance
[23] as the meaning.
they directly The goal
impact ef-
location.
data models observing that, for instance, a graph can easily of suchretrieval
ficient systems of is to express
tweets. The a users
TRECinformation
micro-bloggingneed track
in the5
correspond to a set of RDF triples or vice versa. In fact, form
is of a simple
dedicated text query,
to calling much like
participants to in searchreal-time
conduct engines, and ad-
4.2 Data Model
some studies for theand
like Plachouras Languages
Stavrakas [59] have put return
hoc searcha tweettaskslist overin areal-time
given tweet withcollection.
effective strategies
Publications for
Relational, RDF and Graphs are the most
forward their data model as a labeled multi digraph and have common choices ranking
of TREC and[56],relevance
documents measurements
the findings [33,
of 34,
all 71].
systems Indexing
in the
of data arepresentation.
chosen relational database Therefor is its
a close affiliation in None
implementation. these mechanisms
task of ranking are discussed
the most in [23] as tweets
relevant they directly
matching impact
a pre- ef-
5
data models
of these query observing
systems that, models for Twitter
instance,social
a graph can easily
network with ficient
definedretrieval
set of user of tweets.
queries.The TREC micro-blogging track
correspond
following or to a set relationships
retweet of RDF triples or vice
among versa.
users. In fact,
Doytsher et is dedicated
Table to calling
2 illustrates participants
an overview to conduct
of related real-time
approaches ad-
in sys-
some
al. [31]studies
implement like Plachouras
their algebraic and query
Stavrakas [59] have
operators with putthe hoc
temssearch tasks over
for querying a given
tweets. Data tweet collection.
models Publications
and dimensions in-
forward
use of both their dataand
graph model as a labeled
a relational multi digraph
database and have
as the underlying of TREC [56], documents the findings of all systems in the
chosen a relational
data storage. They database
experimentallyfor its compare
implementation.
relationalNone and task
5 of ranking the most relevant tweets matching a pre-
http://trec.nist.gov/
of these query systems models Twitter social network with defined set of user queries.
following or retweet relationships among users. Doytsher et Table 2 illustrates an overview of related approaches in sys-
al. [31] implement their algebraic query operators with the tems for querying tweets. Data models and dimensions in-
use of both graph and a relational database as the underlying
5
data storage. They experimentally compare relational and http://trec.nist.gov/

SIGKDD Explorations Volume 16, Issue 1 Page 15


Table 2: Overview of approaches in systems for querying tweets.
Data Model Explored dimensions
Relational RDF Graph Text Time Space Social Network Real-Time
TweeQL [49]     Yes
TwarQL [53]     Yes
Plachouras et al. [60] Table 2: Overview  of approaches in systems for
 querying tweets. No
Doytsher et al. [31]  Data Model   Explored
 dimensions
 No
GeoScope et al. [18] Relational RDF Graph Text  Time
 Space
  Social Network Real-Time Yes
TweeQL [49]
Languages on social networks 
     
  Yes
No
TwarQL [53]
Tweet search systems    
   Yes
Yes
Plachouras et al. [60]    No
Doytsher et al. [31]      No
GeoScope et al. [18]
vestigated in each system are depicted

in the Table 2. Sys-    
mation extraction considering the inherent peculiarities of Yes
tems Languages
that have made on social networks
provision


for the real-time 
streaming  tweets and not all  frameworks  we discussed provided No this
nature Tweet
of thesearch
tweetssystems
are indicated in the Real-time  column. 
functionality.  Pre- processing components  Yes
for example, nor-
Multiple ticks () correspond to a dimension explored in de- malization and tokenization should be implemented, with
tail. Note that the systems marked with an asterisk (*) are the option for the end users to customize the modules to suit
vestigated
not implementedin eachspecifically
system aretargeting
depicted tweets,
in the Table
though 2. their
Sys- mation extraction considering
their requirements. Informationthe inherentmodules
extraction peculiarities
such as of
tems that have
application made provision
is meaningful and can forbethe real-time
extended to streaming
the Twit- tweets and
location not all frameworks
prediction, we discussedand
sentiment classification provided
named this en-
nature
tersphere.of the
Wetweets
observe are indicated
potential forin the Real-time
developing column.
languages for functionality.
tity recognition Pre-
areprocessing componentsanalysis
useful in conducting for example,
on tweetsnor-
Multiple ticks () correspond to a dimension
querying tweets that include querying by dimensions that explored in de- malization and tokenization should
and attempt to derive more information from plain tweet be implemented, with
tail.
are notNote that thebysystems
captured existing marked
systems, with an asterisk
especially the(*) are
social the option
text and theirfor the end usersIdeally,
metadata. to customize
a userthe modules
should be ableto suit
to
not
graph.implemented specifically targeting tweets, though their their
integraterequirements.
any combination Information
of the extraction
components modules
into theirsuch ownas
application is meaningful and can be extended to the Twit- location
applications. prediction, sentiment classification and named en-
tersphere. We observe potential for developing languages for tity recognition are useful in conducting analysis on tweets
Data model:toMuch of the literature presented
from (Section 3.5
5. THE
querying NEED
tweets that FORinclude INTEGRATED
querying by dimensions SOLU- that and attempt
and 4.2) does
derive
not
more
emphasize
information
or draw explicit
plain tweet
discussions
are not captured by existing systems, especially the social text and their metadata. Ideally, a user should be able to
graph. TIONS on the data
integrate anymodel in use. ofThe
combination the logical data into
components model theirgreatly
own
There is a need to assimilate individual efforts with the goal influences
applications. the types of analysis that can be done with rela-
of providing a unified framework that can be used by re- tive ease on collected data. A physical representation of the
Data model:involvingMuch of the literature andpresented (Section 3.5
5. THEandNEED
searchers FOR across
practitioners INTEGRATED
many disciplines. SOLU- Inte- model
and
of large4.2)volumes
suitable
does notof emphasize
indexing
data is an or
storage mechanisms
draw explicit
important discussions
consideration for
grated solutions should ideally handle the entire workflow
TIONS
of the data analysis life cycle from collecting the tweets to
on the data model in use. The logical
efficient retrieval. We notice that current research pays data model greatly
lit-
There is a need to assimilate influences theto types of analysis that interactions,
can be done with rela-
presenting the results to theindividual
user. Theefforts with we
literature the have
goal tle attention queries on Twitter the social
of providing a unifiedsectionsframework that efforts
can bethat usedsupport
by re- tive
graph ease
in on collected A
particular. data.
graphA physical
view of representation
the Twittersphere of the is
reviewed in previous outlines
searchers and practitioners across model involving suitableand indexing and storage
great mechanisms
different parts of the workflow. In many disciplines.
this section, Inte-
we present consistently overlooked we recognize potential in
grated solutions of
thislarge
area.volumes of data is an important consideration for
our position withshouldthe aim ideally handle the
of outlining entire workflow
significant compo- The graph construction on Twitter is not lim-
of the ofdata analysis life cycle from collecting the tweets to efficient retrieval. We notice that current
ited to considering the users as nodes and links as follow- research pays lit-
nents an integrated solution addressing the limitations of
presenting the results to the user. The literature we have tle attention
ing relationships;to queries
embracing on Twitter
useful interactions,
characteristics thesuchsocial
as
existing systems.
reviewed in previous sections outlines efforts that support graph
retweet, in mention
particular. and Ahashtag
graph view of the Twittersphere
(co-occurrence) networks is
in
According to a review of literature conducted on the mi-
different parts of the workflow. In this section, we present consistently
the data modeloverlooked
will createandopportunity
we recognize to great
conduct potential
complex in
croblogging platform [25], the majority of published work
our positionconcentrates
with the aim this area.on Thethesegraph construction on ofTwitter
tweets.is We not envi-
lim-
on Twitter on of theoutlining
user domain significant
and the compo-
mes- analysis structural properties
nentsdomain.
of an integrated ited to considering the users as nodes and links as follow-
sage The usersolutiondomain addressing the limitations
explores properties of Twit- of sion a new data model that proactively captures structural
existing ing relationships; embracing useful characteristics such as
ter userssystems.
in the microblogging environment while the mes- relationships taking into consideration efficient retrieval of
According to a review of literature conducted on the mi- retweet,
relevant mention
data to and hashtag
perform the (co-occurrence) networks in
queries.
sage domain deals with properties exhibited by the tweets
croblogging platform [25], the majority the data model will create opportunity to conduct complex
themselves. In comparison to the extentofofpublished
work done workon Query
on Twitter concentrates on the user domain and the mes- analysis language: Languages
on these structural described
properties in Section
of tweets. We4,envi-
de-
the microblogging platform, only a few investigates the de- fines both simple operators to be appliedcaptures
on tweets and ad-
sage domain. The user domain explores properties of Twit- sion a new data model that proactively structural
velopment of data management frameworks and query lan- vanced operators that extract complex patterns, which can
ter users in the microblogging environment while the mes- relationships taking into consideration efficient retrieval of
guages that describe and facilitate processing of online so- be manipulated in different types of applications. Some lan-
sage domain deals with properties exhibited by the tweets relevant data to perform the queries.
cial networking data. In consequence, there is opportunity guages provide support for continuous queries on the stream
themselves.
for improvement In comparison
in this areatoforthe extent
future of work
research done on
addressing Query
or queries language:
on a storedLanguages
collection, described
while others in offer
Section 4, de-
flexibility
the microblogging
the challenges in dataplatform, only a few
management. Weinvestigates the de-
elicit the following finesboth.
for both Thesimple operators
advent to be view
of a graph applied on tweets
makes crucialand ad-
contri-
velopment of data management
high-level components and envisage frameworks
a platform andfor
query lan-
Twitter vanced
butions operators
in analyzing that theextract complex allowing
Twittersphere patterns,uswhich can
to query
guages
that that describe
encompasses suchand facilitate processing of online so-
capabilities: be manipulated
twitter data in in different
novel types offorms.
and varying applications.
It willSome lan-
be inter-
cial networking data. In consequence, there is opportunity guagesto
esting provide support
investigate how fortypical
continuous queries on
functionality the
[73] stream
provided
Focused crawler:
for improvement Responsible
in this area for for retrieval
future researchandaddressing
collection
or
by queries
graph queryon a stored
languagescollection,
can be while otherstooffer
adapted flexibility
Twitter net-
of Twitter
the challengesdata in bydata crawling
management.the publiclyWe elicitaccessible Twit-
the following for both.
works. In The adventquery
developing of a graph view makes
languages, one crucial
could contri-
investigate
ter APIs. A focused crawler should
high-level components and envisage a platform for Twitter allow the user to de-
butions in analyzing
the distinction between the real-time
Twittersphere allowing
and batch modeus to query
process-
fine
that aencompasses
campaign with such suitable
capabilities:filters, monitor output and
twitter data in novel and varying forms.
ing. Visualizing the data retrieved as a result of a query It will be inter-
iteratively crawl Twitter for large volumes of data until its
Focused crawler: Responsible for retrieval and collection esting
in a to investigate
suitable manner how
is typical
also an functionality
important [73] provided
concern. A set of
coverage of relevant tweets is satisfactory.
of Twitter data by crawling the publicly accessible Twit- by graph query
pre-defined output languages
formats can will bebe useful
adapted to Twitter
in order to providenet-
Pre-processor:
ter APIs. A focused As highlighted
crawler should in Section
allow the 3.2,user
thistostagede- works.
an In developing
informative query languages,
visualization over a mapone forcould
a queryinvestigate
that re-
usually compriseswith
fine a campaign of modules
suitablefor pre-processing
filters, monitor output and infor-and the distinction between real-time and batch mode process-
iteratively crawl Twitter for large volumes of data until its ing. Visualizing the data retrieved as a result of a query
coverage of relevant tweets is satisfactory. in a suitable manner is also an important concern. A set of
pre-defined output formats will be useful in order to provide
Pre-processor: As highlighted in Section 3.2, this stage an informative visualization over a map for a query that re-
usually comprises of modules for pre-processing and infor-

SIGKDD Explorations Volume 16, Issue 1 Page 16


turns an array of locations. Another interesting avenue to characters, which make use of informal language, undoubt-
explore is the introduction of a ranking mechanism on edly making a simple task of POS tagging more challeng-
the query result. Ranking criteria may involve relevance, ing. Besides the length limit, heavy and inconsistent usage
timeliness or network attributes like the reputation of users of abbreviations, capitalizations and uncommon grammar
in the case of a social graph. Ranking functions are a stan- constructions pose additional challenges to text processing.
turns an array of in
dard requirement locations. Another
the field of interesting
information avenue
retrieval to
[23,39] characters,
With the volume which of make tweets usebeing
of informal
orders language,
of magnitude undoubt-
more
explore is the introduction of a ranking mechanism
and studies like SociQL [30] report the use of visibility and on edly making a simple task of
than news articles, most of the conventional methodsPOS tagging more challeng-
cannot
the query result.
reputations metricsRanking
to rank criteria may involve
results generated fromrelevance,
a social ing.directly
be Besidesapplied
the length to the limit,
noisyheavy genreand inconsistent
of Twitter data.usageAny
timeliness or network
graph. A query languageattributes
with a like
graphtheview
reputation of users
of the Twitter- of abbreviations,
effort that uses Twitter capitalizations
data needsand uncommon
to make grammar
use of appropri-
in the case
sphere alongofwith
a social graph. Ranking
capabilities functionsand
for visualizations are ranking
a stan- constructions
ate twitter-specific pose additional
strategies to challenges
pre-process to text
text processing.
addressing
dardcertainly
will requirement in the
benefit thefield of information
upcoming retrieval
data analysis [23,39]
efforts of With
the the volume
challenges of tweets
associated being
with ordersproperties
intrinsic of magnitude more
of tweets.
and studies like SociQL [30] report the use of visibility and
Twitter. than newsinformation
Similarly, articles, most of the conventional
extraction from tweetsmethods cannot
is not straight-
reputations metrics
Here, we focused on to
therank
key results generated
ingredients from
required foraasocial
fully be directly
forward as it applied to the
is difficult tonoisy
derivegenre contextof Twitter
and topics data. fromAny a
graph.
developed A query language
solution with a improvements
and discussed graph view of the Twitter-
we can make effort that is
tweet that uses Twitter data
a scattered part of needs to make useThere
a conversation. of appropri-
is sep-
sphere alongliterature.
on existing with capabilities for visualizations
In the next and ranking
section, we identify chal- ate
aratetwitter-specific strategies to
literature on identifying pre-process
entities text addressing
(references to organi-
will certainly
lenges benefitissues
and research the upcoming
involved. data analysis efforts of the challenges
zations, places,associated
products, persons) with intrinsic properties
[47,63], languages of [20,36],
tweets.
Twitter. Similarly,
sentiment information
[57] present extraction
in the tweetfrom texttweets is not source
for a richer straight- of
Here, we focused on the key ingredients required for a fully forward
information.as it is difficult istoanother
Location derive context and topics
vital property from a
represent-
6. CHALLENGES
developed ANDimprovements
solution and discussed RESEARCH we ISSUES
can make tweet
ing that isfeatures
spatial a scattered either partof ofthe a conversation.
tweet or of the There
user.is sep-
The
on
To existing
completeliterature. In the next
our discussion, section,
in this sectionweweidentify chal-
summarize arate
locationliterature
of eachon identifying
tweet may beentities optionally (references
recorded to iforgani-
using
lenges
key and research
research issues
issues in datainvolved.
management and present tech- zations, places, products,
a GPS-enabled device. Apersons) user can [47,63], languages
also specify his[20,36],
or her
nical challenges that need to be addressed in the context of sentiment
location as[57] present
a part of the in theuser tweet
profile textand
for isa richer
often source
reported of
building a data analytics platform for Twitter. information. Location is another
in varying granularities. The drawback is that only a smallvital property represent-
6. CHALLENGES AND RESEARCH ISSUES ing spatial
portion features
of about 1% either of the tweet
of the tweets or of the [24].
are geo-located user. Since
The
6.1complete
To Data Collection Challenges
our discussion, in this section we summarize location of eachalways
analysis almost tweet may requires be optionally
the location recorded
property, if using
when
key research issues in data management and present tech- a GPS-enabled
absent, device. A
studies conduct userown
their canmechanisms
also specifytohisinfer or her
lo-
Once a suitable Twitter API has been identified, we can
nical challenges that need to be addressed in the context of locationofas
cation the a part
user, of the user
a tweet, or profile
both. and There is often
are two reported
major
define a campaign with a set of parameters. The focused
building a data analytics platform for Twitter. in varying granularities.
approaches for location The drawback
prediction: is thatanalysis
content only a small with
crawler can be programmed to retrieve all tweets matching
the query of the campaign. If a social graph is necessary, portion of about
probabilistic 1% of the
language modelstweets [24,are27,geo-located
37] or inference[24]. Since
from
6.1 Data Collection Challenges
separate modules would be responsible to create this net- analysis
social and almost
otheralways
relations requires
[22, 29,the 67].location property, when
Once iteratively.
work a suitable Exhaustively
Twitter API crawling
has beenallidentified, we can
the relationships absent, studies conduct their own mechanisms to infer lo-
define a campaign with a set of parameters.
between Twitter users is prohibitive given the restrictions The focused 6.3
cation Data
of the Management
user, a tweet, or Challenges both. There are two major
crawler
set by the canTwitter
be programmed
API. Hence to itretrieve all tweets
is required for thematching
focused approaches for location prediction: content analysis with
In Section 3.5 and Section 4.2 we outlined several alterna-
the query
crawler to of the campaign.
prioritize If a social
the relationships to graph is necessary,
crawl based on the probabilistic language models [24, 27, 37] or inference from
tive approaches in literature for a data model to charac-
separateand
impact modules wouldofbespecific
importance responsible
Twitter to accounts.
create thisInnet- the social and other relations [22, 29, 67].
terise the Twittersphere. The relational and RDF models
work iteratively.
case that Exhaustively
the platform handlescrawling
multipleallcampaigns
the relationships
in par- are frequently chosen while graph-based models are acknowl-
between
allel, there Twitter users to
is a need is optimize
prohibitive thegiven
accessthetorestrictions
the API. 6.3 Data Management Challenges
edged, however not realized concretely at the implementa-
set by the Twitter API. Hence itofisarequired
crawler for the focused In
Typically, the implementation should aim to tion phase.3.5
Section WaysandinSectionwhich 4.2 we we canoutlined
apply graphseveral dataalterna-
man-
crawler
minimizetothe prioritize
number theofrelationships
API requests, to considering
crawl based the on the
re- tive approaches in literature for a diverse
data model to charac-
agement in Twitter are extremely and interesting;
impact
strictions, while fetching data for many campaigns in In
and importance of specific Twitter accounts. the terise thetypes
Twittersphere.
paral- different of networksThe canrelational
be constructedand RDF apart models
from
case that the
lel. Hence platform
building handles multiple
an effective crawling campaigns
strategy is in par-
a chal- are frequently chosen while graph-based models are acknowl-
the traditional social graph as outlined in Section 5. With
allel, there is a need to optimize the access to the API. edged, however not realized
lenging task, in order to optimize the use of API requests the advent of a graph view to concretely at the implementa-
model the Twittersphere, gives
Typically, the implementation of a crawler should aim to tion phase. Ways in which
available. rise to a range of queries thatwe cancan be apply
performedgraph ondata man-
the struc-
minimize
Appropriate thecoverage
number of ofthe
API requests,is considering
campaign the re-
another significant agement intweets
Twitter are extremely diverse and range
interesting;
ture of the essentially capturing a wider of use
strictions, while fetching data for many campaigns in paral- different typesused of networks
concern and denotes whether all the relevant information has case scenarios in typicalcan data be analytics
constructed apart from
tasks.
lel. Hence building an effective crawling strategy is to adefine
chal- the traditional
been collected. When specifying the parameters As discussed in social
Sectiongraph as outlined
4.3, there in Section
are already languages5. Withsim-
lenging task, ina order
the campaign, to optimize
user needs a very the good useknowledge
of API requestson the the advent of a graph view to model the Twittersphere, gives
ilar to SQL adapted to social networks. Many of the tech-
available. rise to ainrange of queries
relevant keywords. Depending on the specified keywords, niques literature are that
for the can case
be performed
of genericonsocialthe struc-
net-
Appropriate
a collection coverage
may missofrelevant
the campaign tweetsis inanother
addition significant
to the ture of the tweets essentially capturing a wider range of use
works under a number of specific assumptions. For example
concern and denotes whether all theby relevant
APIs. information has casesocial
scenarios used satisfy
in typical data analytics
tweets removed due to restrictions Plachouras and the networks properties such astasks.
the power law
been collected.
Stavrakas work [60]Whenis anspecifying
initial stepthe parameters
in this directiontoasdefineit in- As discussed in Section 4.3,small
therediameters
are already languages sim-
distribution, sparsity and [52]. We envision
the campaign,
vestigates a user of
this notion needs a very
coverage andgood knowledge
proposes on the
mechanisms ilar to SQL adapted to social networks. Many of the tech-
queries that take another step further and executes on Twit-
relevant keywords. Depending on the specified keywords, niques in literature are forlanguages
the case FQL of generic social
to automatically adapt the campaign to evolving hashtags. ter graphs. Simple query [1] and YQLnet- [7]
a collection may miss relevant tweets in addition to the works
provideunderfeaturesa number
to explore of specific
properties assumptions.
of Facebook For example
and Ya-
6.2 Pre-processing
tweets Challenges
removed due to restrictions by APIs. Plachouras and the social
hoo APIs but networks satisfy
are limited toproperties
querying only suchpart(usually
as the power law
a sin-
Stavrakas work [60] is an initial step in this direction as it in- distribution, sparsity and small diameters [52]. We
Many problems associated with summarization, topic de- gle users connections) of the large social graph. As envision
pointed
vestigates this notion of coverage and proposes mechanisms queries thatreport
take another
tection and part-of-speech (POS) tagging, in the case of out in the on the step furtherand
Databases and Web
executes2.0 onPanelTwit-at
to automatically adapt the campaign to evolving hashtags. ter graphs. Simple query languages FQL [1] trust,
and YQL [7]
well-formed documents, e.g. news articles, have been ex- VLDB 2007 [9], understanding and analyzing author-
tensively studied in the literature. Traditional named entity provide features to
ity, authenticity, and explore
other properties
quality measures of Facebook and net-
in social Ya-
6.2 Pre-processing Challenges
recognizers (NERs) heavily depend on local linguistic fea- hoo APIs but are limited to querying
works pose major research challenges. While there are in- only part(usually a sin-
Many [62]
tures problems associated
of well-formed with summarization,
documents like capitalizationtopic andde- gle users connections)
vestigations on these quality of themeasures
large social graph. As
in Twitter, itspointed
about
tection
POS tagging and part-of-speech
of previous words. (POS) Nonetagging,
of the in the case of
characteristics out
timeinthatthewe report
enable ondeclarative
the Databases and Web
querying 2.0 Panel
of networks usingat
well-formed
hold for tweets documents,
with shorte.g. news articles,
utterances of tweetshave limitedbeen ex-
to 140 VLDB
such 2007 [9], understanding and analyzing trust, author-
measurements.
tensively studied in the literature. Traditional named entity ity, authenticity, and other quality measures in social net-
recognizers (NERs) heavily depend on local linguistic fea- works pose major research challenges. While there are in-
tures [62] of well-formed documents like capitalization and vestigations on these quality measures in Twitter, its about
POS tagging of previous words. None of the characteristics time that we enable declarative querying of networks using
hold for tweets with short utterances of tweets limited to 140 such measurements.

SIGKDD Explorations Volume 16, Issue 1 Page 17


One of the predominant challenges is the management of [3] Sparksee: Scalable high-performance graph database.
large graphs that inevitably results from modeling users, http://www.sparsity-technologies.com/.
tweets and their properties as graphs. With the large volume
of data involved in any practical task, a data model should [4] Titan: distributed graph database. http:
be information rich, yet a concise representation that en- //thinkaurelius.github.io/titan.
One
ables of the predominant
expression challenges
of useful queries. is theonmanagement
Queries graphs should of [3] Sparksee: Scalable high-performance graph database.
large graphs that inevitably results from modeling users, [5] TrendsMap, Realtime local twitter trends. http://
http://www.sparsity-technologies.com/.
be optimized for large networks and should ideally run in-
tweets and their trendsmap.com/.
dependent of theproperties
size of the as graphs.
graph. With Therethe large volume
arealready ap-
of data involved in any practical [4] Titan: distributed graph database. http:
proaches that investigate efficienttask, a data model
algorithms on veryshould large [6] Twitalyzer: Serious analytics for social business. http:
be information rich,Efficient
yet a concise representation //thinkaurelius.github.io/titan.
graphs [4143, 66]. encoding and indexingthat mecha-en-
//twitalyzer.com.
ables expression
nisms should be in of place
usefultaking
queries. intoQueries
accountonvariations
graphs shouldof in- [5] TrendsMap, Realtime local twitter trends. http://
be optimized
dexing systemsfor large proposed
already networks for andtweets
should [23]ideally run in-
and indexing [7] Yahoo! Query Language guide on YDN. https://
trendsmap.com/.
dependent
of graphs [75] of the size of the
in general. We graph.
need to There
consider arealready
maintaining ap- developer.yahoo.com/yql/.
proaches
indexes for that investigate
tweets, keywords, efficient
users,algorithms
hashtags for onefficient
very largeac- [6] Twitalyzer: Serious analytics for social business. http:
graphs [4143,
cess of data in 66].
advance Efficient
queries.encoding and situations
In certain indexing mecha-it may [8] F. Abel, C. Hauff, and G. Houben. Twitcident: fight-
//twitalyzer.com.
nisms
be should beto
impractical in store
place thetaking intoraw
entire account
tweetvariations
and the of in-
com- ing fire with information from social web streams. In
dexing
plete user systems
graphalready
and it proposed for tweetsto[23]
may be desirable andcompress
either indexing WWW, pages
[7] Yahoo! Query 305308, 2012.
Language guide on YDN. https://
of graphs
or [75] in general.
drop portions of the data. We It need to considertomaintaining
is important investigate developer.yahoo.com/yql/.
indexes for tweets,ofkeywords,
which properties tweets and users,
thehashtags for efficient
graph should be com-ac- [9] S. Amer-Yahia, V. Markl, A. Halevy, A. Doan,
cess of
pressed. data in advance queries. In certain situations it may G. Abel,
[8] F. Alonso,C. D. Kossmann,
Hauff, and G. and G. Weikum.
Houben. Databases
Twitcident: fight-
be impractical to challenges,
store the entire and Webwith
ing fire 2.0 panel at VLDB
information 2007.
from In SIGMOD
social Record,
web streams. In
Besides the above tweetsraw tweet
impose and the
general com-
research
plete volume
WWW, 37, pages
pages 4952,2012.
305308, Mar. 2008.
issuesuser graph
related to and it mayChallenges
big data. be desirable to either
should compress
be addressed
or dropsame
in the portions
spiritofasthe anydata.
other It big
is important
data analyticsto investigate
task. In
which [10]
[9] S. AmerYahia;,
Amer-Yahia,L. V. V. Lakshmanan;,
Markl, A. Halevy, and Cong A. Yu. So-
Doan,
the faceproperties
of challenges of tweets
posed and the graph
by large volumes should
of data be being
com-
cialScope
G. Alonso,: D. Enabling information
Kossmann, discoveryDatabases
and G. Weikum. on social
pressed.
collected, the NoSQL paradigm should be considered as an content
and Websites. In CIDR,
2.0 panel 2009.2007. In SIGMOD Record,
at VLDB
Besides
obvious the above
choice of challenges,
dealing with tweets
them. impose generalsolutions
Developed research
volume 37, pages 4952, Mar. 2008.
issues related
should to big for
be extensible data. Challenges
upcoming should beand
requirements addressed
should [11] T. Baldwin, P. Cook, and B. Han. A support platform
in the same
indeed spirit When
scale well. as anycollecting
other bigdata, dataaanalytics
user needtask. to con-In [10] for event detection
S. AmerYahia;, using
L. V. social intelligence.
Lakshmanan;, and CongIn Demon-
Yu. So-
the face
sider of challenges
scalable crawlingposed as thereby large volumes
are large of data
volumes being
of tweets strations
cialScope at: the 13th Conference
Enabling information of the European
discovery on Chap-
social
collected, the NoSQL
received, processed and paradigm
indexed should be considered
per second. With respectas an ter of the
content Association
sites. In CIDR, for2009.Computational Linguistics,
obvious choice of dealing with them.
to implementation, it is necessary to investigate paradigms Developed solutions pages 6972, 2012.
should
that be well,
scale extensible for upcoming
like MapReduce whichrequirements
is optimized andforshould
offline [11] T. Baldwin, P. Cook, and B. Han. A support platform
indeed
analytics scale well. data
on large When collecting on
partitioned data, a user need
hundreds to con-
of machines. [12] for
L. Barbosa and J. Feng.
event detection usingRobust sentiment detection
social intelligence. In Demon-on
sider scalable
Depending oncrawling
the complexityas thereofarethe large volumes
queries of tweets
supported, it Twitter
strationsfrom biased
at the 13th and noisy data.
Conference of thepages 3644,Chap-
European Aug.
received,
might be processed
difficult toand expressindexed
graph per second. With
algorithms respect
intuitively in 2010.
ter of the Association for Computational Linguistics,
to implementation,
MapReduce it is necessary
graph models, to investigate
consequently databases paradigms
such as pages 6972, 2012.
that scale [13] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam,
Titan [4], well,
DEXlike [3],MapReduce
and Neo4j which is optimized
[2] should be compared for offline
for
analytics on large data partitioned on hundreds of machines.
graph implementations. [12] and E. H. Chi.
L. Barbosa and Eddi: interactive
J. Feng. topic-based
Robust sentiment browsing
detection on
Depending on the complexity of the queries supported, it of social from
Twitter status streams.
biased In 23nd
and noisy annual
data. pagesACM sympo-
3644, Aug.
might be difficult to express graph algorithms intuitively in sium
2010. on User interface software and technology - UIST,
7. CONCLUSION
MapReduce graph models, consequently databases such as pages 303312, Oct. 2010.
In this paper we addressed issues[2]around thebebig data nature [13] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam,
Titan [4], DEX [3], and Neo4j should compared for
of Twitter analytics and the need for new data management [14] and
A. Bifet
E. H.and
Chi.E. Eddi:
Frank.interactive
Sentimenttopic-based
knowledge discovery
browsing
graph implementations.
and query language frameworks. By conducting a careful in
of twitter streaming
social status data.InDiscovery
streams. Science.
23nd annual ACMSpringer
sympo-
and extensive review of the existing literature we observed Berlin
sium onHeidelberg, pagessoftware
User interface 115, Oct.
and 2010.
technology - UIST,
7.
ways CONCLUSION
in which researchers have tried to develop general plat- pages 303312, Oct. 2010.
In this to
forms paper we addressed
provide a repeatable issues around thefor
foundation bigTwitter
data naturedata [15] A. Black, C. Mascaro, M. Gallagher, and S. P. Gog-
of Twitter We
analytics. analytics
reviewed andthe thetweet
need for new data
analytics space management
by explor- [14] gins. Twitter
A. Bifet and E. Zombie:
Frank. Architecture for capturing,
Sentiment knowledge so-
discovery
and mechanisms
ing query language frameworks.
primarily for data By conducting
collection, data amanage-
careful cially transforming
in twitter streaming anddata.analyzing
Discoverythe Twittersphere.
Science. Springer
and
mentextensive reviewfor
and languages of querying
the existing andliterature
analyzingwe observed
tweets. We In International
Berlin Heidelberg,conference
pages 115, on Oct.
Supporting
2010. group work,
ways in which researchers
have identified the essential have tried to develop
ingredients requiredgeneral plat-
for a unified pages 229238, 2012.
forms to provide a repeatable foundation for Twitter data [15] A. Black, C. Mascaro, M. Gallagher, and S. P. Gog-
framework that address the limitations of existing systems. [16] gins.
M. Boanjak
Twitter and E. Oliveira.
Zombie: TwitterEcho
Architecture - A dis-
for capturing, so-
analytics.
The paperWe reviewed
outlines the tweet
research issuesanalytics space bysome
and identifies explor-of
ing mechanisms primarily with for data tributed focused crawler
cially transforming and to support the
analyzing openTwittersphere.
research with
the challenges associated the collection,
development data of manage-
such in- twitter data. In International
In International conference onconference
Supporting companion on
group work,
ment
tegrated andplatforms.
languages for querying and analyzing tweets. We
have identified the essential ingredients required for a unified World Wide Web,
pages 229238, pages 12331239, 2012.
2012.
framework that address the limitations of existing systems.
8. REFERENCES
The paper outlines research issues and identifies some of
[17] K. Bontcheva
[16] M. Boanjak and andE.L.Oliveira.
Derczynski. TwitIE: an
TwitterEcho - Aopen-
dis-
source
tributedinformation extraction
focused crawler pipeline
to support open for microblog
research with
the challenges associated with the development of such in- text. Indata.
twitter International Conference
In International on Recent
conference Advances
companion on
[1] FaceBook Query Language(FQL) overview.
tegrated platforms. in Natural
World WideLanguage Processing,
Web, pages 2013.
12331239, 2012.
https://developers.facebook.com/docs/
technical-guides/fql.
8. REFERENCES [18] C. Budak,
[17] K. T. Georgiou,
Bontcheva and D. E. Abbadi.
and L. Derczynski. TwitIE: GeoScope:
an open-
[2] Neo4j: The worlds leading graph database. http:// Online detection of extraction
source information geo-correlated information
pipeline trends
for microblog
www.neo4j.org/.
[1] FaceBook Query Language(FQL) overview. in social
text. networks. PVLDB,
In International 7(4):229240,
Conference 2013.
on Recent Advances
https://developers.facebook.com/docs/ in Natural Language Processing, 2013.
technical-guides/fql.
[18] C. Budak, T. Georgiou, and D. E. Abbadi. GeoScope:
[2] Neo4j: The worlds leading graph database. http:// Online detection of geo-correlated information trends
www.neo4j.org/. in social networks. PVLDB, 7(4):229240, 2013.

SIGKDD Explorations Volume 16, Issue 1 Page 18


[19] C. Byun, H. Lee, Y. Kim, and K. K. Kim. Twitter [34] S. Frenot and S. Grumbach. An in-browser microblog
data collecting tool with rule-based filtering and analy- ranking engine. In International conference on Ad-
sis module. International Journal of Web Information vances in Conceptual Modeling, volume 7518, pages 78
Systems, 9(3):184203, 2013. 88, 2012.

[20] S. Carter,
[19] C. W. Lee,
Byun, H. Weerkamp,
Y. Kim, andand
M. K.
Tsagkias.
K. Kim.Microblog
Twitter [35] G. Fr
[34] S. Golovchinsky
enot and S.and M. Efron.An
Grumbach. Making sense of
in-browser Twitter
microblog
language identification:
data collecting tool with Overcoming the limitations
rule-based filtering of
and analy- search.
rankingInengine.
CHI, 2010.
In International conference on Ad-
short, unedited
sis module. and idiomatic
International text. of
Journal Language Resources
Web Information vances in Conceptual Modeling, volume 7518, pages 78
and Evaluation, 47(1):195215, [36] M. Graham, S. A. . Hale, and D. Gaffney. Where in the
Systems, 9(3):184203, 2013. June 2012. 88, 2012.
world are you ? Geolocation and language identification
[21] S.
[20] M.Carter,
Cha, H.W. Haddadi,
Weerkamp,F. Benevenuto, and K.Microblog
and M. Tsagkias. P. Gum- in Twitter.
[35] G. In ICWSM,
Golovchinsky and M. pages
Efron.518521, 2012.of Twitter
Making sense
madi. Measuring
language user influence
identification: in twitter:
Overcoming The million
the limitations of search. In CHI, 2010.
[37] B. Hecht, L. Hong, B. Suh, and E. Chi. Tweets from
follower
short, fallacy. and
unedited In ICWSM,
idiomaticpages
text. 1017,
Language2010.
Resources Justin
[36] M. Biebers
Graham, S. heart: the and
A. . Hale, dynamics of the Where
D. Gaffney. locationinfield
the
and Evaluation, 47(1):195215, June 2012. in userareprofiles. In Conference
[22] S. Chandra, L. Khan, and F. B. Muhaya. Estimating world you ? Geolocation and on Human
language Factors in
identification
[21] twitter
M. Cha,userH. location
Haddadi,using social interactionsa
F. Benevenuto, and K. P.content
Gum- Computing
in Twitter. Systems,
In ICWSM, pages 237246,
pages 2011.
518521, 2012.
based approach. In
madi. Measuring IEEE
user Conference
influence on Privacy,
in twitter: Se-
The million [38]
[37] B. J. Jansen,
Hecht, M. Zhang,
L. Hong, K. Sobel,
B. Suh, and E. and
Chi.A.Tweets
Chowdury.
from
curity,
followerRisk and In
fallacy. Trust, pagespages
ICWSM, 838843, Oct.
1017, 2011.
2010. Twitter power: heart:
Tweets
Justin Biebers theas electronic
dynamics word
of the of mouth.
location field
Journal of the American
in user profiles. SocietyonforHuman
In Conference Information
FactorsSci-in
[23] C. Chandra,
[22] S. Chen, F. L.Li, Khan,
C. Ooi,andandF. S.B.Wu. TI : An
Muhaya. efficient
Estimating
ence and Technology,
Computing 60(11):21692188,
Systems, pages 237246, 2011. Nov. 2009.
indexing mechanism
twitter user for real-time
location using search. In SIGMOD,
social interactionsa content
pages
based 649660,
approach.2011.
In IEEE Conference on Privacy, Se- [39]
[38] J.
B. Jiang, L. Hidayah,
J. Jansen, T.K.
M. Zhang, Elsayed,
Sobel, and A.H. Chowdury.
Ramadan.
curity, Risk and Trust, pages 838843, Oct. 2011. BEST
Twitterofpower:
KAUST at TREC-2011
Tweets : Building
as electronic word of effective
mouth.
[24] Z. Cheng, J. Caverlee, K. Lee, and C. Science. A search
JournalinofTwitter. TREC, Society
the American 2011. for Information Sci-
[23] content-driven
C. Chen, F. Li,framework
C. Ooi, and forS.geo-locating
Wu. TI : An microblog
efficient
ence and Technology, 60(11):21692188, Nov. 2009.
users. ACM
indexing Transactions
mechanism on Intelligent
for real-time Systems
search. In SIGMOD,and [40] P. J
urgens, A. Jungherr, and H. Schoen. Small worlds
Technology, 2012.
pages 649660, 2011. [39] with a difference:
J. Jiang, new gatekeepers
L. Hidayah, T. Elsayed, and
and the
H. filtering
Ramadan. of
political
BEST ofinformation
KAUST at on Twitter. In: International
TREC-2011 Web
Building effective
[25] M. Cheng,
[24] Z. Cheong J.
andCaverlee,
S. Ray. A
K.literature
Lee, andreview of recent
C. Science. A Science
search inConference-WebSci, pages 15, June 2011.
Twitter. TREC, 2011.
microblogging
content-driven developments.
framework forTechnical report,
geo-locating Clayton
microblog
School of Information
users. ACM Technology,
Transactions Monash
on Intelligent University,
Systems and [41]
[40] U. Kang,
P. Jurgens, D.A.
H. Jungherr,
Chau, andand C. Faloutsos.
H. Schoen.Managing and
Small worlds
2011.
Technology, 2012. mining large graphsnew
with a difference: : Systems and implementations.
gatekeepers and the filtering In
of
SIGMOD, volume 1, on
political information pages 589592,
Twitter. 2012.
In International Web
[26] Chew,
[25] M. Cynthia,
Cheong and and G. Eysenbach.
S. Ray. A literaturePandemics in the
review of recent Science Conference-WebSci, pages 15, June 2011.
age of twitter:developments.
microblogging content analysis of tweets
Technical during
report, the
Clayton [42] U. Kang and C. Faloutsos. Big graph mining :
2009
SchoolH1N1 outbreak. PloS
of Information one, 5(11),
Technology, 2010.University,
Monash [41] Algorithms
U. Kang, D. H. andChau,
discoveries. SIGKDD Managing
and C. Faloutsos. Explorations,
and
2011. 14(2):2936, 2013. : Systems and implementations. In
mining large graphs
[27] B. O. Connor, N. A. Smith, and E. P. Xing. A latent SIGMOD, volume 1, pages 589592, 2012.
[26] variable model forand
Chew, Cynthia, geographic lexical variation.
G. Eysenbach. PandemicsIninCon-
the [43] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos.
ference on Empirical
age of twitter: Methods
content in Natural
analysis Language
of tweets duringPro-
the [42] Gbase:
U. Kang An and
efficient
C. analysis
Faloutsos. platform for large
Big graph graphs.:
mining
cessing,
2009 H1N1 pages 12771287,
outbreak. PloS2010.
one, 5(11), 2010. VLDB Journal,
Algorithms and21(5):637650, June 2012.Explorations,
discoveries. SIGKDD
14(2):2936, 2013.
[28] Conover,
[27] B. Michael,
O. Connor, J. Ratkiewicz,
N. A. Smith, and E. P. M.Xing.Francisco,
A latent [44] S. Kumar, G. Barbier, M. Abbasi, and H. Liu. Tweet-
B. Goncalves,
variable model F.forMenczer, and
geographic A. Flammini.
lexical variation. Political
In Con- [43] Tracker:
U. Kang, An H. analysis
Tong, J. tool
Sun,for humanitarian
C.-Y. Lin, and C.and disaster
Faloutsos.
polarization on Twitter.
ference on Empirical In ICWSM,
Methods 2011.
in Natural Language Pro- relief.
Gbase:InAn ICWSM,
efficientpages 661662,
analysis 2011.
platform for large graphs.
cessing, pages 12771287, 2010. VLDB Journal, 21(5):637650, June 2012.
[45] H. Kwak, C. Lee, H. Park, and S. Moon. What is twit-
[29] J. David. Thats what friends are for inferring location
[28] in online social
Conover, mediaJ.platforms
Michael, based M.
Ratkiewicz, on social rela-
Francisco, [44] ter, a socialG.
S. Kumar, network or aM.
Barbier, news media?
Abbasi, andInH.WWW, pages
Liu. Tweet-
591600,
Tracker: An 2010.
analysis tool for humanitarian and disaster
tionships. In ICWSM,
B. Goncalves, 2013.and A. Flammini. Political
F. Menczer,
polarization on Twitter. In ICWSM, 2011. relief. In ICWSM, pages 661662, 2011.
[46] C.-H. Lee, H.-C. Yang, T.-F. Chien, and W.-S. Wen.
[30] Diego Serrano, Eleni Stroulia, Denilson Barbosa and
[45] A
H.novel
Kwak, approach
C. Lee, H.for Park,
event anddetection by mining
S. Moon. What is spatio-
twit-
[29] V. Guana.Thats
J. David. SociQL:
whatAfriends
query are
language for the
for inferring social
location temporal information
ter, a social network or on microblogs.
a news International
media? In WWW, pages
Web. In social
in online E. Kranakis, editor, Advances
media platforms based on in Network
social rela- Conference on Advances in Social Networks Analysis
591600, 2010.
Analysis
tionships.and its Applications,
In ICWSM, 2013. chapter 17, pages 381 and Mining, pages 254259, July 2011.
406. 2013. [46] C.-H. Lee, H.-C. Yang, T.-F. Chien, and W.-S. Wen.
[30] Diego Serrano, Eleni Stroulia, Denilson Barbosa and [47] A
C. novel
Li, J. approach
Weng, Q.for He,event
Y. Yao, and A.by
detection Datta.
miningTwiNER:
spatio-
[31] V.
Y. Doytsher and B. Galon.
Guana. SociQL: Querying
A query geo-social
language for thedata by
social named entity
temporal recognition
information on in targeted twitter
microblogs. stream. In
In International
bridging
Web. In spatial networkseditor,
E. Kranakis, and social networks.
Advances In 2nd
in Network SIGIR, pages
Conference on721730,
Advances 2012.
in Social Networks Analysis
ACM SIGSPATIAL
Analysis International
and its Applications, Workshop
chapter on Loca-
17, pages 381 and Mining, pages 254259, July 2011.
tion
406. Based
2013. Social Networks, pages 3946, 2010. [48] A. Marcus, M. Bernstein, and O. Badar. Tweets as
data:
[47] C. Li, demonstration
J. Weng, Q. He,ofY. TweeQL
Yao, and andA.Twitinfo. In SIG-
Datta. TwiNER:
[32] A. Dries,
[31] Y. S. Nijssen,
Doytsher and L. De
and B. Galon. Raedt.geo-social
Querying A query language
data by MOD, pages
named entity 12591261,
recognition2011.
in targeted twitter stream. In
for analyzing
bridging networks.
spatial In CIKM,
networks pages
and social 485494,In2009.
networks. 2nd SIGIR, pages 721730, 2012.
ACM SIGSPATIAL International Workshop on Loca- [49] A. Marcus, M. Bernstein, and O. Badar. Processing and
[33] tion
M. Efron.
BasedHashtag retrieval inpages
Social Networks, a microblogging
3946, 2010.environ- [48] visualizing
A. Marcus,the M.data in tweets.
Bernstein, andSIGMOD
O. Badar.Record, 40(4),
Tweets as
ment. pages 787788, 2010. 2012.
data: demonstration of TweeQL and Twitinfo. In SIG-
[32] A. Dries, S. Nijssen, and L. De Raedt. A query language MOD, pages 12591261, 2011.
for analyzing networks. In CIKM, pages 485494, 2009.
[49] A. Marcus, M. Bernstein, and O. Badar. Processing and
[33] M. Efron. Hashtag retrieval in a microblogging environ- visualizing the data in tweets. SIGMOD Record, 40(4),
ment. pages 787788, 2010. 2012.

SIGKDD Explorations Volume 16, Issue 1 Page 19


[50] M. S. Martn and C. Gutierrez. Representing, querying [63] A. Ritter, S. Clark, and O. Etzioni. Named entity recog-
and transforming social networks with RDF/SPARQL. nition in tweets : an experimental study. In Conference
In European Semantic Web Conference, pages 293307, on Empirical Methods in Natural Language Processing,
2009. pages 15241534, 2011.
[51] P. T.
[50] M. S. W. Mauro
Martn andSan Martn, Claudio
C. Gutierrez. Gutierrez.
Representing, SNQL
querying [63] A.
[64] R. Ritter,
Ronen andS. Clark, and O. SoQL:
O. Shmueli. Etzioni.ANamed entity
language recog-
for query-
:and
A social networksocial
transforming querynetworks
and transformation language.
with RDF/SPARQL. nition
ing in tweets
and : an
creating experimental
data in social study. In Conference
networks. In ICDE,
In European
5th Alberto Mendelzon
Semantic Web International Workshop
Conference, pages on
293307, on Empirical
pages Methods
15951602, Mar.in Natural Language Processing,
2009.
Foundations
2009. of Data Management, 2011. [65] pages 15241534,
T. Sakaki. 2011.shakes twitter users : Real-time
Earthquake
event detection by social sensors. In WWW, pages 851
[52] M.T.
[51] P. Mcglohon
W. Mauro andSanC.Mart
Faloutsos. Statistical
n, Claudio properties
Gutierrez. SNQL [64] R. Ronen
860, 2010.and O. Shmueli. SoQL: A language for query-
of
: Asocial
social networks. In C.and
network query C. transformation
Aggarwal, editor, Social
language. ing and creating data in social networks. In ICDE,
Network Data Analytics,
In 5th Alberto Mendelzonchapter 2, pagesWorkshop
International 1742. 2011.
on pages
[66] S. 15951602,
Salihoglu and J.Mar. 2009.GPS : A graph processing
Widom.
Foundations of Data Management, 2011. [65] T. Sakaki.
system. In Earthquake
Internationalshakes twitter on
Conference users : Real-time
Scientific and
[53] P. Mendes, A. Passant, and P. Kapanipathi. Twarql:
event detection
Statistical by social
Database sensors. Inpages
Management, WWW, pages
131, 851
2013.
[52] tapping into the
M. Mcglohon andwisdom of the crowd.
C. Faloutsos. In Proceedings
Statistical of
properties
860, 2010.
the 6th International
of social networks. InConference on Semantic
C. C. Aggarwal, Systems,
editor, Social [67] A. Schulz, A. Hadjakos, and H. Paulheim. A multi-
pages
Network35, 2010.
Data Analytics, chapter 2, pages 1742. 2011. [66] S. Salihoglu
indicator and J. Widom.
approach GPS : A graph
for geolocalization processing
of tweets. In
[54] system. Inpages
ICWSM, International Conference on Scientific and
573582, 2013.
[53] F. Morstatter,
P. Mendes, S. Kumar,
A. Passant, andH.P.Liu, and R. Maciejew-
Kapanipathi. Twarql:
ski. Understanding Twitter datacrowd.
with TweetXplorer. Statistical Database Management, pages 131, 2013.
tapping into the wisdom of the In Proceedings In
of [68] A. Signorini, A. M. Segre, and P. M. Polgreen. The
SIGKDD, pages 14821485,
the 6th International 2013.on Semantic Systems,
Conference [67] A.
use Schulz, A. Hadjakos,
of Twitter and H.
to track levels Paulheim.
of disease A multi-
activity and
pages 35, 2010. indicator
public approach
concern in the for
U.S.geolocalization of tweets.
during the influenza A H1N1In
[55] P. Noordhuis, M. Heijkoop, and A. Lazovik. Mining
[54] Twitter in the cloud:
F. Morstatter, A caseH.study.
S. Kumar, Liu, In
andIEEE 3rd Inter-
R. Maciejew- ICWSM, pages
pandemic. PloS 573582,
one, 6(5),2013.
Jan. 2011.
national ConferenceTwitter
ski. Understanding on Cloud
dataComputing, pages 107
with TweetXplorer. In [68] A.
[69] Signorini, and
Y. Stavrakas A. M.V. Segre, and P.A M.
Plachouras. Polgreen.
platform The
for sup-
114, July 2010.
SIGKDD, pages 14821485, 2013. use of Twitter
porting to track
data analytics onlevels
twitterof challenges
disease activity and
and objec-
[56]
[55] I.
P. Ounis, C. Macdonald,
Noordhuis, M. Heijkoop,J.and Lin,
A. and I. Soboroff.
Lazovik. Mining publicIntl.
tives. concern in the U.S.
Workshop during theExtraction
on Knowledge influenza A&H1N1
Con-
Overview
Twitter inofthe
the TREC-2011
cloud: Microblog
A case study. Track.
In IEEE 3rdInInter-
20th pandemic. from
solidation PloS Social
one, 6(5),
Media,Jan.(Ict
2011.
270239), 2013.
Text REtrieval
national Conference
Conference (TREC),
on Cloud 2011. pages 107
Computing, [69] Y.
[70] Stavrakas
Tumasjan, and V. Plachouras.
Andranik, A platform
T. O. Sprenger, for sup-
P. G. Sandner,
114, July 2010. porting data analytics on twitter challenges
[57] A. Pak and P. Paroubek. Twitter as a corpus for sen- and I. M. Welpe. Predicting Elections withand objec-
Twitter:
[56] timent analysis
I. Ounis, and opinionJ.mining.
C. Macdonald, Lin, andIn International
I. Soboroff. tives. Intl.
What Workshop on
140 Characters Knowledge
Reveal Extraction
about Political & Con-
Sentiment.
Conference
Overview of onthe Language
TREC-2011 Resources
Microblogand Evaluation,
Track. In 20th solidation
In ICWSM, from Social
pages Media,
178185, (Ict 270239), 2013.
2010.
pages 13201326,
Text REtrieval 2010.
Conference (TREC), 2011.
[70]
[71] Tumasjan, Andranik,
J. Weng, E.-p. Lim, andT.J.O. Sprenger,
Jiang. P. G. Sandner,
TwitterRank : Find-
[58] Paul,
[57] A. PakM.and
J, and M. Dredze.Twitter
P. Paroubek. In ICWSM, pages 265272.
as a corpus for sen- and I. M. Welpe. Predicting
ing topic-sensitive influential Elections
twitterers.with
In Twitter:
WSDM,
timent analysis and opinion mining. In International What
pages 140 Characters
261270, 2010. Reveal about Political Sentiment.
[59] Conference
V. Plachouras on and Y. Stavrakas.
Language Querying
Resources term asso-
and Evaluation, In ICWSM, pages 178185, 2010.
ciations and their 2010.
temporal evolution in social data. In [72] J. S. White, J. N. Matthews, and J. L. Stacy. Coalmine:
pages 13201326,
International VLDB Workshop on Online Social Sys- [71] an
J. Weng, E.-p. in
experience Lim, and J.aJiang.
building systemTwitterRank : Find-
for social media an-
[58] tems, 2012.
Paul, M. J, and M. Dredze. In ICWSM, pages 265272. ing topic-sensitive
alytics. influential
In I. V. Ternovskiy andtwitterers. In WSDM,
P. Chin, editors, Pro-
pages 261270,
ceedings of SPIE, 2010.
volume 8408, 2012.
[60] V. Plachouras,
[59] Plachouras andY. Stavrakas, and Querying
Y. Stavrakas. A. Andreou.termAssess-
asso-
ing the coverage
ciations and theiroftemporal
data collection campaigns
evolution in socialon Twit-
data. In [72] J.
[73] P. S.
T.White,
Wood. J. N. Matthews,
Query languagesand J. L. Stacy.
for graph Coalmine:
databases. SIG-
ter: A case study.
International VLDBInWorkshop
On the Move to Meaningful
on Online In-
Social Sys- an experience
MOD Record, in building a Apr.
41(1):5060, system for social media an-
2012.
ternet 2012.
tems, Systems: OTM 2013 Workshops, pages 598607. alytics. In I. V. Ternovskiy and P. Chin, editors, Pro-
2013. [74] S. Wu, J.ofM.
ceedings Hofman,
SPIE, volumeW.8408,
A. Mason,
2012. and D. J. Watts.
[60] V. Plachouras, Y. Stavrakas, and A. Andreou. Assess- Who says what to whom on twitter. In WWW, pages
[61] ing
D. the
Preotiuc-Pietro,
coverage of dataS. collection
Samangooei, and T.
campaigns on Cohn.
Twit- [73] 705714,
P. T. Wood. Query
Mar. 2011.languages for graph databases. SIG-
Trendminer : An architecture
ter: A case study. for real
In On the Move to time analysisIn-
Meaningful of MOD Record, 41(1):5060, Apr. 2012.
social
ternet media text.
Systems: OTMIn Workshop on Real-Time
2013 Workshops, Analysis
pages 598607. [75] X. Yan, P. S. Yu, and J. Han. Graph indexing : A
and
2013.Mining of Social Streams, pages 47, 2012. [74] S. Wu, J.structure-based
frequent M. Hofman, W.approach.
A. Mason, In and D. J. Watts.
SIGMOD, pages
Who says2004.
335346, what to whom on twitter. In WWW, pages
[62] L. Ratinov
[61] D. and D. Roth.
Preotiuc-Pietro, S. Design challenges
Samangooei, andand
T.miscon-
Cohn. 705714, Mar. 2011.
ceptions in named
Trendminer entity recognition.
: An architecture for realIntime
Conference
analysis on
of [76] J. Yin, S. Karimi, B. Robinson, and M. Cameron. ESA:
Computational Natural
social media text. Language
In Workshop on Learning
Real-Time(CoNLL),
Analysis [75] X. Yan, P. situation
emergency S. Yu, and J. Han. via
awareness Graph indexing : In
microbloggers. A
number June,
and Mining of pages
Social147155,
Streams,2009.
pages 47, 2012. frequentpages
CIKM, structure-based
27012703, approach.
2012. In SIGMOD, pages
335346, 2004.
[62] L. Ratinov and D. Roth. Design challenges and miscon-
ceptions in named entity recognition. In Conference on [76] J. Yin, S. Karimi, B. Robinson, and M. Cameron. ESA:
Computational Natural Language Learning (CoNLL), emergency situation awareness via microbloggers. In
number June, pages 147155, 2009. CIKM, pages 27012703, 2012.

SIGKDD Explorations Volume 16, Issue 1 Page 20


What is Tumblr: A Statistical Overview and Comparison

Yi Chang , Lei Tang , Yoshiyuki Inagaki , Yan Liu



Yahoo Labs, Sunnyvale, CA 94089
@WalmartLabs, San Bruno, CA 94066


University of Southern California, Los Angeles, CA 90089
yichang@yahoo-inc.com, leitang@acm.org, inagakiy@yahoo-inc.com, yanliu.cs@usc.edu

ABSTRACT trary, popular social networking sites like Facebook6 , have richer
social interactions, but lower quality content comparing with blo-
Tumblr, as one of the most popular microblogging platforms, has
gosphere. Since most social interactions are either unpublished or
gained momentum recently. It is reported to have 166.4 millions of
less meaningful for the majority of public audience, it is natural for
users and 73.4 billions of posts by January 2014. While many arti-
Facebook users to form different communities or social circles. Mi-
cles about Tumblr have been published in major press, there is not
croblogging services, in between of traditional blogging and online
much scholar work so far. In this paper, we provide some pioneer
social networking services, have intermediate quality content and
analysis on Tumblr from a variety of aspects. We study the social
intermediate social interactions. Twitter7 , which is the largest mi-
network structure among Tumblr users, analyze its user generated
croblogging site, has the limitation of 140 characters in each post,
content, and describe reblogging patterns to analyze its user be-
and the Twitter following relationship is not reciprocal: a Twitter
havior. We aim to provide a comprehensive statistical overview of
user does not need to follow back if the user is followed by another.
Tumblr and compare it with other popular social services, including
As a result, Twitter is considered as a new social media [11], and
blogosphere, Twitter and Facebook, in answering a couple of key
short messages can be broadcasted to a Twitter users followers in
questions: What is Tumblr? How is Tumblr different from other
real time.
social media networks? In short, we nd Tumblr has more rich
content than other microblogging platforms, and it contains hybrid Tumblr is also posed as a microblogging platform. Tumblr users
characteristics of social networking, traditional blogosphere, and can follow another user without following back, which forms a non-
social media. This work serves as an early snapshot of Tumblr that reciprocal social network; a Tumblr post can be re-broadcasted by
later work can leverage. a user to its own followers via reblogging. But unlike Twitter, Tum-
blr has no length limitation for each post, and Tumblr also supports
multimedia post, such as images, audios or videos. With these dif-
1. INTRODUCTION ferences in mind, are the social network, user generated content, or
user behavior on Tumblr dramatically different from other social
Tumblr, as one of the most prevalent microblogging sites, has be-
media sites?
come phenomenal in recent years, and it is acquired by Yahoo! in
2013. By mid-January 2014, Tumblr has 166.4 millions of users In this paper, we provide a statistical overview over Tumblr from
and 73.4 billions of posts1 . It is reported to be the most popular assorted aspects. We study the social network structure among
social site among young generation, as half of Tumblrs visitor are Tumblr users and compare its network properties with other com-
under 25 years old2 . Tumblr is ranked as the 16th most popular monly used ones. Meanwhile, we study content generated in Tum-
sites in United States, which is the 2nd most dominant blogging blr and examine the content generation patterns. One step further,
site, the 2nd largest microblogging service, and the 5th most preva- we also analyze how a blog post is being reblogged and propagated
lent social site3 . In contrast to the momentum Tumblr gained in through a network, both topologically and temporally. Our study
recent press, little academic research has been conducted over this shows that Tumblr provides hybrid microblogging services: it con-
burgeoning social service. Naturally questions arise: What is Tum- tains dual characteristics of both social media and traditional blog-
blr? What is the difference between Tumblr and other blogging or ging. Meanwhile, surprising patterns surface. We describe these
social media sites? intriguing ndings and provide insights, which hopefully can be
leveraged by other researchers to understand more about this new
Traditional blogging sites, such as Blogspot4 and Live Journal5 ,
form of social media.
have high quality content but little social interactions. Nardi et
al. [17] investigated blogging as a form of personal communica-
tion and expression, and showed that the vast majority of blog posts 2. TUMBLR AT FIRST SIGHT
are written by ordinary people with a small audience. On the con-
Tumblr is ranked the second largest microblogging service, right
after Twitter, with over 166.4 million users and 73.4 billion posts
1
http://www.tumblr.com/about by January 2014. Tumblr is easy to register, and one can sign up
2
http://www.webcitation.org/64UXrbl8H for Tumblr service with a valid email address within 30 seconds.
3
http://www.alexa.com/topsites/countries/US Once sign in Tumblr, a user can follow other users. Different from
4
http://blogspot.com Facebook, the connections in Tumblr do not require mutual conr-
5
http://livejournal.com mation. Hence the social network in Tumblr is unidirectional.
6
http://facebook.com
7
http://twitter.com

SIGKDD Explorations Volume 16, Issue 1 Page 21


Both Twitter and Tumblr are considered as microblogging plat- Photo: 78.11%
Text: 14.13%
forms. Comparing with Twitter, Tumblr exposes several differ- Quote: 2.27%
ences: Audio: 2.01%
Video: 1.35%
Chat: 0.85%
There is no length limitation for each post; Answer: 0.82%
Link: 0.46%
Tumblr supports multimedia posts, such as images, audios
and videos;

Similar to hashtags in Twitter, bloggers can also tag their


blog post, which is commonplace in traditional blogging.
But tags in Tumblr are seperate from blog content, while in
Twitter the hashtag can appear anywhere within a tweet.

Tumblr recently (Jan. 2014) allowed users to mention and


link to specic users inside posts. This @user mechanism
needs more time to be adopted by the community; Figure 2: Distribution of Posts (Better viewed in color)
Tumblr does not differentiate veried account.
3.1 billion edges. Though this graph is not yet up-to-date, we be-
lieve that many network properties should be well preserved given
the scale of this graph. Meanwhile, we sample about 586.4 million
of Tumblr posts from August 10 to September 6, 2013. Unfortu-
nately, Tumblr does not require users to ll in basic prole infor-
mation, such as gender or location. Therefore, it is impossible for
us to conduct user prole analysis as done in other works. In or-
Figure 1: Post Types in Tumblr der to handle such large volume of data, most statistical patterns
are computed through a MapReduce cluster, with some algorithms
Specically, Tumblr denes 8 types of posts: photo, text, quote, being tricky. We will skip the involved implementation details but
audio, video, chat, link and answer. As shown in Figure 1, one concentrate solely on the derived patterns.
has the exibility to start a post in any type except answer. Text, Most statistical patterns can be presented in three different forms:
photo, audio, video and link allow one to post, share and comment probability density function (PDF), cumulative distribution func-
any multimedia content. Quote and chat, which are not available tion (CDF) or complementary cumulative distribution function (CCDF),
in most other social networking platforms, let Tumblr users share describing P r(X = x), P r(X x) and P r(X x) respec-
quote or chat history from ichat or msn. Answer occurs only when tively, where X is a random variable and x is certain value. Due
one tries to interact with other users: when one user posts a ques- to the space limit, it is impossible to include all of them. Hence,
tion, in particular, writes a post with text box ending with a question we decide which form(s) to include depending on presentation and
mark, the user can enable the option for others to answer the ques- comparison convenience with other relevant papers. That is, if
tion, which will be disabled automatically after 7 days. A post can CCDF is reported in a relevant paper, we try to also report CCDF
also be reblogged by another user to broadcast to his own follow- here so that rigorous comparison is possible.
ers. The reblogged post will quote the original post by default and Next, we study properties of Tumblr through different lenses, in
allow the reblogger to add additional comments. particular, as a social network, a content generation website, and
Figure 2 demonstrates the distribution of Tumblr post types, based an information propagation platform, respectively.
on 586.4 million posts we collected. As seen in the gure, even
though all kinds of content are supported, photo and text dominate
the distribution, accounting for more than 92% of the posts. There- 3. TUMBLR AS SOCIAL NETWORK
fore, we will concentrate on these two types of posts for our content We begin our analysis of Tumblr by examining its social network
analysis later. topology structure. Numerous social networks have been analyzed
Since Tumblr has a strong presence of photos, it is natural to com- in the past, such as traditional blogosphere [21], Twitter [10; 11],
pare it to other photo or image based social networks like Flickr8 Facebook [22], and instant messenger communication network [13].
and Pinterest9 . Flickr is mainly an image hosting website, and Here we run an array of standard network analysis to compare with
Flicker users can add contact, comment or like others photos. Yet, other networks, with results summarized in Table 110 .
different from Tumblr, one cannot reblog anothers photo in Flickr. Degree Distribution. Since Tumblr does not require mutual con-
Pinterest is designed for curators, allowing one to share photos or rmation when one follows another user, we represent the follower-
videos of her taste with the public. Pinterest links a pin to the followee network in Tumblr as a directed graph: in-degree of a user
commercial website where the product presented in the pin can represents how many followers the user has attracted, while out-
be purchased, which accounts for a stronger e-commerce behavior. degree indicates how many other users one user has been following.
Therefore, the target audience of Tumblr and Pinterest are quite Our sampled sub-graph contains 62.8 million nodes and 3.1 billion
different: the majority of users in Tumblr are under age 25, while
Even though we wish to include results over other popular social
10
Pinterest is heavily used by women within age from 25 to 44 [16]. media networks like Pinterest, Sina Weibo and Instagram, analysis
We directly sample a sub-graph snapshot of social network from over those websites not available or just small-scale case studies
Tumblr on August 2013, which contains 62.8 million nodes and that are difcult to generalize to a comprehensive scale for a fair
comparison. Actually in the Table, we observe quite a discrepancy
8
http://ickr.com between numbers reported over a small twitter data set and another
9
http://pinterest.com comprehensive snapshot.

SIGKDD Explorations Volume 16, Issue 1 Page 22


Table 1: Comparison of Tumblr with other popular social networks. The numbers of Blogosphere, Twitter-small, Twitter-huge, Facebook,
and MSN are obtained from [21; 10; 11; 22; 13], respectively. In the table, implies the corresponding statistic is not available or not
applicable; GCC denotes the giant connected component; the symbols in parenthesis m, d, e, r respectively represent mean, median, the
90% effective diameter, and diameter (the maximum shortest path in the network).
Metric Tumblr Blogosphere Twitter-small Twitter-huge Facebook MSN
#nodes 62.8M 143,736 87,897 41.7M 721M 180M
#links 3.1B 707,761 829,467 1.47B 68.7B 1.3B
in-degree distr k2.19 k2.38 k2.4 k2.276
degree distr in r-graph = power-law = power-law k0.8 e0.03k
direction directed directed directed directed undirected undirected
reciprocity 29.03% 3% 58% 22.1%
degree correlation 0.106 >0 0.226
avg distance 4.7(m), 5(d) 9.3(m) 4.1(m), 4(d) 4.7(m), 5(d) 6.6(m), 6(d)
diameter 5.4(e), 29(r) 12(r) 6(r) 4.8(e), 18(r) < 5(e) 7.8(e), 29(r)
GCC coverage 99.61% 75.08% 93.03% 99.91% 99.90%

edges. Within this social graph, 41.40% of nodes have 0 in-degree, point due to the Tumblr limit of 5000 followees for ordinary users.
and the maximum in-degree of a node is 4.06 million. By con- The reciprocity relationship on Tumblr does not follow the power
trast, 12.74% of nodes have 0 out-degree, the maximum out-degree law distribution, since the curve mostly is convex, similar to the
of a node is 155.5k. Top popular Tumblr users include equipo11 , pattern reported over Facebook[22].
instagram12 , and woodendreams13 . This indicates the media char- Meanwhile, it has been observed that ones degree is correlated
acteristic of Tumblr: the most popular user has more than 4 million with the degree of his friends. This is also called degree correlation
audience, while more than 40% of users are purely audience since or degree assortativity [18; 19]. Over the derived r-graph, we obtain
they dont have any followers. a correlation of 0.106 between terminal nodes of reciprocate con-
Figure 3(a) demonstrates the distribution of in-degrees in the blue nections, reconrming the positive degree assortativity as reported
curve and that of out-degrees in the red curve, where y-axis refers in Twitter [11]. Nevertheless, compared with the strong social net-
to the cumulated density distribution function (CCDF): the proba- work Facebook, Tumblrs degree assortativity is weaker (0.106 vs.
bility that accounts have at least k in-degrees or out-degrees, i.e., 0.226).
P (K >= k). It is observed that Tumblr users in-degree follows Degree of Separation. Small world phenomenon is almost uni-
a power-law distribution with exponent 2.19, which is quite sim- versal among social networks. With this huge Tumblr network,
ilar from the power law exponent of Twitter at 2.28 [11] or that we are able to validate the well-known six degrees of separation
of traditional blogs at 2.38 [21]. This also conrms with earlier as well. Figure 4 displays the distribution of the shortest paths in
empirical observation that most social network have a power-law the network. To approximate the distribution, we randomly sample
exponent between 2 and 3 [6]. 60,000 nodes as seed and calculate for each node the shortest paths
In regard to out-degree distribution, we notice the red curve has a to other nodes. It is observed that the distribution of paths length
big drop when out-degree is around 5000, since there was a limit reaches its mode with the highest probability at 4 hops, and has a
that ordinary Tumblr users can follow at most 5000 other users. median of 5 hops. On average, the distance between two connected
Tumblr users out-degree does not follow a power-law distribution, nodes is 4.7. Even though the longest shortest path in the approxi-
which is similar to blogosphere of traditional blogging [21]. mation has 29 hops, 90% of shortest paths are within 5.4 hops. All
If we explore users in-degree and out-degree together, we could these numbers are close to those reported on Facebook and Twitter,
generate normalized 3-D histogram in Figure 3(b). As both in- yet signicantly smaller than that obtained over blogosphere and
degree and out-degree follow the heavy-tail distribution, we only instant messenger network [13].
zoom in those user who have less than 210 in-degrees and out- Component Size. The previous result shows that those users who
degrees. Apparently, there is a positive correlation between in- are connected have a small average distance. It relies on the as-
degree and out-degree because of the dominance of diagonal bars. sumption that most users are connected to each other, which we
In aggregation, a user with low in-degree tends to have low out- shall conrm immediately. Because the Tumblr graph is directed,
degree as well, even though some nodes, especially those top pop- we compute out all weakly-connected components by ignoring the
ular ones, have very imbalanced in-degree and out-degree. direction of edges. It turns out the giant connected component
Reciprocity. Since Tumblr is a directed network, we would like to (GCC) encompasses 99.61% of nodes in the graph. Over the de-
examine the reciprocity of the graph. We derive the backbone of the rived r-graph, 97.55% are residing in the corresponding GCC. This
Tumblr network by keeping those reciprocal connections only, i.e., nding suggests the whole graph is almost just one connected com-
user a follows b and vice versa. Let r-graph denote the correspond- ponent, and almost all users can reach others through just few hops.
ing reciprocal graph. We found 29.03% of Tumblr user pairs have To give a palpable understanding, we summarize commonly used
reciprocity relationship, which is higher than 22.1% of reciprocity network statistics in Table 1. Those numbers from other popular
on Twitter [11] and 3% of reciprocity on Blogosphere [21], indicat- social networks (blogosphere, Twitter, Facebook, and MSN) are
ing a stronger interaction between users in the network. Figure 3(c) also included for comparison. From this compact view, it is obvi-
shows the distribution of degrees in the r-graph. There is a turning ous traditional blogs yield a signicantly different network struc-
ture. Tumblr, even though originally proposed for blogging, yields
11
http://equipo.tumblr.com a network structure that is more similar to Twitter and Facebook.
12
http://instagram.tumblr.com
13
http://woodendreams.tumblr.com

SIGKDD Explorations Volume 16, Issue 1 Page 23


0 0
10 0.2 10
InDegree

Percentage of Users
OutDegree
0.15
2 2
10 10
0.1

0.05
CCDF

CCDF
4 4
10 10
0
0
1
6 2 6
10 3 10 10
4 9
5 8
6 7
7 6
5
8
9 3 4
10 2
0 1
8 8
10 X 10
InDegree = 2
0
10
2
10 10
4 6
10 10
8
OutDegree = 2Y 0
10
1
10 10
2
10
3 4
10
5
10
InDegree or OutDegree InDegree (same to OutDegree)

(a) in/out degree distribution (b) in/out degree correlation (c) degree distribution in r-graph

Figure 3: Degree Distribution of Tumblr Network

10
0 Text Post Photo Caption
Dataset Dataset
10
2
# Posts 21.5 M 26.3 M
4
Mean Post Length 426.7 Bytes 64.3 Bytes
10
Median Post Length 87 Bytes 29 Bytes
PDF

10
6 Max Post Length 446.0 K Bytes 485.5 K Bytes

10
8 Table 2: Statistics of User Generated Contents
10
10
0 5 10 15 20 25 30
Shortest Path Length key difference: Tumblr has no length limit while Twitter enforces
1 the strict limitation of 140 bytes for each tweet. How does this key
difference affect user post behavior?
0.8
It has been reported that the average length of posts on Twitter is
67.9 bytes and the median is 60 bytes14 . Corresponding statistics
0.6
of Tumblr are shown in Table 2. For the text post dataset, the aver-
CDF

0.4
age length is 426.7 bytes and the median is 87 bytes, which both,
as expected, are longer than that of Twitter. Keep in mind Tum-
0.2 blrs numbers are obtained after removing all quotes, photos and
URLs, which further discounts the discrepancy between Tumblr
0
0 5 10 15 20 25 30 and Twitter. The big gap between mean and median is due to a
Shortest Path Length
small percentage of extremely long posts. For instance, the longest
text post is 446K bytes in our sampled dataset. As for photo cap-
Figure 4: Shortest Path Distribution tions, naturally we expect it to be much shorter than text posts.
The average length is around 64.3 bytes, but the median is only 29
bytes. Although photo posts are dominant in Tumblr, the number
of text posts and photo captions in Table 2 are comparable, because
4. TUMBLR AS BLOGOSPHERE FOR majority of photo posts dont contain any raw photo captions.
CONTENT GENERATION A further related question: is the 140-byte limit sensible? We plot
As Tumblr is initially proposed for the purpose of blogging, here post length distribution of the text post dataset, and zoom into less
we analyze its user generated contents. As described earlier, photo than 280 bytes in Figure 5. About 24.48% of posts are beyond
and text posts account for more than 92% of total posts. Hence, we 140 bytes, which indicates that at least around one quarter of posts
concentrate only on these two types of posts. One text post may will have to be rewritten in a more compact version if the limit was
contain URL, quote or raw message. In this study, we are mainly enforced in Tumblr.
interested in the authentic contents generated by users. Hence, we Blending all numbers above together, we can see at least two types
extract raw messages as the content information of each text post, of posts: one is more like posting a reference (URL or photo) with
by removing quotes and URLs. Similarly, photo posts contains 3 added information or short comments, the other is authentic user
categories of information: photo URL, quote photo caption, raw generated content like in traditional blogging. In other words, Tum-
photo caption. While the photo URL might contain lots of addi- blr is a mix of both types of posts, and its no-length-limit policy
tional meta information, it would require tremendous effort to ana- encourages its users to post longer high-quality content directly.
lyze all images in Tumblr. Hence, we focus on raw photo captions What are people talking about? Because there is no length limit
as the content of each photo post. We end up with two datasets of on Tumblr, the blog post tends to be more meaningful, which al-
content: one is text post, and the other is photo caption.
Whats the effect of no length limit for post? Both Tumblr and 14
http://www.quora.com/Twitter-1/What-is-the-average-length-of-
Twitter are considered microblogging platforms, yet there is one a-tweet

SIGKDD Explorations Volume 16, Issue 1 Page 24


1
Topic Topical Keywords
Pets cat dog cute upload kitty batch puppy
0.8 pet animal kitten adorable
Scenery summer beach sun sky sunset sea nature
CCDF
0.6
ocean island clouds lake pool beautiful
Pop music song rock band album listen lyrics
0.4
Music punk guitar dj pop sound hip
0.2 Photography photo instagram pic picture check
daily shoot tbt photography
0
0 50 100 150 200 250 300
Sports team world ball win football club
Post Length (Bytes) round false soccer league baseball
Medical body pain skin brain depression hospital
Figure 5: Post Length Distribution teeth drugs problems sick cancer blood

Topic Topical Keywords Table 4: Topical Keywords from Photo Caption Dataset
Pop music song listen iframe band album lyrics
Music video guitar
Sports game play team win video cookie For brevity, we just show the result for text post dataset as similar
ball football top sims fun beat league patterns were observed over photo captions.
Internet internet computer laptop google search online The patterns are strong in both gures. Those users who have
site facebook drop website app mobile iphone higher in-degree tend to post more, in terms of both mean and me-
Pets big dog cat animal pet animals bear tiny dian. One caveat is that what we observe and report here is merely
small deal puppy correlation, and it does not derive causality. Here we draw a con-
Medical anxiety pain hospital mental panic cancer servative conclusion that the social popularity is highly positively
depression brain stress medical correlated with user blog frequency. A similar positive correlation
Finance money pay store loan online interest buying is also observed in Twitter[11].
bank apply card credit In contrast, the pattern in terms of user registration time is beyond
our imagination until we draw the gure. Surprisingly, those users
Table 3: Topical Keywords from Text Post Dataset who either register earliest or register latest tend to post less fre-
quently. Those who are in between are inclined to post more fre-
quently. Obviously, our initial hypothesis about the incentive for
lows us to run topic analysis over the two datasets to have an overview new users to blog more is invalid. There could be different expla-
of the content. We run LDA [4] with 100 topics on both datasets, nations in hindsight. Rather than guessing the underlying explana-
and showcase several topics and their corresponding keywords on tion, we decide to leave this phenomenon as an open question to
Tables 3 and 4, which also show the high quality of textual content future researchers.
on Tumblr clearly. Medical, Pets, Pop Music, Sports are shared in- As for reference, we also look at average post-length of users, be-
terests across 2 different datasets, although representative topical cause it has been adopted as a simple metric to approximate quality
keywords might be different even for the same topic. Finance, In- of blog posts [1]. The corresponding correlations are plot in Fig-
ternet only attracts enough attentions from text posts, while only ure 7. In terms of post length, the tail users in social networks are
signicant amount of photo posts show interest to Photography, the winner. Meanwhile, long-term or recently-joined users tend to
Scenery topics. We want to emphasize that most of these keywords post longer blogs. Apparently, this pattern is exactly opposite to
are semantically meaningful and representative of the topics. post frequency. That is, the more frequent one blogs, the shorter
Who are the major contributors of contents? There are two po- the blog post is. And less frequent bloggers tend to have longer
tential hypotheses. 1) One supposes those socially popular users posts. That is totally valid considering each individual has limited
post more. This is derived from the result that those popular users time and resources. We even changed the post length to the max-
are followed by many users, therefore blogging is one way to at- imum for each individual user rather than average, but the pattern
tract more audience as followers. Meanwhile, it might be true that remains still.
blogging is an incentive for celebrities to interact or reward their In summary, without the post length limitation, Tumblr users are
followers. 2) The other assumes that long-term users (in terms of inclined to write longer blogs, and thus leading to higher-quality
registration time) post more, since they are accustomed to this ser- user generated content, which can be leveraged for topic analysis.
vice, and they are more likely to have their own focused commu- The social celebrities (those with large number of followers) are
nities or social circles. These peer interactions encourage them to the main contributors of contents, which is similar to Twitter [24].
generate more authentic content to share with others. Surprisingly, long-term users and recently-registered users tend to
Do socially popular users or long-term users generate more con- blog less frequently. The post-length in general has a negative cor-
tents? In order to answer this question, we choose a xed time relation with post frequency. The more frequently one posts, the
window of two weeks in August 2013 and examine how frequent shorter those posts tend to be.
each user blogs on Tumblr. We sort all users based on their in-
degree (or duration time since registration) and then partition them
into 10 equi-width bins. For each bin, we calculate the average
5. TUMBLR FOR INFORMATION PROPA-
blogging frequency. For easy comparison, we consider the maxi- GATION
mal value of all bins as 1, and normalize the relative ratio for other Tumblr offers one feature which is missing in traditional blog ser-
bins. The results are displayed in Figure 6, where x-axis from left to vices: reblog. Once a user posts a blog, other users in Tumblr can
right indicates increasing in-degree (or decreasing duration time). reblog to comment or broadcast to their own followers. This en-

SIGKDD Explorations Volume 16, Issue 1 Page 25


1.2 1.2
Mean of Post Frequency Mean of Post Length
Median of Post Frequency Median of Post Length
1 1
Normalized Post Frequency

Normalized Post Length


0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0 0
InDegree from Low to High along xAxis InDegree from Low to High along xAxis

1.2 1.2
Mean of Post Frequency Mean of Post Length
Median of Post Frequency Median of Post Length
1 1
Normalized Post Frequency

Normalized Post Length


0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0 0
Registration Time from Early to Late along xAxis Registration Time from Early to Late along xAxis

Figure 6: Correlation of Post Frequency with User In-degree or Figure 7: Correlation of Post Length with User In-degree or Dura-
Duration Time since Registration tion Time since Registration

ables information to be propagated through the network. In this of reblog cascade involving few reblog events. Yet, within a time
section, we examine the reblogging patterns in Tumblr. We exam- window of two weeks, the maximum cascade could reach 116.6K.
ine all blog posts uploaded within the rst 2 weeks, and count re- In order to have a detailed understanding of reblog cascades, we
blog events in the subsequent 2 weeks right after the blog is posted, zoom into the short head and plot the CCDF up to reblog cascade
so that there would be no bias because of the time window selection size equivalent to 20 in Figure 9. It is observed that only about
in our blog data. 19.32% of reblog cascades have size greater than 10. By contrast,
Who are reblogging? Firstly, we would like to understand which only 1% of retweet cascades have size larger than 10 [11]. The re-
users tend to reblog more? Those people who reblog frequently blog cascades in Tumblr tend to be larger than retweet cascades in
serves as the information transmitter. Similar to the previous sec- Twitter.
tion, we examine the correlation of reblogging behavior with users Reblog depth distribution. As shown in previous sections, almost
in-degree. As shown in the Figure 8, social celebrities, who are the any pair of users are connected through few hops. How many hops
major source of contents, reblog a lot more compared with other does one blog to propagate to another user in reality? Hence, we
users. This reblogging is propagated further through their huge look at the reblog cascade depth, the maximum number of nodes to
number of followers. Hence, they serve as both content contrib- pass in order to reach one leaf node from the root node in the reblog
utor and information transmitter. On the other hand, users who cascade structure. Note that reblog depth and size are different. A
registered earlier reblog more as well. The socially popular and cascade of depth 2 can involve hundreds of nodes if every other
long-term users are the backbone of Tumblr network to make it a node in the cascade reblogs the same root node.
vibrant community for information propagation and sharing. Figure 10 plots the distribution of number of hops: again, the reblog
Reblog size distribution. Once a blog is posted, it can be re- cascade depth distribution follows a power law as well according
blogged by others. Those reblogs can be reblogged even further, to the PDF; when zooming into the CCDF, we observe that only
which leads to a tree structure, which is called reblog cascade, with 9.21% of reblog cascades have depth larger than 6. That is, major-
the rst author being the root node. The reblog cascade size indi- ity of cascades can reach just few hops, which is consistent with the
cates the number of reblog actions that have been involved in the ndings reported over Twitter [3]. Actually, 53.31% of cascades in
cascade. Figure 9 plots the distribution of reblog cascade sizes. Tumblr have depth 2. Nevertheless, the maximum depth among all
Not surprisingly, it follows a power-law distribution, with majority cascades can reach 241 based on two week data. This looks un-

SIGKDD Explorations Volume 16, Issue 1 Page 26


0
10
1.2
Mean of Reblog Frequency
Median of Reblog Frequency 10
2

1
Normalized Reblog Frequency

PDF
4
10
0.8

6
10
0.6

8
10 0 2 4 6
0.4 10 10 10 10
Reblog Cascade Size

0.2 1

0.8
0
InDegree from Low to High along xAxis
0.6

CCDF
1.2 0.4
Mean of Reblog Frequency
Median of Reblog Frequency 0.2
1
Normalized Reblog Frequency

0
0 5 10 15 20 25
0.8 Reblog Cascade Size

0.6 Figure 9: Distribution of Reblog Cascade Size

0.4
is posted, the less likely it would be reblogged. 75.03% of rst re-
blog arrive within the rst hour since a blog is posted, and 95.84%
0.2
of rst reblog appears within one day. Comparatively, It has been
reported that half of retweeting occurs within an hour and 75%
0
Registration Time from Early to Late along xAxis under a day [11] on Twitter. In short, Tumblr reblog has a strong
bias toward recency, and information propagation on Tumblr is fast.

Figure 8: Correlation of Reblog Frequency with User In-degree or


Duration Time since Registration
6. RELATED WORK
There are rich literatures on both existing and emerging online so-
cial network services. Statistical patterns across different types of
social networks are reported, including traditional blogosphere [21],
likely at rst glimpse, considering any two users are just few hops user-generated content platforms like Flickr, Youtube and Live-
away. Indeed, this is because users can add comment while reblog- Journal [15], Twitter [10; 11], instant messenger network [13],
ging, and thus one user is likely to involve in one reblog cascade Facebook [22], and Pinterest [7; 20]. Majority of them observe
multiple times. We notice that some Tumblr users adopt reblog as shared patterns such as long tail distribution for user degrees (power
one way for conversation or chat. law or power law with exponential cut-off), small (90% quantile ef-
Reblog Structure Distribution. Since most reblog cascades are fective) diameter, positive degree association, homophily effect in
few hops, here we show the cascade tree structure distribution up terms of user proles (age or location), but not with respect to gen-
to size 5 in Figure 11. The structures are sorted based on their cov- der. Indeed, people are more likely to talk to the opposite sex [13].
erage. Apparently, a substantial percentage of cascades (36.05%) The recent study of Pinterest observed that ladies tend to be more
are of size 2, i.e., a post being reblogged merely once. Generally active and engaged than men [20], and women and men have differ-
speaking, a reblog cascade of a at structure tends to have a higher ent interests [5]. We have compared Tumblrs patterns with other
probability than a reblog cascade of the same size but with a deep social networks in Table 1 and observed that most of those trend
structure. For instance, a reblog cascade of size 3 have two vari- hold in Tumblr except for some number difference.
ants, of which the at one covers 9.42% cascade while the deep Lampe et al. [12] did a set of survey studies on Facebook users,
one drops to 5.85%. The same patten applies to reblog cascades and shown that people use Facebook to maintain existing ofine
of size 4 and 5. In other words, it is easier to spread a message connections. Java et al. [10] presented one of the earliest re-
widely rather than deeply in general. This implies that it might be search paper for Twitter, and found that users leverage Twitter to
acceptable to consider only the cascade effect under few hops and talk their daily activities and to seek or share information. In ad-
focus those nodes with larger audience when one tries to maximize dition, Schwartz [7] is one of the early studies on Pinterest, and
inuence or information propagation. from a statistical point of view that female users repin more but
Temporal patten of reblog. We have investigated the information with fewer followers than male users. While Hochman and Raz [8]
propagation spatially in terms of network topology, now we study published an early paper using Instagram data, and indicated differ-
how fast for one blog to be reblogged? Figure 12 displays the dis- ences in local color usage, cultural production rate, for the analysis
tribution of time gap between a post and its rst reblog. There is of location-based visual information ows.
a strong bias toward recency. The larger the time gap since a blog Existing studies on user inuence are based on social networks or

SIGKDD Explorations Volume 16, Issue 1 Page 27


36.05% 9.42% 5.85% 3.58% 1.69% 1.44% 2.78% 1.20% 1.15% 0.58% 0.51% 0.42% 0.33% 0.31% 0.24% 0.21%

Figure 11: Cascade Structure Distribution up to Size 5. The percentage at the top is the coverage of cascade structure.

0
10 1

2 0.8
10

0.6
PDF

CDF
4
10
0.4
6
10
0.2

8
10 0 1 2 3 0
10 10 10 10 1m 10m 1h 1d 1w
Reblog Cascade Depth Lag Time of First Reblog

1
Figure 12: Distribution of Time Lag between a Blog and its rst
0.8
Reblog

7. CONCLUSIONS AND FUTURE WORK


0.6
CCDF

0.4 In this paper, we provide a statistical overview of Tumblr in terms


of social network structure, content generation and information prop-
0.2
agation. We show that Tumblr serves as a social network, a blo-
gosphere and social media simultaneously. It provides high qual-
0
0 5 10 15
Reblog Cascade Depth
20 25 ity content with rich multimedia information, which offers unique
characteristics to attract youngsters. Meanwhile, we also summa-
rize and offer as rigorous comparison as possible with other social
Figure 10: Distribution of Reblog Cascade Depth services based on numbers reported in other papers. Below we
highlight some key ndings:
With multimedia support in Tumblr, photos and text account
content analysis. McGlohon et al. [14] found topology features for majority of blog posts, while audios and videos are still
can help us distinguish blogs, the temporal activity of blogs is very rare.
non-uniform and bursty, but it is self-similar. Bakshy et al. [3]
investigated the attributes and relative inuence based on Twitter Tumblr, though initially proposed for blogging, yields a sig-
follower graph, and concluded that word-of-mouth diffusion can nicantly different network structure from traditional blogo-
only be harnessed reliably by targeting large numbers of potential sphere. Tumblrs network is much denser and better con-
inuencers, thereby capturing average effects. Hopcroft et al. [9] nected. Close to 29.03% of connections on Tumblr are re-
studied the Twitter user inuence based on two-way reciprocal rela- ciprocate, while blogosphere has only 3%. The average dis-
tionship prediction. Weng et al. [23] extended PageRank algorithm tance between two users in Tumblr is 4.7, which is roughly
to measure the inuence of Twitter users, and took both the topi- half of that in blogosphere. The giant connected component
cal similarity between users and link structure into account. Kwak covers 99.61% of nodes as compared to 75% in blogosphere.
et al. [11] study the topological and geographical properties on Tumblr network is highly similar to Twitter and Facebook,
the entire Twittersphere and they observe some notable properties with power-law distribution for in-degree distribution, non-
of Twitter, such as a non-power-law follower distribution, a short power law out-degree distribution, positive degree associa-
effective diameter, and low reciprocity, marking a deviation from tivity for reciprocate connections, small distance between
known characteristics of human social networks. connected nodes, and a dominant giant connected compo-
However, due to data access limitation, majority of the existing nent.
scholar papers are based on either Twitter data or traditional blog-
ging data. This work closes the gap by providing the rst overview Without post length limitation, Tumblr users tend to post
of Tumblr so that others can leverage as a stepstone to investigate longer. Approximately 1/4 of text posts have authentic con-
more over this evolving social service or compare with other related tents beyond 140 bytes, implying a substantial portion of
services. high quality blog posts for other tasks like topic

SIGKDD Explorations Volume 16, Issue 1 Page 28


Those social celebrities tend to be more active. They post [8] N. Hochman and R. Schwartz. Visualizing instagram: Tracing
analysis and text mining. and reblog more frequently, serv- cultural visual rhythms. In Proceedings of the Workshop on
ing as both content generators and information transmitters. Social Media Visualization (SocMedVis) in conjunction with
Moreover, frequent bloggers like to write short, while infre- ICWSM, 2012.
quent bloggers spend more effort in writing longer posts.
[9] J. E. Hopcroft, T. Lou, and J. Tang. Who will follow you
In terms of duration since registration, those long-term users back?: reciprocal relationship prediction. In Proceedings of
and recently registered users post less frequently. Yet, long- ACM International Conference on Information and Knowl-
term users reblog more. edge Management (CIKM), pages 11371146, 2011.

Majority of reblog cascades are tiny in terms of both size [10] A. Java, X. Song, T. Finin, and B. Tseng. Why we twit-
and depth, though extreme ones are not uncommon. It is rel- ter: understanding microblogging usage and communities. In
atively easier to propagate a message wide but shallow rather WebKDD/SNA-KDD 07, pages 5665, New York, NY, USA,
than deep, suggesting the priority for inuence maximization 2007. ACM.
or information propagation. [11] H. Kwak, C. Lee, H. Park, and S. B. Moon. What is twitter, a
social network or a news media. In Proceedings of 19th Inter-
Compared with Twitter, Tumblr is more vibrant and faster in
national World Wide Web Conference (WWW), 2010.
terms of reblog and interactions. Tumblr reblog has a strong
bias toward recency. Approximately 3/4 of the rst reblogs [12] C. Lampe, N. Ellison, and C. Steineld. A familiar
occur within the rst hour and 95.84% appear within one face(book): Prole elements as signals in an online social net-
day. work. In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (CHI), 2007.
This snapshot research is by no means to be complete. There are
several directions to extend this work. First, some patterns de- [13] J. Leskovec and E. Horvitz. Planetary-scale views on a large
scribed here are correlations. They do not illustrate the underlying instant-messaging network. In WWW 08: Proceeding of the
mechanism. It is imperative to differentiate correlation and causal- 17th international conference on World Wide Web, pages
ity [2] so that we can better understand the user behavior. Secondly, 915924, New York, NY, USA, 2008. ACM.
it is observed that Tumblr is very popular among young users, as [14] M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst, and N. S.
half of Tumblrs visitor base being under 25 years old. Why is it Glance. Finding patterns in blog shapes and blog evolution.
so? We need to combine content analysis, social network analysis, In Proceedings of ICWSM, 2007.
together with user proles to gure out. In addition, since more
than 70% of Tumblr posts are images, it is necessary to go beyond [15] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and
photo captions, and analyze image content together with other meta B. Bhattacharjee. Measurement and analysis of online social
information. networks. In IMC 07: Proceedings of the 7th ACM SIG-
COMM conference on Internet measurement, pages 2942,
New York, NY, USA, 2007. ACM.
8. REFERENCES
[16] S. Mittal, N. Gupta, P. Dewan, and P. Kumaraguru. The pin-
[1] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the in- bang theory: Discovering the pinterest world. arXiv preprint
uential bloggers in a community. In Proceedings of WSDM, arXiv:1307.4952, 2013.
2008. [17] B. Nardi, D. J. Schiano, S. Gumbrecht, and L. Swartz. Why
[2] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Inuence we blog. Commun. ACM, 47(12):4146, 2004.
and correlation in social networks. In Proceedings of KDD, [18] M. E. J. Newman. Assortative mixing in networks. Physical
2008. review letters, 89(20): 208701, 2002.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Ev- [19] M. E. J. Newman. Mixing patterns in networks. Physical Re-
eryones an inuencer: quantifying inuence on twitter. In view E, 67(2): 026126, 2003.
Proceedings of WSDM, 2011.
[20] R. Ottoni, J. P. Pesce, D. Las Casas, G. Franciscani, P. Ku-
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan:. Latent dirichlet allo- maruguru, and V. Almeida. Ladies rst: Analyzing gen-
cation. Journal of Machine Learning Research, 3:9931022, der roles and behaviors in pinterest. Proceedings of ICWSM,
2003. 2013.

[5] S. Chang, V. Kumar, E. Gilbert, and L. Terveen. Specializa- [21] X. Shi, B. Tseng, , and L. A. Adamic. Looking at the blogo-
tion, homophily, and gender in a social curation site: Findings sphere topology through different lenses. In Proceedings of
from pinterest. In Proceedings of The 17th ACM Conference ICWSM, 2007.
on Computer Supported Cooperative Work and Social Com- [22] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow.
puting, CSCW14, 2014. The anatomy of the facebook social graph. arXiv preprint
arXiv:1111.4503, 2011.
[6] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law
distributions in empirical data. arXiv, 706, 2007. [23] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: nd-
ing topic-sensitive inuential twitterers. In Proceedings of
[7] E. Gilbert, S. Bakhshi, S. Chang, and L. Terveen. i need to WSDM, pages 11371146, 2010.
try this!: A statistical overview of pinterest. In Proceedings
of the SIGCHI Conference on Human Factors in Computing [24] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who
Systems (CHI), 2013. says what to whom on twitter. In Proceedings of WWW 2011.

SIGKDD Explorations Volume 16, Issue 1 Page 29


Change Detection in Streaming Data in the Era of Big
Change Detection
Data:inModels
Streaming
and Data in the Era of Big
Issues
Data: Models and Issues
Dang-Hoan Tran Mohamed Medhat Gaber Kai-Uwe Sattler
Dang-Hoan
Vietnam Tran
Maritime University Mohamed MedhatScience
School of Computing Gaber Kai-Uwe
Ilmenau Sattlerof
University
484 Lach
Vietnam Tray, Ngo
Maritime Quyen
and Digital
University
School Media, Robert
of Computing Science Technology
Ilmenau University of
484Haiphong,
Lach Tray, Vietnam
Ngo Quyen
andGordon
Digital Media, Robert
University POTechnology
Box 100565
Haiphong, Vietnam Gordon
Riverside EastUniversity
Garthdee Road PO Box Germany
Ilmenau, 100565
dang- Riverside East Garthdee Road
Aberdeen Ilmenau, Germany
dang-
hoan.tran@vimaru.edu.vn Aberdeen kus@tu-ilmenau.de
AB10 7GJ, UK kus@tu-ilmenau.de
hoan.tran@vimaru.edu.vn AB10 7GJ, UK
m.gaber1@rgu.ac.uk
m.gaber1@rgu.ac.uk
ABSTRACT real-world systems such as InforSphere Streams (IBM)1 , Rapid-
ABSTRACT real-world
miner systems
Streams such2 ,asStreamBase
Plugin InforSphere3 ,Streams
MOA 4(IBM) , AnduIN
1 , Rapid-
5 . In
Big Data is identified by its three Vs, namely velocity, volume, miner Streams Plugin 2 , StreamBase 3 4 5 . In
Big Data is identified by data
its three Vs, processing
namely velocity, volume, order to deal with the high-speed data ,streams,
MOA a, AnduIN hybrid model
and variety. The area of stream has long dealt order to deal with the high-speed data streams, a hybrid model
and that combines the advantages of both parallel batch processing
with variety. The two
the former areaVsofvelocity
data stream processing
and volume. Overhasa long dealt
decade of that combines the advantages of both model
parallelis batch processing
with the former two Vs velocity and volume. Over a decade of model and streaming data processing proposed. Some
intensive research, the community has provided many important model
intensive research, theincommunity has provided projectsand for streaming
such hybrid data processing
model include model is proposed.
S4 66 , Storm 7 , and GrokSome8
research discoveries the area. The third V ofmanyBig important
Data has projects for such hybrid model include S4 , Storm 7 , and Grok 8
research .
been the discoveries
result of socialin the area.and
media Thethethird
largeV unstructured
of Big Data data
has
.One of these challenges facing data stream processing and min-
been
it the result
generates. of socialtechniques
Streaming media andhave the large unstructured
also been proposeddata
re-
it generates. Streaming techniques have also been proposed re- One
ing isofthethese challenges
changing naturefacing data stream
of streaming data.processing
Therefore,and the min-
abil-
cently addressing this emerging need. However, a hidden factor ing is the changing nature of streaming data. Therefore, the abil-
cently addressing this emerging need. However, a hidden
can represent an important fourth V, that is variability or change. factor ity to identify trends, patterns, and changes in the underlying pro-
can ity to identify
cesses trends,
generating datapatterns, and changes
contributes in the underlying
to the success of processing pro-
Our represent
world is an important
changing fourthand
rapidly, V, that is variability
accounting or change.
to variability is cesses generating data contributes to the success of processing
Our world is changing rapidly, and accounting to variability
a crucial success factor. This paper provides a survey of change is and mining massive high-speed data streams.
adetection
crucial success factor. This paper provides data.
a survey and mining massive high-speed data streams.
techniques as applied to streaming Theofreview
change
is A model of continuous distributed monitoring has been recently
A model of continuous distributed monitoring has been recently
detection techniques as applied to streaming data. The review is proposed to deal with streaming data coming from multiple
timely with the rise of Big Data technologies, and the need to proposed to deal with streaming data coming from multiple
timely with the rise of Big Data technologies, and the need to sources. This model has many observers where each observer
have this important aspect highlighted and its techniques catego- sources. This model has many observers where each observer
have this important aspect highlighted and its techniques catego- monitors a single data stream. The goal of continuous distributed
rized and detailed. monitors a single data stream. The goal of continuous distributed
rized and detailed. monitoring is to perform some tasks that need to aggregate the in-
monitoring is to perform some tasks that need to aggregate the in-
coming data from the observers. The continuous distributed mon-
coming data from the observers. The continuous distributed mon-
itoring is applied to monitor networks such as sensor networks,
1.
1. INTRODUCTION
INTRODUCTION itoring is applied to monitor networks such as sensor networks,
social networks, networks of ISP [11].
Todays world is changing very fast. The changes occur in every social networks, networks of ISP [11].
Todays world is changing very fast. The changes occur in every Change
Change detection
detection is is the
the process
process of of identifying
identifying differences
differences in in the
the
aspects
aspects of of life.
life. Therefore,
Therefore, thethe ability
ability to
to detect,
detect, adapt,
adapt, and
and react
react state
to state ofof an
an object
object or or phenomenon
phenomenon by by observing
observing it it at
at different
different
to the change play an important role in all aspects of life.
the change play an important role in all aspects of life. The
The times
times or or different
different locations
locations in in space.
space. InIn the
the streaming
streaming context,
context,
physical
physical world
world isis often
often represented
represented in in some
some model
model oror some
some infor-
infor- change
mation change detection is the process of segmenting aa data
detection is the process of segmenting data stream
stream into
into
mation system.
system. TheThe changes
changes inin the
the physical
physical world
world are
are reflected
reflected inin different
different segments
segments by by identifying
identifying the the points
points where
where the the stream
stream
terms
terms of the changes in data or model built from data. Therefore,
of the changes in data or model built from data. Therefore, dynamics
the dynamics change
change [53].
[53]. A A change
change detection
detection method
method consists
consists of of
the nature
nature ofof data
data is
is changing.
changing. the
the following
following tasks:
tasks: change
change detection
detection andand localization
localization of of change.
change.
The
The advance of technology results
advance of technology results inin the
the data
data deluge.
deluge. The
The data
data Change
volume Change detection
detection identifies
identifies whether
whether aa change
change occurs,
occurs, and and re-
re-
volume is is increasing
increasing with
with an
an estimated
estimated rate
rate of
of 50%
50% per
per year
year [39].
[39]. sponds
sponds to to the
the presence
presence of of such
such change.
change. Besides
Besides change
change detec-detec-
Data
Data flood
flood makes
makes traditional
traditional methods
methods including
including traditional
traditional dis-
dis- tion,
tributed tion, localization
localization of of changes
changes determines
determines the the location
location of of change.
change.
tributed framework and parallel models inappropriate for
framework and parallel models inappropriate for pro-
pro- The
The problem
problem of of locating
locating the the change
change has has been
been studied
studied in in statistics
statistics
cessing,
cessing, analyzing,
analyzing, storing,
storing, and
and understanding
understanding thesethese massive
massive data
data in
in the
the problems
problems of of change
change point
point detection.
detection.
sets.
sets. Data
Data deluge
deluge needs
needs aa new
new generation
generation of of computing
computing tools
tools that
that This
th This paper
paper presents
presents the the background
background issues
issues and and notation
notation relevant
relevant
Jim
Jim Gray
Gray calls
calls the
the 44th paradigm
paradigm in in scientific
scientific computing
computing [25].
[25]. Re-
Re- to the problem of change detection in data streams.
cently, there have been some emerging computing paradigms that
meet the requirements of Big Data as follows. Parallel batch pro-
cessing model only deals with the stationary massive data [17]. 2. CHANGE DETECTION IN STREAM-
However, evolving data continuously arrives with high speed. In ING DATA
fact, online data stream processing is the main approach to deal-
ing with the problem of three characteristics of Big Data includ- 1 http://www-01.ibm.com/software/data/infosphere/
ing big volume, big velocity, and big variety. Streaming data pro- streams/
cessing is a model of Big Data processing. Streaming data is tem- 2 http://www-ai.cs.uni-dortmund.de/auto?self=
poral data in nature. In addition to the temporal nature, streaming $eit184kc
data may include spatial characteristics. For example, geographic 3 http://www.streambase.com/
information systems can produce spatial-temporal data stream. 4 http://moa.cs.waikato.ac.nz/
Streaming data processing and mining have been deploying in 5 http://www.tu-ilmenau.de/dbis/research/anduin/
6 http://incubator.apache.org/s4/
7 https://github.com/nathanmarz/storm/wiki/Tutorial
8
8 https://www.numenta.com/grok_info.html

SIGKDD Explorations
SIGKDD Explorations Volume 16,
Volume 16, Issue
Issue 1
1 Page 30
Page 30
Streaming computational model is considered one of the widely- data. Change analysis both detects and explains the change. Hido
used models for processing and analyzing massive data. Stream- et al. [26] proposed a method for change analysis by using super-
ing data processing helps the decision-making process in real- vised learning.
time. A data
Streaming stream is defined
computational model as isfollows.
considered one of the widely- data. Change analysis both detects
D EFINITION 4. Change point and explains
detection is the change. Hido
identifying time
used models for processing and analyzing massive data. Stream- et al. [26] proposed a method for change analysis by using super-
points at which properties of time series data change[32]
ingDdata
EFINITION
processing 1. Ahelps
data the stream is an infinite sequence
decision-making process inofreal-
ele- vised learning.
ments
time. A data stream is defined as follows. Depending on specific application, change detection can be
    D EFINITION 4. Change point detection is identifying time
S = (X1 , T1 ) , .., X j , T j , ... (1) called in different terms such as burst detection, outlier detection,
D EFINITION 1. A data stream is an infinite sequence of ele- points at which properties of time series data change[32]
  or anomaly detection. Burst detection a special kind of change
ments
Each element is a pair X j , T j where X j is a d-dimensional vec- DependingBurst
detection. on specific
is a period application,
on streamchange detection sum
with aggregated can ex-be
tor X j = (x1 , x2 , ...,
S= xd) (X
arriving at the time stamp
1 , T1 ) , .., X j , T j , ...

T j . Time-stamp (1) called
ceeding in adifferent
threshold terms [31]. such as burst
Outlier detection,
detection is aoutlier
special detection,
kind of
is defined over discretedomain  with a total order. There are two or anomaly
change detection.
detection. Anomaly Burst detection
detection can a special
be seen kindasofa change
special
Each
types element
of time-stamps: X j , T jtime-stamp
is a pairexplicit where X j is agenerated
d-dimensional vec-
when data detection.
type of changeBurstdetection
is a period on streamdata.
in streaming with aggregated sum ex-
tor
arrive; (x1 , x2 ,time-stamp
X j =implicit ..., xd ) arriving at the time
is assigned stampdata
by some T j . Time-stamp
stream pro- ceeding
To find aasolution
threshold to the[31].problem
Outlierofdetection is a special
change detection, wekindshouldof
is defined
cessing over discrete domain with a total order. There are two
system. change detection.
consider the aspectsAnomaly of changedetection can be
of the system in seen
whichaswea want special to
types of time-stamps: explicit time-stamp is generated when data type
detect.of As
change
shown detection
in [52],in thestreaming
followingdata. aspects of change, which
Streaming data time-stamp
arrive; implicit includes theis fundamental
assigned by some characteristics
data stream as pro-
fol- To
mustfindbeaconsidered,
solution to the problem
include of change
subject of change, detection,
type of wechange,
should
lows.
cessingFirst, data arrives continuously. Second, streaming data
system. consider
cause of the aspects
change, effectof change
of change,of the system of
response in which
change,wetemporal
want to
evolves overtime. Third, streaming data is noisy, corrupted. detect.
issues, As andshown
spatialinissues.
[52], the In following
particular,aspects
to design of change, which
an algorithm
Streaming
Forth, datainterfering
timely includes the fundamental
is important. Fromcharacteristics as fol-
the characteristics mustdetecting
be considered,
lows. First, data for changesinclude in sensor subject of change,
streaming data, thetypemajor
of change,
ques-
of streaming data arrives
and data continuously.
stream model, Second, streaming
data stream data
process- cause
tions we of change,
need to effect
answer of include:
change, Whatresponse of change,
is the system in temporal
which
evolves overtime. Third, streaming data is
ing and mining pose the following challenges. First, as streaming noisy, corrupted. issues, and spatial
Forth, timely interfering is important. From thedata characteristics the changes need toissues.
be detected?In particular,
What are tothe
design an algorithm
principles used to
data arrives rapidly, the techniques of streaming process and for
modeldetecting changesWhat
the problem? in sensor
is datastreaming
type? What data,arethethemajor ques-
constraints
of streaming
analysis must data
keepand data the
up with stream
data model, data stream
rate to prevent fromprocess-
the loss tions we need to answer include: What is the system in which
ing and mining pose the following of the problem? What is the physical subject of change? What is
of important information as well aschallenges.
avoid dataFirst, as streaming
redundancy. Sec- the changes
the meaningneed to be detected?
of change to the user? WhatHow aretothe principles
respond used to
and react to
data
ond, arrives rapidly,
as the speed of the techniques
streaming dataof is streaming
very high, data process
the data and
volume model the problem?
analysis must keep up with the data rate to prevent from the loss this change? How to What visualize is data
thistype?
change? What are the constraints
overcomes the processing capacity of the existing systems. Third, of the problem? What is the physical subject of change? What is
of A change detection method can fall into one of two types: batch
theimportant information
value of data decreasesasover welltime,
as avoid data redundancy.
the recent streaming data Sec-
is the meaning of change to the user? How to respond and react to
change detection and sequential change detection. Given a se-
ond, as the speed of streaming data is very high,
sufficient for many applications. Therefore, one can only capture the data volume this change? How to visualize this change?
overcomes the processing capacity of the existing systems. Third, quence of N observations x1 , .., xN , where N is invariant, the task
and process the data as soon as it is generated. A change detection method can fall into one of two types: batch
the value of data decreases over time, the recent streaming data is of a batch change detection method is deciding whether a change
change detection and sequential change detection. Given a se-
2.1
sufficientChange
for many Detection: Definitions
applications. Therefore, one canand onlyNota-
capture occurs at some point in the sequence by using all N available ob-
quence of N observations x1 , .., xN , where N is invariant, the task
tionthe data as soon as it is generated.
and process servations. When the arriving speed of data is too high, batch
of a batch change detection method is deciding whether a change
change detection is suitable. In other words, change detection
2.1 section
This Change Detection:
presents Definitions
concepts and classificationand Nota-
of changes occurs at some point in the sequence by using all N available ob-
method using two adjacent windows model will be used. How-
tiondetection methods. To develop a change detection
and change servations. When the arriving speed of data is too high, batch
ever, the drawback of batch change detection method is that its
method, we should understand what a change is. change detection is suitable. In other words, change detection
This section presents concepts and classification of changes running time is very large when detecting changes in a large
method using two adjacent windows model will be used. How-
andDchange detection
EFINITION methods.
2. Change To develop
is defined as thea change detection
difference in the amount of data. In contrast,
ever, the drawback of batchthe sequential
change change
detection detection
method prob-
is that its
method, we should understand what a change is.
state of an object or phenomenon over time and/or space [52; lem is based on the observations so far.
running time is very large when detecting changes in a large If no change is detected,
1].D EFINITION 2. Change is defined as the difference in the the next of
amount observation is processed.
data. In contrast, Whenever
the sequential a change
change is detected,
detection prob-
the change detector is reset.
lem is based on the observations so far. If no change is detected,
state
In theofview
an object or phenomenon
change is theover timeofand/or space [52;a Change
1].
of system, process transition from the next detection
observation methods can be classified
is processed. Wheneverinto the following
a change is detected,ap-
state of a system to another. In other words, a change can be de- proaches:
the changethreshold-based
detector is reset.change detection method; state-based
fined
In theas the of
view difference betweenisan
system, change theearlier
process state and a laterfrom
of transition state.a change
Change detection
detection method;
methods trend-based change
can be classified intodetection
the following method.ap-
An
stateimportant
of a system distinction
to another.between
In otherchange
words,and difference
a change can is
be that
de- A change threshold-based
proaches: detection algorithm changeshould meet method;
detection three main require-
state-based
afined
change
as therefers to a transition
difference betweeninantheearlier
state state
of an and
object or astate.
a later phe- ments
change[37]:detectionaccuracy,
method; promptness,
trend-basedand changeonline. The algorithm
detection method.
nomenon
An important overtime while the
distinction difference
between changemeans
andthe dissimilarity
difference in
is that should
A change detect as many
detection as possible
algorithm actual
should meet changethreepoints
main and gen-
require-
the characteristics
a change refers to of two objects.
a transition A change
in the state ofcan
an reflect
object theor ashort-
phe- erate
mentsas[37]:few asaccuracy,
possible false alarms. The
promptness, and algorithm
online. The should detect
algorithm
term trendovertime
nomenon or long-term
whiletrend. For example,
the difference meansa the
stock analyst may
dissimilarity in change point as
should detect as early
manyas as possible. The algorithm
possible actual change points shouldand be gen-
effi-
be
theinterested in theofshort-term
characteristics change
two objects. of thecan
A change stock price.the short-
reflect cient
erate assufficient for a realfalse
few as possible timealarms.
environment.
The algorithm should detect
Change
term trend detection is defined
or long-term asFor
trend. the example,
process ofa stock
identifying
analyst differ-
may Change
change point detection in data
as early stream allows
as possible. us to identify
The algorithm shouldthebetime-
effi-
ences in the state
be interested in theofshort-term
an object or phenomenon
change by observing
of the stock price. it at evolving trends,for
cient sufficient and time-evolving
a real patterns. Research issues on
time environment.
different times [54].
Change detection is In the above
defined as thedefinition,
process of a change
identifyingis detected
differ- mining
Changechangesdetection in in
data datastreams
stream include
allowsmodeling and representa-
us to identify the time-
on thein
ences basis
the of differences
state of an
of an object orobject at different
phenomenon times without
by observing it at tion of changes,
evolving trends, change-adaptive
and time-evolving mining method,
patterns. and interactive
Research issues on
considering
different times the[54].
differences of an definition,
In the above object in locations
a changeinisspace.
detectedIn exploration
mining changes of changes
in data [19].streams Change
includedetection
modeling playsandanrepresenta-
important
many
on thereal
basisworld applications,
of differences changes
of an objectcan occur both
at different timesin terms
withoutof role in changes,
tion of the field change-adaptive
of data stream analysis. Since change
mining method, in model
and interactive
both time and
considering thespace. For example,
differences multiple
of an object spatial-temporal
in locations in space. data
In may conveyofinteresting
exploration changes [19]. time-dependent
Change detection information
plays anand knowl-
important
streams
many real representing triple (latitude,
world applications, changeslongitude,
can occurtime)
both are created
in terms of role inthe
edge, thechange
field ofof data
the data stream
stream can beSince
analysis. used for understanding
change in model
both
in timeinformation
traffic and space. For example,
systems usingmultiple
GPS [23]. spatial-temporal
Hence, changedata de- maynature
the convey of interesting time-dependent
several applications. information
Basically, interesting andresearch
knowl-
streamscan
tection representing
be definedtriple (latitude, longitude, time) are created
as follows. edge, the change
problems on mining of thechanges
data stream can streams
in data be used for canunderstanding
be classified
in traffic information systems using GPS [23]. Hence, change de- the
intonature
three of several applications.
categories: modeling and Basically, interesting
representation research
of changes,
D EFINITION
tection 3. Change
can be defined detection is the process of identify-
as follows. problems
mining on mining
methods, changes in exploration
and interactive data streams of can be classified
changes. Change
ing differences in the state of an object or phenomenon by ob- into threealgorithm
detection categories: canmodeling
be used asand representationinof
a sub-procedure manychanges,
other
D EFINITION
serving 3. Change
it at different detection
times and/or is thelocations
different process in of space.
identify- mining
data methods,
stream mining and interactiveinexploration
algorithms order to deal ofwith
changes. Change
the changing
ing differences in the state of an object or phenomenon by ob- detection
data in dataalgorithm
streamscan [28;be4].used as a sub-procedure
A definition of changeindetection
many other for
A distinction
serving betweentimes
it at different concept driftdifferent
and/or detection and change
locations detec-
in space. data streamdata mining algorithms in order to deal with the changing
streaming is given as follows
tion is that concept drift detection focuses on the labeled data data in data streams [28; 4]. A definition of change detection for
A distinction
change between
detectionconcept
can dealdrift
withdetection and and
change detec-
while both labeled unlabeled D EFINITION
streaming data is5. givenChange detection is the process of segment-
as follows
tion is that concept drift detection focuses on the labeled data
while change detection can deal with both labeled and unlabeled D EFINITION 5. Change detection is the process of segment-

SIGKDD Explorations Volume 16, Issue 1 Page 31


SIGKDD Explorations Volume 16, Issue 1 Page 31
the changes of whose individual example is associated with a given class
the generated model label, otherwise, it is unlabeled data stream. A change de-
tection algorithm that identifies changes in the labeled data
Incoming
the changes of
stream is supervised
whose individual change
example detection [34;
is associated with5], whileclass
a given one
Data stream
Data Stream
the generated model detecting changesit in
label, otherwise, the unlabeled
is unlabeled data data
stream. stream is called
A change de-
Processing/Mining unsupervised change
tection algorithm that detection algorithm
identifies changes in [7]. The advan-
the labeled data
Incoming Model
tage
streamof is
thesupervised
supervisedchange approach is that the
detection [34;detection
5], whileaccu-one
Data stream
the changes of
Data Stream racy is high.
detecting However,
changes in thetheunlabeled
ground truth datadata mustisbecalled
stream gen-
Processing/Mining erated. Thus achange
unsupervised unsupervised
detection change detection
algorithm [7]. approach
The advan- is
data generating process
Model
tage of thetosupervised
preferred the supervisedapproach one isinthatcasethethe ground accu-
detection truth
the changes of racy is unavailable.
data high. However, the ground truth data must be gen-
Figure 1:
dataAgenerating
general process
diagram for detecting changes in data stream erated. Thus a unsupervised change detection approach is
Completeness
preferred to theofsupervised
statistical information:
one in case the Onground
the basis of
truth
the
datacompleteness
is unavailable.of statistical information, a change de-
ing a data
Figure 1: Astream into
general different
diagram forsegments
detectingbychanges
identifying thestream
in data points tection algorithm can fall into one of three following cat-
where the stream dynamics changes [53]. Completeness
egories. Parametric of statistical
change information:
detection schemes On thearebasisbasedof
the knowing
on completeness of statistical
the full information,
prior information a change
before and afterde-
As
ing data
a datastreams
streamevolve overtime
into different in nature,
segments there is growing
by identifying em-
the points tection algorithm
change. For example, can infalltheinto one of three
distributional following
change cat-
detection
phasis on detecting
where the changeschanges
stream dynamics not only[53].
in the underlying data dis- egories.
methods,Parametric change detection
the data distributions before and schemes are based
after change are
tribution, but also in the models generated by data stream process on knowing
known [41; 42].theAfull prior introduced
recently information beforeto and
method after
detecting
As data
and datastreams
stream evolve
mining.overtime
As can in benature,
seen inthere is growing
Figure em-
1, a change change.
changes For example,
in order stockin streams
the distributional
is a parametricchangemethod
detection in
phasis on detecting changes notoronly
theinstreaming
the underlying
model.data dis- methods,
can occur in the data stream, There- which thethe data distributions
distribution of streambefore of stockandorders
after change
confideare to
tribution,
fore, therebut arealso
twoin types
the models
of thegenerated
problemsbyofdata stream
change process
detection: known [41; 42]. A recently introduced method of to detecting
the Poisson distribution [37]. The advantage paramet-
and datadetection
stream mining. Asgenerating
can be seen in Figure 1, a change changes
change in the data process and change detec- ric change detection approaches is that they can method
in order stock streams is a parametric produceina
can occur in the data stream, or the streaming model. There- which
tion in the model generated by a data stream processing, or min- higher the distribution
accurate result of thanstream of stock orders
semi-parametric andconfide
nonpara- to
fore, there
ing. The are two types
fundamental ofofthe
issues problems
detecting of change
changes detection:
in data streams the Poisson distribution [37]. The advantage of paramet-
metric methods. However, in many real-time applications,
change
includesdetection in the data
andgenerating process and change detec- ric change detection approaches is that they can produce
tion in the
characterizing
model generated
quantifying
by a data
of changes
stream
and detect-
processing, or min-
data may not confine to any standard distribution, thusa
ing changes. A change detection method in streaming data needs higher
parametricaccurate result than
approaches semi-parametric
are inapplicable. and nonpara-
Semi-parametric
ing. The fundamental issues of detecting changes in data streams metric methods. However, in many real-time
a trade-off
includes
among space-efficiency, detection performance,
characterizing and quantifying of changes and detect-
and methods are based on the assumption that theapplications,
distribution
time-efficiency. data may not confine
of observations belongstoto any somestandard
class of distribution,
distribution func- thus
ing changes. A change detection method in streaming data needs parametric approaches are inapplicable. Semi-parametric
tion, and parameters of the distribution function change
2.2 Change
a trade-off among Detection Methods
space-efficiency, detection in Streaming
performance, and methods are based on the assumption that the distribution
in disorder moments. Recently, Kuncheva [36] has pro-
time-efficiency.
Data of observations belongs to some class of distribution func-
posed a semi-parametric method using a semi-parametric
tion, and parameters of the distribution function change
2.2 theChange
Over Detection
last 50 years, Methods
change detection in Streaming
has been widely studied log-likelihood for testing a change. Nonparametric meth-
in disorder moments. Recently, Kuncheva [36] has pro-
and applied
Data in both academic research and industry. For exam- ods make no distribution assumptions on the data. Non-
ple, it has been studied for a long time in the following fields: posed a semi-parametric method using a semi-parametric
Over the last 50 years, change detection has been widely studied parametric methods for detecting changes in the underly-
log-likelihood for testing a change. Nonparametric meth-
statistics, signal processing, and control theory. In recent years ing data distribution includes Wilcoxon, kernel method,
and applied in both academic research and industry. For exam- ods make no distribution assumptions on the data. Non-
many
ple, it change detection
has been studiedmethods have
for a long been
time in proposed for stream-
the following fields: Kullback-Leiber distance, and Kolmogorov-Smirnov test.
parametric methods for detecting changes in the underly-
ing data. The approaches to detecting changes in data streamyears
statistics, signal processing, and control theory. In recent can Nonparametric methods can be classified into two cate-
ing data distribution includes Wilcoxon, kernel method,
be classified
many changeasdetection
follows. methods have been proposed for stream- gories: nonparametric methods using window [33]; non-
Kullback-Leiber distance, and Kolmogorov-Smirnov test.
ing data. Thestream
approaches parametric methods without using window into [27].twoWe cate-
have
Data model:toAdetecting changes
data stream in data
can fall intostream
one of can
the Nonparametric methods can be classified
be classified as follows. paid
gories:particular attentionmethods
nonparametric to the nonparametric
using window change [33]; non- de-
following models: time series model, cash register model,
tection
parametricmethods
methodsusingwithout
windowusing because windowin many [27].real-world
We have
and
Dataturnstile model [41].
stream model: A data Onstream
the basis of the
can fall intodata
onestream
of the applications, theattention
distributions
model, there are change paid particular to theofnonparametric
both null hypothesis change and de-
following models: time detection algorithms
series model, developed
cash register model,for
alternative hypothesis are unknown ininadvance. Further-
the corresponding data stream model. Krishnamurthy et al tection methods using window because many real-world
and turnstile model [41]. On the basis of the data stream more, we arethe only interested of in both
recent data. A common
presented a sketch-based change detection applications, distributions null hypothesis and
model, there are change detection algorithmsmethod for the
developed for approach
most general streaming model model.
Turnstile model [35]. et al alternativetohypothesis
identifyingaretheunknown change is in to comparing
advance. two
Further-
the corresponding data stream Krishnamurthy samples
more, weinare orderonlyto interested
find out theindifference
recent data. between
A common them,
presented a sketch-based change detection method for the which is called two-sample change detection, or window-
Data characteristics: Change detection methods can be approach to identifying the change is to comparing two
most general streaming model Turnstile model [35]. based change detection.
classified on the basis of the data characteristics of stream- samples in order to find As out data stream is infinite,
the difference betweenathem, slid-
ing
Datadata such as data dimensionality,
characteristics: Change detection datamethods
label, andcandatabe ing
whichwindow is often
is called used to detect
two-sample changechanges.
detection, Window based
or window-
type. A data
classified on item coming
the basis from
of the datathecharacteristics
data stream can be uni-
of stream- change detection
based change incurs the
detection. Ashighdatadelay
stream [37].is Window-based
infinite, a slid-
variate
ing dataorsuch
multi-dimensional. It would data
as data dimensionality, be great if we
label, andcould
data change
ing windowdetection scheme
is often used is to based
detecton the dissimilarity
changes. Window based mea-
develop a general
type. A data algorithm
item coming able
from thetodata
detect changes
stream can in
be both
uni- sure
changebetween
detectiontwoincurs
distributions
the highordelay synopses extracted from
[37]. Window-based
univariate and multidimensional
variate or multi-dimensional. data streams.
It would be great Change de-
if we could the reference
change detectionwindow
scheme andisthe current
based on thewindow.
dissimilarity mea-
tection
developalgorithms in streaming
a general algorithm ablemultivariate data have
to detect changes in been
both sure between two distributions or synopses extracted from
Velocity of data change: Aggarwal proposes a framework
presented
univariate [14; 34; 36]. Data streams
and multidimensional datacan be classified
streams. Changeintode- the reference window and the current window.
that can deal with the changes in both spatial velocity pro-
categorial data stream
tection algorithms and numerical
in streaming data stream.
multivariate Webeen
data have can
Velocity
file of data change:
and temporal velocityAggarwal
profile [1;proposes
2]. In this approach,
a framework
presentedthe
develop change
[14; detection
34; 36]. algorithm
Data streams canforbecategorial
classified data
into thatchanges
the can dealin datathedensity
with changes occurring at eachvelocity
in both spatial locationpro- are
categorial
stream data stream
or numerical and
data numerical
stream. In realdata stream.
world We can
applications, file and temporal velocityvelocity
profile [1; 2]. In in thissome
approach,
estimated by estimating density user-
develop
each datatheitem
change detection
in data streamalgorithm for categorial
may include multipledataat- the changes in data densityAn occurring at advantage
each location are
defined temporal window. important of this
stream orofnumerical
tributes both numericaldata stream. In real world
and categorial data. applications,
In such situ- estimated
approach isbythat estimating
it visualizes velocity density This
the changes. in some user-
visualiza-
each data
ations, theseitem
datainstreams
data stream may include
can be projected multiple
by each at-
attribute defined temporal helpswindow.
tion of changes userAn important the
understand advantage
changesofintu-this
tributes
or groupofofboth numerical
attributes. and categorial
Change detectiondata. In such
methods cansitu-
be approach
itively. is that it visualizes the changes. This visualiza-
ations, these
applied to thedata streams can be
corresponding projected
projected by streams
data each attribute
after- tion of changes helps user understand the changes intu-
or group
wards. Dataof streams
attributes. areChange
classifieddetection methods
into labeled data can
streambe Speed
itively. of response: If a change detection method needs
applied to the corresponding
and unlabeled data streams. Aprojected
labeled data
data streams
stream isafter-
one to react to the detected changes as fast as possible, the
wards. Data streams are classified into labeled data stream Speed of response: If a change detection method needs
and unlabeled data streams. A labeled data stream is one to react to the detected changes as fast as possible, the

SIGKDD Explorations Volume 16, Issue 1 Page 32


SIGKDD Explorations Volume 16, Issue 1 Page 32
quickest
quickest detection
detection of of change
change should
should be be proposed.
proposed. Quickest
Quickest can
can be
be divided
divided into into structural
structural and and measurement
measurement components.
components. To To
change
change detection
detection can can help
help aa system
system make make aa timely
timely alarm.
alarm. detect
detect deviation
deviation between
between two two models,
models, theythey compare
compare specific
specific parts
parts
Timely
Timely alarmalarm warning
warning is is benefit
benefit for for economical.
economical. In In some
some of
of these
these corresponding
corresponding models. models. The The models
models obtained
obtained by by data
data
cases,
cases, it it may
may save
save the the human
human life life such
such as as inin fire-fighting
fire-fighting mining
mining algorithms
algorithms includes
includes frequent
frequent itemitem sets,
sets, decision
decision trees,
trees, and
and
system.
system. Change
Change detection
detection methods
methods using using two two overlapping
overlapping clusters.
clusters. TheThe change
change in in model
model may may convey
convey interesting
interesting informa-
informa-
windows
windows can can quickly
quickly reactreact to
to the
the changes
changes in in streaming
streaming data data tion
tion or
or knowledge
knowledge of of an
an event
event or or phenomenon.
phenomenon. Model Model change
change is is
while
while methods
methods using using adjacent
adjacent windows
windows model model may may incur
incur defined
defined in in terms
terms of of the
the difference
difference between
between two two set
set ofof parameters
parameters
the
the high
high delay.
delay. AsAs change
change can can bebe abrupt
abrupt change
change or or gradual
gradual of
of two
two models
models and and thethe quantitative
quantitative characteristics
characteristics of of two
two mod-
mod-
change,
change, therethere exists
exists the the abrupt
abrupt change
change detection
detection algorithm els.
algorithm els. As
As such,
such, model
model change
change detection
detection is is finding
finding the the difference
difference
and gradual change detection algorithm [46; 40]. between two set of parameters of two models and the quantita-
and gradual change detection algorithm [46; 40]. between two set of parameters of two models and the quantita-
tive characteristics of these two models. We should distinguish
Decision making methodology: Based on the decision tive characteristics of these two models. We should distinguish
Decision making methodology: Based on the decision between detection of changes in data distribution by using mod-
making methodology, a change detection method can fall between detection of changes in data distribution by using mod-
making methodology, a change detection method can fall els and detection of changes in model built from streaming data.
into one of the following categories: rank-based method els and detection of changes in model built from streaming data.
into one of the following categories: rank-based method While model change detection aims to identify the difference be-
[33], density-based method [55], information-theoretic While model change detection aims to identify the difference be-
[33], density-based method [55], information-theoretic tween two models, change detection in the underlying data distri-
method [15]. A change detection problem can be also clas- tween two models, change detection in the underlying data distri-
method [15]. A change detection problem can be also clas- bution by using models is inferring the changes in two data sets
sified into batch change detection and sequential change bution by using models is inferring the changes in two data sets
from the difference between two models constructed from two
sified into Based
batch on change detection
delayand that sequential change from
detection.
detection. Based
detection
on detection delay
a change detector
that a canchange detector data sets. difference
the The changes between two modelsdata
in the underlying constructed
distribution fromcantwoin-
suffers from, a change detection methods fall into one data
duce sets. The changes inchanges
the corresponding the underlying
in the model data distribution
produced from can the
in-
suffers from, a change
of two following types: detection
real-time methods can fall into
change detection, and one
ret- duce the corresponding changes in the model produced from the
of two following data generating process.
rospective changetypes: real-time
detection. Based change
on thedetection,
spatial orand tempo-ret- data generating
As models can be process.
generated by statistics method or data mining
rospective change detection. Based on
ral characteristics of data, change detection algorithm canthe spatial or tempo- As models can be generatedinby statistics
ral characteristics of data, methods, change detection models canmethod or datainto
be classified mining
data
fall into one of three kinds:change
spatialdetection algorithmtem-
change detection; can methods,
mining change
model anddetection
statisticalinmodel.
modelsTwo cankinds
be classified
of models into
wedataare
fall
poralinto one ofdetection;
change three kinds: spatial change detection;
or spatio-temporal change detec- tem- mining model and statistical model. Two kinds of models we are
poral[6].change detection; or spatio-temporal change detec- interested in detecting changes are predictive model and explana-
tion interested
tory model. in Predictive
detecting changes
model is areused
predictive model
to predict theand explana-
changes in
tion [6]. tory model.Detecting
Predictive model inis the used to predict
Application: On the basis of applications that generate data the future. changes pattern can bethe changesfor
beneficial in
Application: the future.
many Detecting
applications. Inchanges
explanatory in the pattern
model, can be beneficial
a change that occurred for
streams, data On the basis
streams can beof applications
classified as that into generate
transactionaldata
streams, data streams many
is bothapplications.
detected and In explained.
explanatoryThere model, area change that occurred
some approaches to
data stream, sensor can databestream,
classified as into data
network transactional
stream,
data stream, sensor data stream, data network data stream, is both detection:
change detected and explained.
one-model There are
approach, some approaches
two-model approach, or to
stock order data stream, astronomy stream, video data
stock
stream, order
etc. data
Based stream,
on theastronomy data stream,there
specific applications, video aredata
the change detection:
multiple-model one-model approach, two-model approach, or
approach.
stream, etc.
change Based methods
detection on the specific
for theapplications,
corresponding thereapplica-
are the multiple-model
A model-basedapproach. change detection algorithm consists of two
change
tions such detection
as change methods for methods
detection the corresponding
for sensor applica-
stream- A model-based
phases as follows: changemodeldetection
constructionalgorithm consistsdetection.
and change of two
tionsdata
ing such[56],
as change
changedetection
detectionmethods
methodsfor forsensor stream-
transactional phases as follows:
First, a model is builtmodel
by using construction
some stream andmining
change detection.
method such
ing data [56],
streaming datachange
[45; 57;detection methodsvan
8]. For example, for Leeuwen
transactionaland First,
as a model
decision is clustering,
tree, built by using somepattern.
frequent stream mining
Second,method such
a difference
streaming
Siebes [57]data
have [45; 57; 8]. For
presented example,
a change van Leeuwen
detection methodand for as decision
measure tree, clustering,
between two models frequent pattern.based
is computed Second,the acharacteris-
difference
Siebes [57] have
transactional presented
streaming dataabased
change on detection
the principle methodof Min-for measure
tics of the between
model,two thismodels
step isisalso
computed
called based the characteris-
the quantification of
transactional
imum streaming
Description Length.data based on the principle of Min- tics of difference.
model the model, Therefore,
this step isone alsofundamental
called the quantification
issue here is of to
imum Description Length. model difference.
quantify the changes Therefore,
between two one models
fundamentaland to issue
determinehere crite-
is to
Stream processing methodology: Based on methodology ria for making
quantify decision
the changes whether
between twoand whenand
models a change in the model
to determine crite-
Stream
for processing
processing methodology:
data stream, a data stream
Based on canmethodology
be classified occurs. Recently,
ria for making some whether
decision change detection
and when methodsa change in the streaming
model
into online data
for processing datastream
stream,anda off-line
data stream datacan stream [38]. In
be classified data by clustering
occurs. Recently, some have beenchange proposed
detection[10;methods
3]. Basedinon the data
streaming
some work, data
into online an online
streamdata andstream
off-lineis datacalled a live[38].
stream stream In stream
data bymining
clustering model,
havewe beenmay have the[10;
proposed corresponding
3]. Based onproblems
the data
while
some an work,off-line data stream
an online is called
data stream is archived
called a data live stream of detecting
stream mining changes
model,inwe modelmayashavefollows. Ikonomovska et
the corresponding al. [30]
problems
[18].
whileOnline datadata
an off-line streamstreamneeds to be processed
is called archived data online be-
stream have presented
of detecting an algorithm
changes in model for learningIkonomovska
as follows. regression trees et al.from
[30]
cause of its high
[18]. Online data speed.
stream Such needsonline data streams
to be processed include
online be- streaming
have presented data in an the presence
algorithm foroflearning
conceptregression
drifts. Their treeschange
from
streams
cause ofofitsstockhighticker,
speed.streams of network
Such online measurements,
data streams include detection
streamingmethod data inisthe based on sequential
presence of concept statistical
drifts.tests
Their that mon-
change
and
streamsstreams
of stockof sensor data,etc.ofOff-line
ticker, streams networkstream is a se-
measurements, itoring
detection themethod
changes of theon
is based local error, atstatistical
sequential each node testsof that
tree,mon-
and
quence
and streams of updates
of sensor to warehouses or backup
data,etc. Off-line devices.
stream is a These- inform
itoring the
the learning
changes process of theerror,
of the local localatchanges.
each node of tree, and
queries
quence of over the off-line
updates streams orcan
to warehouses be processed
backup devices. The off- Detecting
inform thechangeslearningof streamofcluster
process model
the local has been received in-
changes.
line. However, as off-line
it is insufficient
streams time can to be process off-line creasing
queries over the processed off- Detectingattention.
changes of Zhou et al.cluster
stream [59] have
model presented
has beena received
method for in-
streams, techniques
as it isforinsufficient
summarizing timedata are necessary. tracking
line. However, to process off-line creasing the evolution
attention. Zhou of et
clusters
al. [59] over
havesliding windows
presented by using
a method for
In off-linetechniques
change detection method, the entire
are data set is temporal cluster features and the overexponential histogram,
streams, for summarizing data necessary. tracking the evolution of clusters sliding windows bywhich
using
available
In off-linefor the analysis process to detect the change. The called exponential histogram
change detection method, the entire data set is temporal cluster features andofthecluster features.histogram,
exponential Chen and Liu which[9]
online method
for thedetects theprocess
changetoincrementally basedThe on have
available analysis detect the change. calledpresented
exponential a framework
histogram for detecting
of cluster the changes
features. Chen and in cluster-
Liu [9]
the recently incoming
detects data item. An important distinction ing
online
between
method
off-line method
the change
and online
incrementally based on havestructures
presentedconstructed
a framework fromfor categorial
detecting the datachanges
streamsinby using
cluster-
the recently incoming data item. An one is that distinction
important the online hierarchial entropy trees to capture the entropy
ing structures constructed from categorial data streams by using characteristics of
method
betweenisoff-line
constrained byand
the online
detection oneand reaction time clusters, andentropy
then detecting
due to the
method is that the online hierarchial trees to changes
capture the in clustering structures based
entropy characteristics of
method is requirement
constrained by of real-time
the detection applications
and reaction whiletimethe on these and
clusters, entropythencharacteristics.
detecting changes in clustering structures based
off-line is free from the detection time, and reaction time. Based onentropy
the data stream mining model, we may have the cor-
due to the requirement of real-time applications while the on these characteristics.
Methods for detecting changes can be useful for stream- responding problems of detecting changes in model as follows
off-line is free from the detection time, and reaction time. Based on the data stream mining model, we may have the cor-
ing data warehouses where both live streams of data and [14]. Recently Ng and Dash [44] have introduced an algorithm
Methods for detecting changes can be useful for stream- responding problems of detecting changes in model as follows
archived data streams are available [24; 29]. In this work, for mining frequent patterns from evolving data streams. Their
ing data warehouses where both live streams of data and [14]. Recently Ng and Dash [44] have introduced an algorithm
we focus on developing the methods for detecting changes algorithm is capable of updating the frequent patterns based on
archived data streams are available [24; 29]. In this work, for mining frequent patterns from evolving data streams. Their
in online data streams, in particular, sensor data streams. the algorithms for detecting changes in the underlying data dis-
we focus on developing the methods for detecting changes algorithm is capable of updating the frequent patterns based on
in work
onlineondata streams, inchange particular, sensorproposed
data streams. tributions. Two windows are used for change detection: the ref-
The first model-based detection by [21; the algorithms
erence windowfor anddetecting
the current changes
window. in the
At underlying data dis-
the initial stage, the
22] is FOCUS. The central idea behind FOCUS is that the models tributions. Two windows are used for change detection: the ref-
The first work on model-based change detection proposed by [21;
erence window and the current window. At the initial stage, the
22] is FOCUS. The central idea behind FOCUS is that the models

SIGKDD Explorations Volume 16, Issue 1 Page 33


SIGKDD Explorations Volume 16, Issue 1 Page 33
reference is initialized with the first batch of transactions from erance. The scalability refers to the ability to extend the
reference
data is initialized
stream. The currentwith the first
window batch
moves onof
thetransactions
data streamfrom and erance.
size Thenetwork
of the scalability refers
without to the ability
significantly to extend
reducing the
the per-
data stream.
captures the The
next current
batch ofwindow moves Two
transactions. on the data stream
frequent and
item sets size of theof
formance network without significantly
the framework. As faults may reducing
occurthe
dueper-
to
are constructed
captures the nextfrom
batchtwoofcorresponding
transactions. Two windows
frequentby using
item setsthe the transmission
formance of the error and the As
framework. effects
faultsof noisy channels
may occur duebe-
to
Apriori algorithm.
are constructed fromA statistical test is performed
two corresponding windows on by
twousing
absolutethe tween local sensors
the transmission and
error andfusion center,ofa noisy
the effects distributed change
channels be-
support values thatAare
Apriori algorithm. computed
statistical testby
is the Apriorion
performed from
twothe refer-
absolute detection
tween local method
sensors should be able
and fusion to tolerate
center, these faults
a distributed changein
ence
supportwindow
valuesand
thatcurrent window.byBased
are computed on the from
the Apriori statistical test,
the refer- order
detectionto assure
method theshould
functionbe of
ablethetosystem.
tolerate these faults in
the
encedeviation
window canand be significant
current window. or insignificant.
Based on theIfstatistical
the deviation
test, order to assure the function of the system.
is
thesignificant
deviation then a change
can be in theordata
significant stream is reported.
insignificant. Chang
If the deviation Distributed change detection using the local approach is di-
and Lee [8] have
is significant then presented
a change ina themethod for monitoring
data stream the Chang
is reported. recent rectly relevant
Distributed to the
change problemusing
detection of multiple
the localhypotheses
approach istest-
di-
change
and Leeof[8]frequent item setsa from
have presented methoddataforstream by using
monitoring the sliding
recent ing
rectly and data fusion
relevant because of
to the problem each local change
multiple hypothesesdetector
test-
window.
change of frequent item sets from data stream by using sliding needs
ing and to data
perform
fusiona hypothesis
because each test local
to determine
change whether
detector
window. aneeds
change occurs. Therefore,
to perform a hypothesisbesides
test toconsidering the de-
determine whether
2.3 Design Methodology tection
a change performance of local change
occurs. Therefore, besides detection
consideringalgorithms
the de-
2.3 Design
There are Methodology
two design methodologies for developing the change including probabilityof
tection performance oflocal
detection
change anddetection
probability of false
algorithms
detection alarm
includingat the node level,ofthe
probability detection
detection and performance
probability of
ofafalse
dis-
There are algorithms
two design in streaming data.
methodologies for The first methodology
developing the change
is to adaptalgorithms
the existing tributed
alarm at change
the nodedetection
level, themethod
detection at the fusion center
performance of amust
dis-
detection in change
streamingdetection
data. Themethods
first for streaming
methodology
data. However, many traditional change detection methods can- be takenchange
tributed into account.
detection method at the fusion center must
is to adapt the existing change detection methods for streaming
not
data.beHowever,
extendedmany for streaming
traditionaldata because
change of the methods
detection high compu-
can- be taken into account.
Distributed detection and data fusion have been widely studied
tational complexity
not be extended for such as some
streaming kernel-based
data because of change
the highdetection
compu- for many decades. However,
methods, and density-based change detection methods. The sec- Distributed detection and dataonly recently,
fusion have distributed
been widelydetection
studied
tational complexity such as some kernel-based change detection in
forstreaming
many decades.data has receivedonly
However, attention.
recently, distributed detection
ond methodology
methods, is to develop
and density-based new change
change detection detection
methods. methods for
The sec- in streaming data has received attention.
streaming data. is to develop new change detection methods for
ond methodology 3.1 Distributed Detection: One-shot versus
There are two
streaming data.common approaches to the problem of change de- 3.1 Continuous
Distributed Detection: One-shot versus
tection
There arein streaming
two common dataapproaches
distributions: distance-based
to the change de-
problem of change de- Continuous
Distributed detection of changes can be classified into two types
tectors and
tection in predictive
streaming datamodel-based
distributions:change detectors. change
distance-based In the for-
de- of models asdetection
follows. of changes can be classified into two types
mer, two windows are used to extractchange
two data segments Distributed
tectors and predictive model-based detectors. Infrom the
the for- of models as follows.
data
mer, stream.
two windowsThe change
are used is to
quantified
extract twoby data
usingsegments
some dissimilar-
from the One-shot distributed detection of changes: Figure 2 shows
ity
datameasure.
stream. TheIf thechange
dissimilarity measure
is quantified is greater
by using somethan a given
dissimilar- two models
One-shot of one-shot
distributed distributed
detection change detection.
of changes: One-
Figure 2 shows
threshold then a change is detected. Similar to
ity measure. If the dissimilarity measure is greater than a given distance-based shot change of
two models detection
one-shot method means
distributed a change
change detector
detection. de-
One-
change
threshold detectors, two windows
then a change are used
is detected. for detecting
Similar changes.
to distance-based tects and reacts
shot change to themethod
detection detected change
means once detector
a change a changede-is
Instead of comparing the dissimilarity measure
change detectors, two windows are used for detecting changes. between two win- detected.
tects and One-shot
reacts to thedistributed
detectedchange
changedetection have re-
once a change is
dows
Insteadwith a given threshold, a change is detected by using the ceived great deal of attention forchange
a long time. One-shot
of comparing the dissimilarity measure between two win- detected. One-shot distributed detection havedis-
re-
prediction error of the model built from the current window and tributed change
dows with a given threshold, a change is detected by using the ceived great dealdetection include
of attention two models:
for a long distributed
time. One-shot dis-
the predictive
prediction model
error of theconstructed
model built from
from thethereference
current window.
window and detection with decision with decision fusion as shown in
tributed change detection include two models: distributed
the predictive model constructed from the reference window. Figure
detection 2(a);
withdistributed
decision detection without
with decision decision
fusion fusion
as shown in
3. DISTRIBUTED CHANGE DETECTION as illustrated in Figure 2(b). What are the differences
Figure 2(a); distributed detection without decision fusion be-
tween one-shot in detection and What
continuous
are thedetection.
3. IN STREAMINGCHANGE
DISTRIBUTED DATA DETECTION as illustrated Figure 2(b). differences be-
IN STREAMING
Knowledge DATA
discovery from massive amount of streaming data tween one-shot
Continuous detection
distributed and continuous
detection detection.
of changes: In this chap-
can be achieved only when ter, we propose two continuous distributed detection mod-
Knowledge discovery from we could amount
massive develop of thestreaming
change detec-
data Continuous distributed detection of changes: In this chap-
tion frameworks that monitor streaming data created by multiple els as shown in Figure 3. An important distinction
ter, we propose two continuous distributed detection mod-between
can be achieved only when we could develop the change detec-
sources such as sensor networks, WWWdata [13].created
The objectives of continuous
els as showndistributed
in Figure 3.detection of changes
An important and one-shot
distinction between
tion frameworks that monitor streaming by multiple
designing a distributed distributed detection of changes is that the inputs to the
sources such as sensor change
networks,detection
WWWscheme [13]. Theareobjectives
maximizing of continuous distributed detection of changes and one-shot
the lifetime of the network, maximizing the detection capability, one-shot distributed change detection are batches of data
designing a distributed change detection scheme are maximizing distributed detection of changes is that the inputs to the
and minimizing the communication cost [58]. while the inputs to the continuous distributed detection of
the lifetime of the network, maximizing the detection capability, one-shot distributed change detection are batches of data
There are two approaches to the problem of change detection in changes are the data streams in which data items continu-
and minimizing the communication cost [58]. while the inputs to the continuous distributed detection of
streaming data that is created from multiple sources. In the cen- ously arrive.
There are two approaches to the problem of change detection in changes are the data streams in which data items continu-
tralized approach: all remote sites send raw data to the coordi- ously detection
arrive. model without fusion is a truly distributed
streaming data that is created from multiple sources. In the cen- Distributed
nator. The coordinator aggregates all the raw streaming data that
tralized approach: all remote sites send raw data to the coordi- detection model in which The decision-making process occurs at
is received from the remote sites. Detection of changes is per- Distributed detection model without fusion is a truly distributed
nator. The coordinator aggregates all the raw streaming data that each sensor.
formed on the aggregated streaming data. In most cases, com- detection model in which The decision-making process occurs at
is received from the remote sites. Detection of changes is per-
munication consumes the largest amount of energy. The lifetime each
formed on the aggregated streaming data. In most cases, com-
of sensors therefore drastically reduces when they communicate
3.2 sensor.
Locality in Distributed Computing
munication consumes the largest amount of energy. The lifetime As
raw measurements to a centralized server for analysis. Central-
of sensors therefore drastically reduces when they communicate
3.2oneLocality
of the properties of distributed Computing
in Distributed computational systems
ized approaches suffer from the following problems: communi- is locality [43], a distributed algorithm for detecting changes in
raw measurements to a centralized server for analysis. Central- As one of the properties of distributed computational systems
cation constraint, power consumption, robustness, and privacy. streaming data should meet the locality. A local algorithm is de-
ized approaches suffer from the following problems: communi- is locality [43], a distributed algorithm for detecting changes in
Distributed detection of changes in streaming data addresses the fined as one whose resource consumption is independent of the
cation constraint, power consumption, robustness, and privacy. streaming data should meet the locality. A local algorithm is de-
challenges that come from the problem of change detection, data system size. The scalability of distributed stream mining algo-
Distributed detectionand
of changes in streaming data addresses the fined
rithmsascan
onebewhose resource
achieved consumption
by using is independent
the local change detectionof the
algo-
stream processing, the problem of distributed computing. system
challenges
The challengesthat come
comingfrom thethe
from problem of change
distributed detection,
computing data
environ- rithms size. The scalability of distributed stream mining algo-
stream rithms can be achieved by using the local change detection algo-
ment areprocessing,
as follows and the problem of distributed computing. Local algorithms can fall into one of two categories [16]:Exact
rithms
The challenges coming from the distributed computing environ- local algorithms are defined as ones that produce the same results
ment are as follows
Distributed change detection in streaming data is a problem Local algorithmsalgorithm;
as a centralized can fall into one of twolocal
Approximate categories [16]:Exact
algorithms are al-
of distributed computing in nature. Therefore, a distributed local algorithms are defined as ones that produce the same
gorithms that produce approximations of the results that central-results
Distributed
framework for change detection
detecting in streaming
changes data isthe
should meet a problem
proper- as a centralized
ized algorithms algorithm; Approximate
would produce. local algorithms
Two attractive properties are al-
of lo-
of distributed computing in nature. Therefore, a distributed
ties of distributed computing such scalability, and fault tol- gorithms that produce approximations of the results that central-
cal algorithms are scalability and fault tolerance. A distributed
framework for detecting changes should meet the proper- ized algorithms would produce. Two attractive properties of lo-
ties of distributed computing such scalability, and fault tol- cal algorithms are scalability and fault tolerance. A distributed

SIGKDD Explorations Volume 16, Issue 1 Page 34


SIGKDD Explorations Volume 16, Issue 1 Page 34
Phenomenon
Phenomenon
x1 Phenomenon x2
Phenomenon
x1 x2
x1 x2
x1 x2

u1 u2
Local decision Local decision
u1 u2
Local decision Local decision
Global decision
u1 u2
Global decision
(a) Distributed detection without decision (b) Distributed detection with decision fusion
u1
fusion u2
(a) Distributed detection without decision (b) Distributed detection with decision fusion
fusion Figure 2: One-shot distributed change detection models

Figure 2: One-shot distributed change detection models

Phenomenon

Phenomenon Data stream 1 Phenomenon Data stream 2

Data stream 1 Phenomenon Data stream 2 Data stream 1 Data stream 2

Data stream 1 Data stream 2

u1 u2
Local decision Local decision
u1 u1 u2
u2
Local decision Local decision
u1 Global decision
u2
Local decision Local decision
Global decision
(a) Distributed continuous Local
detection without deci- (b) Distributed continuous detection with decision
decision
Local decision
sion fusion fusion
(a) Distributed continuous detection without deci- (b) Distributed continuous detection with decision
sion fusion fusion
Figure 3: Continuous distributed change detection models

Figure 3: Continuous distributed change detection models

SIGKDD Explorations Volume 16, Issue 1 Page 35


SIGKDD Explorations Volume 16, Issue 1 Page 35
framework for mining streaming data should be robust to net- work in change detection is expected to exploit such scalable data
framework
work for mining
partitions, and node streaming
failures. data should be robust to net- work in change
processing toolsdetection is expected
in efficiently detect, to exploitand
localize such scalable
classify data
occur-
workadvantage
The partitions,ofand node
local failures. is the ability to preserve pri-
approaches processing
ring tools
changes. Forinexample,
efficiently detect, localize
distributed changeand classifymodels
detection occur-
The advantage
vacy of local approaches
[20]. A drawback of the localisapproach
the ability
to to
thepreserve
problempri-of ring make
can changes.
use For example,
of the distributed
MapReduce changetodetection
framework models
accelerate their
vacy [20]. A
distributed drawback
change of the
detection is local approach to theproblem.
the synchronization problemFor of can make use
respective of the MapReduce framework to accelerate their
processes.
example,
distributedthe local change
change detectionapproach can meet the principle
is the synchronization problem. ofFor
lo- respective processes.
calized
example,algorithms
the local in wireless
change sensorcan
approach networks in which
meet the dataofpro-
principle lo-
cessing is performed
calized algorithms at node-level
in wireless sensorasnetworks
much asinpossible in order
which data pro-
5. REFERENCES
to reduceisthe
cessing amount of
performed informationastomuch
at node-level be sent
as in the network.
possible in order
5. REFERENCES
to reduce the amount of information to be sent in the network. [1] C. Aggarwal. A framework for diagnosing changes in
3.3 Distributed Detection of Changes in evolving
[1] C. data streams.
Aggarwal. In Proceedings
A framework of the changes
for diagnosing 2003 ACM in
3.3 DistributedData
Streaming Detection of Changes in SIGMOD international
evolving data streams. conference on Management
In Proceedings of the 2003 of ACM
data,
Over theStreaming Data
last decades, the problem of decentralized detection has pages
SIGMOD 575586. ACM New
international York, NY,
conference USA, 2003. of data,
on Management
received
Over the lastmuch attention.
decades, There are
the problem of two directionsdetection
decentralized of researchhas pages 575586. ACM New York, NY, USA, 2003.
on decentralized detection.ThereThe first [2] C. Aggarwal. On change diagnosis in evolving data streams.
received much attention. are approach focusesofon
two directions aggre-
research
gating measurements from The multiple sensors tofocuses
test a single hy- IEEE
[2] C. Transactions
Aggarwal. on Knowledge
On change diagnosis inand Data Engineering,
evolving data streams.
on decentralized detection. first approach on aggre-
pothesis. The second focuses on dealing with to
multiple pages
IEEE 587600,
Transactions 2005.
on Knowledge and Data Engineering,
gating measurements from multiple sensors test a dependent
single hy-
testing/estimation
pothesis. The second tasks fromon
focuses multiple
dealingsensors [51]. Distributed
with multiple dependent pages 587600, 2005.
[3] C. Aggarwal. A segment-based framework for modeling
change detection usually
testing/estimation tasks frominvolves
multiplea setsensors
of sensors
[51]. that receive
Distributed andAggarwal.
[3] C. mining data A streams. Knowledge
segment-based and information
framework sys-
for modeling
observations
change detection from usually
the environment
involves and a setthen
of transmit
sensors those obser-
that receive tems, 30(1):129,
and mining 2012. Knowledge and information sys-
data streams.
vations back to
observations fusion
from center in order
the environment and tothen
reach the final
transmit consensus
those obser-
of detection. Decentralized tems, 30(1):129, 2012.
vations back to fusion centerdetection
in order toand datathefusion
reach are there-
final consensus [4] C. Aggarwal and P. Yu. A survey of synopsis construction
fore two closely
of detection. related tasks
Decentralized that arise
detection andin data
the context
fusion are of sensor
there- in data streams. Data streams: models and algorithms, page
networks [48; 47]. Two traditional approaches to the decentral- [4] C. Aggarwal and P. Yu. A survey of synopsis construction
fore two closely related tasks that arise in the context of sensor 169, 2007.
in data streams. Data streams: models and algorithms, page
ized change
networks detection
[48; 47]. Two aretraditional
data fusion,approaches
and decision to fusion. In data
the decentral-
fusion, each node detects change and sends quantized version of 169, 2007.
ized change detection are data fusion, and decision fusion. In data [5] A. Bondu and M. Boull. A supervised approach for change
its observation
fusion, each node to detects
a fusionchange
centerand responsible for making
sends quantized versiondeci-
of detection
[5] A. Bondu in data
and M.streams.
Boull. A Insupervised
Neural Networks (IJCNN),
approach The
for change
sion on the detected
its observation changes,
to a fusion and responsible
center further relaying information.
for making deci- 2011 International Joint In
Conference on, pages 519526.
detection in data streams. Neural Networks (IJCNN), The
In
sioncontrast,
on the in decision
detected fusion, and
changes, eachfurther
node performs
relaying local change
information. IEEE, 2011.
detection byinusing somefusion,
local change algorithm andlocal
updates its
2011 International Joint Conference on, pages 519526.
In contrast, decision each node performs change IEEE, 2011.
decision
detectionbased on the
by using received
some information
local change and broadcasts
algorithm and updates again
its [6] S. Boriah, V. Kumar, M. Steinbach, C. Potter, and
its new decision. Thisreceived
processinformation
repeats until andconsensus
broadcastsamong S. Klooster.
decision based on the again [6] S. Boriah, Land cover change
V. Kumar, detection: C.
M. Steinbach, a case study.and
Potter, In
the nodesdecision.
are reached.
ThisCompared to datauntil
fusion, decision among
fusion Proceeding
its new process repeats consensus S. Klooster.ofLand
the 14th
coverACM SIGKDD
change international
detection: confer-
a case study. In
can reduceare
the nodes thereached.
communication
Compared costtobecause sensors
data fusion, need only
decision fusion to ence on Knowledge discovery and datainternational
mining, pages 857
Proceeding of the 14th ACM SIGKDD confer-
transmit the local decisions represented by small data structures. 865.
can reduce the communication cost because sensors need only to ence ACM, 2008. discovery and data mining, pages 857
on Knowledge
Although
transmit the there is great
local deal represented
decisions of work on distributed
by small data detection and
structures. 865. ACM, 2008.
data fusion, most of work focuses on the one-time
Although there is great deal of work on distributed detection and change detec- [7] G. Cabanes and Y. Bennani. Change detection in data
tion
data solutions. One-time
of workquery is defined as a querychange
that needs to streams through
fusion, most focuses on the one-time detec- [7] G. Cabanes and unsupervised learning.detection
Y. Bennani. Change In Neural
in Net-
data
proceed data once in order to provide the answer [12]. Likewise, works
tion solutions. One-time query is defined as a query that needs to streams(IJCNN),
through The 2012 International
unsupervised learning. Joint Conference
In Neural Net-
one-time change
oncedetection method is athe
change detection that re- on, pages 16. IEEE,
proceed data in order to provide answer [12]. Likewise, works (IJCNN), The 2012.
2012 International Joint Conference
quires to proceed data once in response to the change
one-time change detection method is a change detection that re- occurred. In
on, pages 16. IEEE, 2012.
real-world applications, we need the approaches
quires to proceed data once in response to the change occurred. Incapable of con- [8] J. Chang and W. Lee. estwin: adaptively monitoring the re-
tinuously
real-worldmonitoring thewe changes of approaches
the events occurring
capable ofincon- the cent change
applications, need the [8] J. Chang andofW.frequent itemsets
Lee. estwin: over online
adaptively data streams.
monitoring the re-
environment. Recently, work on continuous detection
tinuously monitoring the changes of the events occurring in the and mon- In Proceedings of the twelfth international conference on
cent change of frequent itemsets over online data streams.
itoring of changes has been
work started receiving attentionand such as Information andofknowledge
environment. Recently, on continuous detection mon- In Proceedings the twelfthmanagement,
internationalpages 536539.
conference on
[49; 13; 50]. Das et al. [13] have presented a scalable distributed ACM, 2003.
itoring of changes has been started receiving attention such as Information and knowledge management, pages 536539.
framework for detecting changes in astronomy data streams us-
[49; 13; 50]. Das et al. [13] have presented a scalable distributed ACM, 2003.
ing local, asynchronous eigen monitoring algorithms. Palpanas [9] K. Chen and L. Liu. HE-Tree: a framework for detecting
framework for detecting changes in astronomy data streams us-
et al. [49] proposed a distributed framework for outlier detection changes in clustering structure for categorical data streams.
ing local, asynchronous eigen monitoring algorithms. Palpanas [9] K. Chen and L. Liu. HE-Tree: a framework for detecting
in real-time data streams. In their framework, each sensor esti- The VLDB Journal, pages 120.
et al. [49] proposed a distributed framework for outlier detection changes in clustering structure for categorical data streams.
mates and maintains a model for its underlying distribution by
in real-time data streams. In their framework, each sensor esti- The VLDB Journal, pages 120.
[10] T. CHEN, C. YUAN, A. SHEIKH, and C. NEUBAUER.
using kernel density estimators. However, they did not show how
mates and maintains a model for its underlying distribution by Segment-based change detection method in multivariate
to reach the global detection decision. [10] T. CHEN, C. YUAN, A. SHEIKH, and C. NEUBAUER.
using kernel density estimators. However, they did not show how data stream, Apr. 9 2009. WO Patent WO/2009/045,312.
to reach the global detection decision. Segment-based change detection method in multivariate
4. CONCLUDING REMARKS dataCormode.
[11] G. stream, Apr.
The 9continuous
2009. WOdistributed
Patent WO/2009/045,312.
monitoring model.
We
4. argued in this paper that variability,
CONCLUDING REMARKS or simply change, is cru- SIGMOD Record, 42(1):5, 2013.
cial in a world full of affecting factors that alter the behavior of [11] G. Cormode. The continuous distributed monitoring model.
We argued in this paper that variability, or simply change, is cru- SIGMOD Record,
the data, and consequently the underlying model. The ability to
cial in a world full of affecting factors that alter the behavior of [12] G. Cormode and 42(1):5, 2013. Efficient strategies for
M. Garofalakis.
detect such changes in centralized as well as distributed system continuous distributed tracking tasks. IEEE Data Engineer-
the data, and consequently the underlying model. The ability to [12] G.
plays an important role in identifying validity of data models.
detect such changes in centralized as well as distributed system ing Cormode and M. Garofalakis.
Bulletin, 28(1):3339, 2005. Efficient strategies for
The paper presented the state-of-the-art in this area of paramount continuous distributed tracking tasks. IEEE Data Engineer-
plays an important role in identifying validity of data models. ingDas,
Bulletin, 28(1):3339, 2005.
importance. Techniques, in some cases, are tightly coupled with [13] K. K. Bhaduri, S. Arora, W. Griffin, K. Borne, C. Gi-
The paper presented
application domains.the state-of-the-art
However, most of in
thethis area of paramount
techniques reviewed annella, and H. Kargupta. Scalable Distributed Change De-
importance.
in this paper are generic and could be adapted to coupled
Techniques, in some cases, are tightly differentwith [13] K. Das,from
K. Bhaduri, S. Arora,
do- tection Astronomy Data W. Griffin,
Streams K. Borne,
using Local, C. Gi-
Asyn-
application domains. However, most of the techniques reviewed annella,
mains of applications. chronousand H. Kargupta.
Eigen Scalable
Monitoring Distributed
Algorithms. Change
In SIAM De-
Interna-
in this paper are generic and could be adapted to different do- tection from Astronomy Data Streams using2009.
Local, Asyn-
With Big Data technologies reaching a mature stage, the future tional Conference on Data Mining, Nevada,
mains of applications. chronous Eigen Monitoring Algorithms. In SIAM Interna-
With Big Data technologies reaching a mature stage, the future tional Conference on Data Mining, Nevada, 2009.

SIGKDD Explorations Volume 16, Issue 1 Page 36


SIGKDD Explorations Volume 16, Issue 1 Page 36
[14] T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and [29] W. Huang, E. Omiecinski, L. Mark, and M. Nguyen. His-
[14] T.
K. Dasu, S. Krishnan,
Yi. Change D. Lin,
(Detection) You S.
CanVenkatasubramanian,
Believe in: Finding Dis- and [29] tory
W. Huang,
guidedE.low-cost
Omiecinski,
changeL. Mark, and in
detection M.streams.
Nguyen. Data
His-
K. Yi. Change
tributional (Detection)
Shifts You CanInBelieve
in Data Streams. in: Finding
Proceedings of theDis-
8th tory guided low-cost
Warehousing change detection
and Knowledge Discovery,in streams. Data
pages 7586,
International Symposium
tributional Shifts on Intelligent
in Data Streams. Data Analysis:
In Proceedings of the Ad-
8th 2009.
Warehousing and Knowledge Discovery, pages 7586,
vances in Intelligent
International Dataon
Symposium Analysis VIII,Data
Intelligent pageAnalysis:
34. Springer,
Ad- 2009.
2009.
vances in Intelligent Data Analysis VIII, page 34. Springer, [30] E. Ikonomovska, J. Gama, R. Sebastio, and D. Gjorgjevik.
2009. [30] Regression trees from
E. Ikonomovska, dataR.
J. Gama, streams withand
Sebastio, driftD.detection. In
Gjorgjevik.
[15] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi. Discovery
RegressionScience,
trees frompages
data121135.
streams Springer,
with drift2009.
detection. In
AnDasu,
[15] T. information-theoretic
S. Krishnan, S. approach to detecting and
Venkatasubramanian, changes in
K. Yi. Discovery Science, pages 121135. Springer, 2009.
multi-dimensional data streams.
An information-theoretic In 38th
approach Symposium
to detecting on the
changes in [31] M. Karnstedt, D. Klan, C. Plitz, K.-U. Sattler, and
Interface of Statistics,
multi-dimensional Computing
data streams. Science,
In 38th and Applica-
Symposium on the [31] C.
M. Franke. Adaptive
Karnstedt, burst detection
D. Klan, C. Plitz,in K.-U.
a stream engine.and
Sattler, In
tions. Citeseer,
Interface 2005. Computing Science, and Applica-
of Statistics, Proceedings of the 2009
C. Franke. Adaptive ACM
burst symposium
detection on Applied
in a stream Com-
engine. In
tions. Citeseer, 2005. puting, pagesof15111515.
Proceedings the 2009 ACM ACM, 2009. on Applied Com-
symposium
[16] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kar- puting, pages 15111515. ACM, 2009.
gupta.
[16] S. Distributed
Datta, K. Bhaduri, data
C. mining in peer-to-peer
Giannella, R. Wolff, andnetworks.
H. Kar- [32] Y. Kawahara and M. Sugiyama. Change-point detection in
IEEE
gupta.Internet Computing,
Distributed pages in
data mining 1826, 2006. networks.
peer-to-peer [32] time-series
Y. Kawaharadata andbyM.direct density-ratio
Sugiyama. estimation.
Change-point In Pro-
detection in
ceedings of data
time-series 2009bySIAM
directInternational
density-ratioConference
estimation.on In Data
Pro-
IEEE Internet Computing, pages 1826, 2006.
[17] J. Dean and S. Ghemawat. Mapreduce: Simplified data pro- Mining
ceedings(SDM2009),
of 2009 SIAM pages 389400, 2009.
International Conference on Data
cessing
[17] J. Dean andon S.
large clusters. Mapreduce:
Ghemawat. Communications of the
Simplified dataACM,
pro- Mining (SDM2009), pages 389400, 2009.
51(1):107113, [33] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in
cessing on large2008. clusters. Communications of the ACM,
51(1):107113, 2008. [33] data streams.
D. Kifer, In Proceedings
S. Ben-David, and J.of the Thirtieth
Gehrke. international
Detecting change in
[18] N. Dindar, P. M. Fischer, M. Soner, and N. Tatbul. Effi- conference
data streams.onInVery large dataofbases-Volume
Proceedings the Thirtieth 30, page 191.
international
ciently
[18] N. correlating
Dindar, complexM.
P. M. Fischer, events
Soner, over
andlive
N. and archived
Tatbul. Effi- VLDB Endowment,
conference 2004.data bases-Volume 30, page 191.
on Very large
data streams.
ciently In ACM
correlating DEBS Conference,
complex 2011.and archived
events over live VLDB Endowment, 2004.
[34] A. Kim, C. Marzban, D. Percival, and W. Stuetzle. Using
data streams. In ACM DEBS Conference, 2011.
[19] G. Dong, J. Han, L. Lakshmanan, J. Pei, H. Wang, and [34] labeled
A. Kim,data to evaluate
C. Marzban, D.change detectors
Percival, and W.inStuetzle.
a multivariate
Using
P. Yu.
[19] G. Dong,Online mining
J. Han, of changes from
L. Lakshmanan, dataH.streams:
J. Pei, Wang, andRe- streaming
labeled dataenvironment. Signal detectors
to evaluate change Processing,in a89(12):2529
multivariate
search
P. Yu. problems and preliminary
Online mining of changesresults. Citeseer.
from data streams: Re- 2536, 2009.environment. Signal Processing, 89(12):2529
streaming
search
[20] A. problems J.and
R. Ganguly, preliminary
Gama, results. Citeseer.
O. A. Omitaomu, M. M. Gaber, 2536, 2009.
[35] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-
andR.
[20] A. R.Ganguly,
R. Vatsavai. Knowledge
J. Gama, O. A.discovery
Omitaomu,from
M.sensor data,
M. Gaber, [35] based change detection:
B. Krishnamurthy, Methods,
S. Sen, evaluation,
Y. Zhang, and applica-
and Y. Chen. Sketch-
volume 7. CRC, 2008.
and R. R. Vatsavai. Knowledge discovery from sensor data, tions. In Proceedings of the 3rd ACM SIGCOMM Confer-
based change detection: Methods, evaluation, and applica-
volume ence
tions.on
InInternet Measurement,
of the 3rdpages
ACM234247. ACM New
[21] V. Ganti,7.J.CRC, 2008.
Gehrke, and R. Ramakrishnan. A framework for York, NY,
Proceedings
USA, 2003.
SIGCOMM Confer-
measuring ence on Internet Measurement, pages 234247. ACM New
[21] V. Ganti, J. changes in data
Gehrke, and characteristics.AInframework
R. Ramakrishnan. Proceedings
for
of the eighteenth
measuring ACM
changes SIGMOD-SIGACT-SIGART
in data sympo-
characteristics. In Proceedings [36] York, NY, USA,
L. Kuncheva. 2003. detection in streaming multivariate
Change
sium
of theon Principles
eighteenth ACM of SIGMOD-SIGACT-SIGART
database systems, pages 126137.
sympo- [36] data using likelihood
L. Kuncheva. Changedetectors.
detectionKnowledge andmultivariate
in streaming Data Engi-
ACM,
sium on1999.
Principles of database systems, pages 126137. neering, IEEE
data using Transactions
likelihood on, Knowledge
detectors. (99):11, 2011.
and Data Engi-
ACM,
[22] V. 1999.
Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh. A [37] neering, IEEE
X. Liu, X. Wu,Transactions
H. Wang, R.on, (99):11,
Zhang, 2011. and K. Ra-
J. Bailey,
framework
[22] V. Ganti, J.forGehrke,
measuringR. differences
Ramakrishnan, in dataand
characteristics.
W. Loh. A [37] mamohanarao.
X. Liu, X. Wu, Mining
H. Wang,distribution
R. Zhang,change in stock
J. Bailey, and K.order
Ra-
Journal of Computer and System Sciences, 64(3):542578,
framework for measuring differences in data characteristics. streams. Prof. of ICDE, pages 105108, 2010.
mamohanarao. Mining distribution change in stock order
2002.
Journal of Computer and System Sciences, 64(3):542578, [38] streams.
G. Manku Prof.
andof
R.ICDE, pages
Motwani. 105108, 2010.
Approximate frequency counts
2002.
[23] S. Geisler, C. Quix, and S. Schiffer. A data stream-based over data streams. In Proceedings of the 28th international
[38] G. Manku and R. Motwani. Approximate frequency counts
evaluation
[23] S. Geisler, framework
C. Quix, and forS.traffic information
Schiffer. systems. In
A data stream-based conference on Very Large Data of Bases, pages 346357.
over data streams. In Proceedings the 28th international
Proceedings of the ACM SIGSPATIAL International Work- VLDB Endowment, 2002.
evaluation framework for traffic information systems. In conference on Very Large Data Bases, pages 346357.
shop on GeoStreaming, pages 1118. ACM, 2010.
Proceedings of the ACM SIGSPATIAL International Work- [39] VLDB Endowment,
J. Manyika, 2002.
M. Chui, B. Brown, J. Bughin, R. Dobbs,
shop
[24] L. on GeoStreaming,
Golab, T. Johnson, pages 1118. ACM,
J. S. Seidel, and V.2010.
Shkapenyuk. C. Roxburgh, and A. Byers. Big data: The next frontier for
[39] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,
Stream warehousing with datadepot. In Proceedings of the innovation, competition and productivity. McKinsey Global
[24] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk. C. Roxburgh, and A. Byers. Big data: The next frontier for
35th SIGMOD international conference on Management of Institute, May, 2011.
Stream warehousing with datadepot. In Proceedings of the innovation, competition and productivity. McKinsey Global
data, pages 847854. ACM, 2009.
35th SIGMOD international conference on Management of [40] Institute,
A. Maslov, May,M. 2011.
Pechenizkiy, T. Krkkinen, and M. Thti-
data,J.pages
[25] A. Hey, 847854. ACM,
S. Tansley, and2009.
K. M. Tolle. The fourth nen. Quantile index for gradual and abrupt change detec-
[40] A. Maslov, M. Pechenizkiy, T. Krkkinen, and M. Thti-
paradigm: data-intensive scientific discovery. Microsoft tion from cfb boiler sensor data in online settings. In Pro-
[25] A. J. Hey, S. Tansley, and K. M. Tolle. The fourth nen. Quantile index for gradual and abrupt change detec-
Research Redmond, WA, 2009. ceedings of the Sixth International Workshop on Knowledge
paradigm: data-intensive scientific discovery. Microsoft tion from cfb boiler sensor data in online settings. In Pro-
Discovery from Sensor Data, pages 2533. ACM, 2012.
Research
[26] S. Redmond,
Hido, T. WA, 2009.
Id, H. Kashima, H. Kubo, and H. Matsuzawa. ceedings of the Sixth International Workshop on Knowledge
Unsupervised change analysis using supervised learning. [41] Discovery from Sensor
S. Muthukrishnan. DataData, pagesAlgorithms
streams: 2533. ACM,
and 2012.
applica-
[26] S. Hido, T. Id, H. Kashima, H. Kubo, and H. Matsuzawa.
In Proceedings of the 12th Pacific-Asia conference on Ad- tions. Now Publishers Inc, 2005.
Unsupervised change analysis using supervised learning. [41] S. Muthukrishnan. Data streams: Algorithms and applica-
vances in knowledge discovery and data mining, pages
In Proceedings
148159. of the 12th 2008.
Springer-Verlag, Pacific-Asia conference on Ad- [42] tions. Now Publishers
S. Muthukrishnan, Inc,den
E. van 2005.Berg, and Y. Wu. Sequential
vances in knowledge discovery and data mining, pages change detection on data streams. ICDM Workshops, 2007.
148159.
[27] S. Ho and Springer-Verlag, 2008. changes in unlabeled data
H. Wechsler. Detecting [42] S. Muthukrishnan, E. van den Berg, and Y. Wu. Sequential
streams using martingale. In Proceedings of the 20th in- [43] change
M. Naordetection
and L. on data streams.
Stockmeyer. ICDM
What can Workshops,
be computed 2007.
lo-
[27] S. Ho and H.joint
ternational Wechsler. Detecting
conference changesintelligence,
on Artifical in unlabeledpages
data cally? pages 184193, 1993.
streams [43] M. Naor and L. Stockmeyer. What can be computed lo-
19121917. Morgan Kaufmann Publishers Inc., 2007. in-
using martingale. In Proceedings of the 20th
[44] cally?
W. Ng pages
and M.184193,
Dash. A 1993.
change detector for mining frequent
ternational joint conference on Artifical intelligence, pages
19121917.
[28] W. Huang, E.Morgan Kaufmann
Omiecinski, and L. Publishers Inc., 2007.
Mark. Evolution in Data patterns over evolving data streams. In Systems, Man and
[44] Cybernetics,
W. Ng and M.2008.Dash.SMC
A change
2008. detector for mining frequent
IEEE International Confer-
Streams. 2003. patterns
[28] W. Huang, E. Omiecinski, and L. Mark. Evolution in Data ence on, over
pagesevolving data IEEE,
24072412. streams. In Systems, Man and
2008.
Streams. 2003. Cybernetics, 2008. SMC 2008. IEEE International Confer-
ence on, pages 24072412. IEEE, 2008.

SIGKDD Explorations Volume 16, Issue 1 Page 37


SIGKDD Explorations Volume 16, Issue 1 Page 37
[45] W. Ng and M. Dash. A test paradigm for detecting changes [53] G. Ross, D. Tasoulis, and N. Adams. Online annotation and
in transactional
[45] W. data streams.
Ng and M. Dash. In Database
A test paradigm Systemschanges
for detecting for Ad- [53] prediction
G. Ross, D.for regimeand
Tasoulis, switching dataOnline
N. Adams. streams. In Proceed-
annotation and
vanced Applications,
in transactional pages 204219.
data streams. Springer,
In Database 2008.
Systems for Ad- ings of the for
prediction 2009 ACMswitching
regime symposium on streams.
data Applied InComputing,
Proceed-
vanced Applications, pages 204219. Springer, 2008. pages
ings of15011505.
the 2009 ACM ACM, 2009.
symposium on Applied Computing,
[46] D. Nikovski and A. Jain. Fast adaptive algorithms for abrupt
pages 15011505. ACM, 2009.
change
[46] D. detection.
Nikovski and A.Machine learning,
Jain. Fast adaptive79(3):283306, 2010.
algorithms for abrupt [54] A. Singh. Review Article Digital change detection tech-
change detection. Machine learning, 79(3):283306, 2010. [54] niques using
A. Singh. remotely-sensed
Review data. International
Article Digital Journal
change detection of
tech-
[47] R. Niu and P. K. Varshney. Performance analysis of dis-
tributed detection Remote Sensing,
niques using 10(6):9891003,
remotely-sensed data.1989.
International Journal of
[47] R. Niu and P. K. in a randomPerformance
Varshney. sensor field. analysis
Signal Process-
of dis-
ing, IEEE
tributed Transactions
detection on, 56(1):339349,
in a random 2008. Process-
sensor field. Signal Remote Sensing, 10(6):9891003, 1989.
[55] X. Song, M. Wu, C. Jermaine, and S. Ranka. Statistical
ing, IEEE Transactions on, 56(1):339349, 2008.
[48] R. Niu, P. K. Varshney, and Q. Cheng. Distributed detec- [55] change
X. Song,detection
M. Wu,for C. multi-dimensional data. In Statistical
Jermaine, and S. Ranka. Proceed-
tionNiu,
[48] R. in a P.
large
K. wireless
Varshney,sensor network.
and Q. Cheng.Information
DistributedFusion,
detec- ings of detection
change the 13th ACM SIGKDD international
for multi-dimensional data. Inconference
Proceed-
7(4):380394, 2006. sensor network. Information Fusion,
tion in a large wireless on
ingsKnowledge
of the 13th discovery and datainternational
ACM SIGKDD mining, pagesconference
667676.
7(4):380394, ACM, 2007. discovery and data mining, pages 667676.
on Knowledge
[49] T. Palpanas, 2006.
D. Papadopoulos, V. Kalogeraki, and ACM, 2007.
D. Gunopulos.
[49] T. Palpanas, Distributed deviation detection
D. Papadopoulos, in sensor net-
V. Kalogeraki, and [56] D.-H. Tran and K.-U. Sattler. On detection of changes in
works. ACM SIGMOD
D. Gunopulos. Record,
Distributed 32(4):7782,
deviation detection2003.
in sensor net- [56] sensor data streams.
D.-H. Tran and K.-U. In Proceedings of the 9thof
Sattler. On detection International
changes in
works.Pham,
ACM SIGMOD Record, Conference on Advances
sensor data streams. in Mobile of
In Proceedings Computing and Multi-
the 9th International
[50] D.-S. S. Venkatesh, M. 32(4):7782, 2003.
Lazarescu, and S. Budha- media, pageson5057. ACM,
ditya. Pham,
Anomaly detection inM.large-scale Conference Advances in 2011.
Mobile Computing and Multi-
[50] D.-S. S. Venkatesh, Lazarescu,data
andstream net-
S. Budha- media, pages 5057. ACM, 2011.
works. Data Mining
ditya. Anomaly and Knowledge
detection Discovery,
in large-scale 28(1):145
data stream net- [57] M. van Leeuwen and A. Siebes. Streamkrimp: Detecting
189, 2014.
works. Data Mining and Knowledge Discovery, 28(1):145 [57] change
M. van in data streams.
Leeuwen and A.Machine
Siebes.Learning and Knowledge
Streamkrimp: Detecting
189,Rajagopal,
2014. Discovery in Databases,
change in data pages 672687,
streams. Machine Learning 2008.
and Knowledge
[51] R. X. Nguyen, S. C. Ergen, and P. Varaiya.
Distributed online simultaneous Discovery in Databases, pages 672687, 2008.
[51] R. Rajagopal, X. Nguyen, S. C.fault detection
Ergen, and P.forVaraiya.
multi- [58] V. Veeravalli and P. Varshney. Distributed inference in wire-
ple sensors. In Information Processing in Sensor Networks,
Distributed online simultaneous fault detection for multi- [58] less sensor networks.
V. Veeravalli Philosophical
and P. Varshney. DistributedTransactions
inference inofwire-
the
2008. IPSN08.
ple sensors. International
In Information Conference
Processing on, pages
in Sensor 133
Networks, Royal Societynetworks.
A: Mathematical, Physical and Engineering
less sensor Philosophical Transactions of the
144. IEEE, 2008.
2008. IPSN08. International Conference on, pages 133 Sciences, 370(1958):100117, 2012.
Royal Society A: Mathematical, Physical and Engineering
144.Roddick,
[52] J. IEEE, 2008.L. Al-Jadir, L. Bertossi, M. Dumas, Sciences, 370(1958):100117, 2012.
[59] A. Zhou, F. Cao, W. Qian, and C. Jin. Tracking clusters
[52] J. Roddick, L.K. Al-Jadir,
H. Gregersen, Hornsby, L.J. Bertossi,
Lufter, F.M.Mandreoli,
Dumas, [59] in
A. evolving
Zhou, F.data
Cao,streams overand
W. Qian, sliding windows.
C. Jin. Knowledge
Tracking clusters
T. Mannisto, E. Mayol, et al. Evolution
H. Gregersen, K. Hornsby, J. Lufter, and F.
change in data
Mandreoli, and Information Systems, 15(2):181214, 2008.
management:issues andetdirections. ACM in evolving data streams over sliding windows. Knowledge
T. Mannisto, E. Mayol, al. Evolution andSigmod
changeRecord,
in data and Information Systems, 15(2):181214, 2008.
29(1):2125, 2000. and directions. ACM Sigmod Record,
management:issues
29(1):2125, 2000.

SIGKDD Explorations Volume 16, Issue 1 Page 38


SIGKDD Explorations Volume 16, Issue 1 Page 38
Contextual Crowd Intelligence

Beng Chin Ooi , Kian-Lee Tan , Quoc Trung Tran , James W. L. Yip ,
Gang Chen# , Zheng Jye Ling , Thi Nguyen , Anthony K. H. Tung , Meihui Zhang
National University of Singapore National University Health System # Zhejiang University
{ooibc, tankl, tqtrung, thi, atung, zmeihui}@comp.nus.edu.sg
{james yip, zheng jye ling}@nuhs.edu.sg # cg@zju.edu.cn

ABSTRACT data-driven solution employs machine learning algorithms to


derive risk factors solely from observational data. An alternative
Most data analytics applications are industry/domain specific, e.g.,
approach combines the knowledge-driven and data-driven
predicting patients at high risk of being admitted to intensive care
approaches in the data analytics applications [31]. However, there
unit in the healthcare sector or predicting malicious SMSs in the
are exceptional situations where it is not easy to capture or
telecommunication sector. Existing solutions are based on best
formalize, and where neither general guidelines are available nor
practices, i.e., the systems decisions are knowledge-driven
rules can be derived from data (e.g., in rare conditions). Instead, it
and/or data-driven. However, there are rules and exceptional cases
is only through many years of experience can subject-matter
that can only be precisely formulated and identified by
experts (SMEs) formulate and identify these situations. The
subject-matter experts (SMEs) who have accumulated many years
challenge then is to be able to capture and utilize such knowledge
of experience. This paper envisions a more intelligent database
to effectively support industry/domain specific applications, e.g.,
management system (DBMS) that captures such knowledge to
improving the accuracy of the prediction tasks.
effectively address the industry/domain specific applications. At
the core, the system is a hybrid human-machine database engine This paper proposes building the next generation of intelligent
where the machine interacts with the SMEs as part of a feedback database management systems (DBMSs) that exploit contextual
loop to gather, infer, ascertain and enhance the database crowd intelligence. The crowd intelligence here refers to the
knowledge and processing. We discuss the challenges towards knowledge and experience of subject-matter experts (SMEs).
building such a system through examples in healthcare predictive Although such knowledge is an important component in
analysis a popular area for big data analytics. transforming data into information, it is currently not captured by
a structured system. The participants in an intelligent crowd are
domain experts rather than unknown lay-persons in existing
1. INTRODUCTION systems that use crowdsourcing as part of database query
Most data analytics applications are industry or domain specific. processing (e.g., CrowdDB [13], Deco [24], Qurk [23],
For example, many prediction tasks in healthcare require prior CDAS [12; 22]) and information extraction or knowledge
medical knowledge, such as, identifying patients at high risk of acquisition (e.g., HIGGINS [21] and CASTLE [28]). For
being admitted to the intensive care unit, or predicting the applications where data confidentiality and privacy are important
probability of the patients being readmitted into the hospital (e.g., healthcare analytics), the intelligent crowd may consist of
within 30 days after discharge. Another example from the only experts from within the organization, since the tasks cannot
telecommunication sector is the identification of malicious SMSs be outsourced to external parties. Given that the crowd is known
requiring inputs from security experts. Building competent tools apriori, there is an assurance of user accountability, which
to effectively address these problems are important, as industrial translates to an assurance in the quality of the answers. A recent
organizations face increasing pressures to improve outcomes while system, called Data Tamer [30], also proposed to leverage on
reducing costs [3]. expert crowdsourcing system to enhance machine computation but
Existing solutions to industry or domain specific tasks are based in the context of data curation. Our proposition differs from Data
on best practices. These solutions are knowledge-driven (i.e., Tamer in several aspects. First, the target applications of our work
utilizing general guidelines such existing clinical guidelines or (i.e., data analytics) are different from those in Data Tamer (i.e.,
literature from medical journals) and/or data-driven (i.e., deriving data curation). Thus, each system needs to address a unique,
rules from observational data) [31]. Let us consider the task of different set of challenges. Second, the domain experts in our
identifying the risk factors related to heart failure. The context are also users/reviewers of the system. Thus, the experts
knowledge-driven solution uses risk factors identified from are likely to take ownership and hence are motivated to improve
existing clinical knowledge or literature, such as, age, the accuracy of the analytics and the usability of the applications.
hypertension and diabetes status. However, it may miss out other This would reduce the need to localize/customize the system since
unknown risk factors specific to the population of interest. The the experts/users are continuously interacting with the system;
reason is that the guidelines are generic and based on existing these experts define the best practices for the system. For
knowledge, which results in models that may not adequately example, doctors in a particular department may use a different
represent the underlying complex disease processes in the convention or notation from another department, e.g., when
population with a comprehensive list of risk factors [31]. The doctors write PID in the orthopedic department, the acronym
refers to the Prolapsed Intervertebral Disc only and not the
Pelvic Inflammatory Disease. Clearly, such knowledge can only

SIGKDD Explorations Volume 16, Issue 1 Page 39


be provided by internal domain experts. In contrast, experts in high risk of being admitted to intensive care unit, or predicting the
Data Tamer are not the users of the system and hence there is a probability of the patients being readmitted into the hospital soon
need to customize/localize the system for different use-cases. after discharge. There are also queries that monitor real-time data
In order to entrench the crowd intelligence into the DBMS, the of patients in critical conditions for unusual conditions, such as,
system needs to keep SMEs as part of the feedback loop. The whether patients are at high risk of collapsing. With correct
system can then further utilize feedback provided from the SMEs predictions, doctors can intervene early to alleviate the
to infer, ascertain and enhance its processing, thus continuously deterioration of patients health outcome. This can potentially
improving the effectiveness of the system. For example, when reduce the burden of limited healthcare resources in the primary
predicting the risk of unplanned patient readmissions, the system and acute care facilities. For instance, if a patient is at high-risk
asks the doctors to label patients who the system has low for unplanned post discharge readmission, he can potentially
confidence in predicting their readmissions, and the benefit from close followed-up after discharge, e.g., the hospital
rules/hypotheses that the doctors used to do the labeling. One sends a case manager or nurse to examine him once every three
example of such an expert rule is that an elderly patient who lives days. In addition, important queries related to public health
alone and have had several severe diseases is likely to be surveillance can be answered in a timely fashion. For example, it
readmitted into the hospital frequently. The system would then is critical to provide real-time, early information to alert
verify or adjust these rules/hypotheses and revert back to the decision-makers of emerging threats that need to be addressed in a
doctors with evidence to support or reject their rules/hypotheses. particular population. The ultimate goal of these predictive queries
Such interactions are beneficial to both the system and the doctors. is to predict, pre-empt and prevent for better healthcare outcome.
Eventually, the application system evolves over time. SMEs
become part of this evolving process by sharing their domain 3. AN INTELLIGENT DBMS FOR BIG DA-
knowledge and rich experience, thereby contributing to the
improvement and development of the system. Hence, the experts TA ANALYTICS
are more willing and comfortable to use the system to alleviate the In this section, we discuss the challenges of addressing big data
burden of their duties. analytics and present an overview of a hybrid human-machine
This work is part of our CIIDAA project on building large scale, system for these tasks.
Comprehensive IT Infrastructure for Data-intensive Applications
and Analysis [2]. Our collaborators are clinicians in the National
3.1 Challenges of Big Data Analytics
University Health System (NUHS) [5]. The project aims to harness Essentially, many tasks of big data analytics can be viewed as
the power of cloud computing to solve big data problems in the real conventional data mining problems, such as, classifying patients
world, with healthcare predictive analytics being a popular area for into different class labels (high or low risk of being admitted to
big data analytics [26]. intensive care units). There are, however, three important aspects
that differentiate big data analytics from traditional machine
Organization. The remainder of this paper is organized as learning problems.
follows. Section 2 presents motivating examples in healthcare
predictive analytics. Section 3 discusses the architecture of an First, many valuable features for the analytics tasks are
intelligent DBMS that aims to embed contextual crowd stored in unstructured data, for example, doctors notes [25].
intelligence. Section 4 elaborates on research problems that we We cannot simply treat these notes as traditional
need to address in order to build an intelligent DBMS. Section 5 bag-of-words documents. Instead, we need powerful tools
presents our preliminary results on the problem of predicting the to extract from these documents the right entities (such as,
risk of unplanned patient readmissions. Section 6 presents the diseases, medications, laboratory tests) and domain-specific
related work. Finally, Section 7 concludes our work. relationships (such as, the relationship between a disease
and a laboratory test). The text in unstructured data has to
2. MOTIVATING EXAMPLES be contextualized to each organizations practice, e.g.,
Let us consider a hospital that has an integrated view of the medical doctors in a particular department may use a different
care records of patients as shown in Table 1. The table contains two convention or notation from another department.
types of information:
Second, there is usually a lack of training samples with
Structured information, including the case identifier, well-defined class labels. For instance, when predicting the
patients name, age, gender, race, the number of days that risk of committing suicide for each patient, the total number
the patient stayed at the hospital during a particular visit patients known to have committed suicide (i.e., class 1) is
(LengthO f Stay), and the number of days before the patient very small. However, it does not mean that all the remaining
was readmitted into the hospital after discharge patients did not commit suicide (i.e., class 0). Hence we
(Readmission) ; and need to infer the correct class labels for these patients. This
problem also occurs in other domains such as home security
Unstructured information, i.e., free-text from a doctors note
and banking. For example, one important task that many
that contains additional and useful information of a patient
national security agencies need to perform is identifying
healthcare profile such as his past medical history, social
persons or groups of people who will likely commit a
factors, previous medications, complaints of patients based
crime [4]. In this setting, the agency maintains a very small
on a doctors investigations, major lab results, issues and
set of people who have committed crime. However, we
progress, etc.
cannot simply assume that the remaining people are not
The tuples in this table are extracted from real cases of patients likely to commit crime. As before, we need to infer the
admitted to the National University Hospital (NUH) in Singapore. correct class labels for these people. Another example is in
Healthcare professionals often have queries relating to predicting telecommunication, where a service provider wants to
the severity of patients condition, such as, identifying patients at predict whether an SMS is malicious. In this case, we do not

SIGKDD Explorations Volume 16, Issue 1 Page 40


CaseID Name Age Gender Race LengthOfStay Readmission Doctors note
PMH:
1 IHD
- on GTN 0.5mg prn
2 DM
Case 1 Patient 1 71 Female Chinese 5 20 - on Metformin 750mg
- HbA1c 7.5% 09/12
3 HL
Stays with son
Social issues: Single, no child
Used to live with friend in a shophouse
Case 2 Patient 2 60 Male Malaysian 10 20 Now at sheltered home since Sept 2011.
No next-of-skin or visitor.

Table 1: Medical care table

have any predefined class labels and might need to ask Hadoop Distributed File System (HDFS) and a key-value storage
security experts to provide the class labels for some sample system, ES2 [8]) for both unstructured and structured data. The
cases. next layer (which is the security layer) enables users to protect
data privacy by encryption. The third layer (which is the
Lastly, data in different domains (e.g., healthcare, distributed processing layer) provides a distributed processing
telecommunication, home security) is expected to grow infrastructure called E3 [9] that supports different parallel
dramatically in the years ahead [26]. For instance, patients processing logics such as MapReduce [11], Directed Acyclic
in intensive care units are constantly being monitored, and Graph (DAG) and SQL. The top layer (which is the analytics
their historical records have to be retained. This can easily layer) exploits the contextual crowd intelligence for big data
result in hundreds of millions of (historical) records of analytics. The details of this layer are shown in Figure 1. In
patients. As another example, during a mass casualty Figure 2, KB is the knowledge base and iCrowd is the component
disaster (e.g., SARS, H5N1), there is an overwhelming that interacts with the domain experts. Different components of
number of patients who have to be monitored and tracked, the analytics layer (e.g., scalable machine learning algorithms) can
and information about each patient is huge by itself. process their data with the most appropriate data processing model
Furthermore, streaming data arrive continuously, e.g., new and their computations will be automatically executed in parallel
data from the real-time data feed are constantly being by the lower layers.
inserted. Hence, the system in healthcare setting must In the remaining of this paper, we focus only on the analytics layer.
provide the real-time predictions, e.g., predicting the For more details of the other layers of the epiC system, please refer
survival of patients in the next 6 hours. to [1; 10; 19].
The three above mentioned aspects call for a new generation of
intelligent DBMSs that can provide effective solutions for big data 4. RESEARCH PROBLEMS
analytics. Our proposition of exploiting contextual crowd In this section, we elaborate on the research problems that we need
intelligence is, we believe, a big step towards this goal. to address in order to build an intelligent system for big data
analytics.
3.2 Contextual Data Management
The central theme of crowd intelligence is to get domain experts 4.1 Asking Experts The Right Questions
engaged as both the participants to fine tune the system and the Given a large volume of data and a limited amount of time that
end-users of the system. Figure 1 presents an intelligent system domain experts can participate in building the systems, we need
that exploits contextual crowd intelligence for big data analytics. to ask the experts the right questions. In the context of healthcare
The system first builds a knowledge base that will be subsequently analytics, we plan to ask the following domain knowledge from
used for the analytics tasks based on historical data, domain doctors.
knowledge from SMEs (e.g., doctors), and other sources such as
general clinical guidelines. Each source contributes to build some Labelings. The system asks doctors to label tuples that the
weak classifiers. The system needs to combine these classifiers system has low confidence in performing the prediction
to derive a final classifier that achieves a high level of accuracy for task. There are two important issues here. First, doctors
prediction purposes. The system also needs to go through several have different levels of confidence when answering different
iterations of interaction with the experts to refine, for example, the questions, i.e., doctors are reluctant to assess patient profiles
final classifier. As such, the experts participate in the entire that they do not have specialties. Second, since there is so
process in fine tuning the system and decide on the best much information about patients, selecting the relevant
practices. When real-time data or feed arrives, the system feature of each patient to present to the doctors in order not
performs the prediction on-the-fly and alerts the experts to overwhelm them is also a major issue.
immediately. Hence, the experts become the end-users of the In essence, what we need is a diverse set of labeled patients
system. that covers the whole data space as much as possible. One
We have developed the epiC system [1; 10; 19] to support large possible solution is to group similar patient profiles together
scale data processing, and are extending it to support healthcare and show these groups to doctors. The purpose is to let the
analytics. Figure 2 shows the software stack of epiC. At the doctors select the groups of patients that they are
bottom, the storage layer supports different storage systems (e.g., comfortable in providing the labels. In addition, for each

SIGKDD Explorations Volume 16, Issue 1 Page 41


Classifer KB iCrowd Analytics Layer
Answers

Historical data
Classifier from Classifier from Questions
SMEs data MR DAG SQL
Distributed
Classifier from Classifier from Processing Layer
general guidelines other sources
E3
Derive
Update SMEs
Real-time data/feed
Trusted Data Service Security Layer
Predictor Knowledge
Base

Recommendations Storage Layer


HDFS ES2
Feedback

Figure 1: Contextual crowd intelligence for big data analytics. Figure 2: The software stack of epiC for big data analyt-
ics.

group, we present only the features which the patients in the lab tests) from various medical dictionaries (a.k.a. knowledge
group have similar values. In this way, we can avoid base), such as, the Unified Medical Language System (UMLS) [6].
overwhelming the doctors with information. Note that, in We now discuss several problems raised due to the nature of the
some cases, we need to perform hierarchical clustering to unstructured data and the incompleteness of the knowledge base,
reduce the number of patients shown to the doctors each and subsequently discuss a hybrid human-machine approach to
time. Selecting the right clustering algorithms and solve these problems. The discussion uses the following running
developing effective visualization tools to present patients example. We run cTAKES on the doctors note of patient 1 (in
profiles are important here. Table 1), and obtain the following clinical entities: (1) diseases:
IHD (Ischemic Heart Disease) and DM; (2) medications: GTN
Rules/Hypotheses. The system collects expert
and Metformin; and (3) laboratory test: HbA1c.
rules/hypotheses that the doctors used to do the labeling.
For example, to predict the risk of unplanned patient Ambiguous mentions. In many cases, a mention in the free text
readmissions, the doctors suggested a hypothesis that social may refer to different domain entities. For instance, in the running
factors and the status of the diseases are important risk example, DM refers to two different diseases Dystrophy
indicators for readmission. The system would then verify or Myotonic and Diabetes Mellitus. We note that this problem is
adjust thes