Published by: Herman Couwenbergh on Oct 14, 2013
Copyright:Attribution Non-commercial

Timeline Generation: Tracking individuals on Twitter
Jiwei Li
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
bdlijiwei@gmail.com

Claire Cardie
Department of Computer Science, Cornell University, Ithaca, NY 14850
cardie@cs.cornell.edu
ABSTRACT
We have always been eager to keep track of what happens to people we are interested in. For example, businessmen wish to know the background of their competitors, and fans wish to get the latest news about their favorite movie stars or athletes. However, timeline generation for individuals, especially ordinary individuals (not celebrities), remains an open problem due to the lack of available data. Twitter, where users report real-time events in their daily lives, serves as a potentially important source for this task. In this paper, we explore the task of individual timeline generation from Twitter and try to create a chronological list of personal important events (PIE) of individuals based on the tweets they publish. By analyzing individual tweet collections, we find that the tweets suitable for inclusion in a personal timeline are those about personal (as opposed to public) and time-specific (as opposed to time-general) topics. To extract these types of topics, we introduce a non-parametric model, the Dirichlet Process mixture model (DPM), to recognize four types of tweets: personal time-specific (PersonTS), personal time-general (PersonTG), public time-specific (PublicTS) and public time-general (PublicTG) topics, which, in turn, are used for personal event extraction and timeline generation. For evaluation, we build new gold-standard timelines based on Twitter and Wikipedia that contain PIE-related events from 20 ordinary Twitter users and 20 celebrities. Experiments on real Twitter data quantitatively demonstrate the effectiveness of our method.
Categories and Subject Descriptors
H.0 [Information Systems]: General
General Terms
Algorithm, Performance, Experimentation
Keywords
Event extraction, Individual Timeline Generation, Dirichlet Process, Twitter
1. INTRODUCTION
We have always been eager to keep track of people we are interested in. Businessmen want to know the past of their competitors, so as to know exactly who they are competing with. Employees want to know what has happened to their boss, so that they can behave more appropriately. More generally, fans, especially young people, are always keen to get the latest news about their favorite movie stars or athletes. To date, however, building a detailed chronological list or table of key events in the life of an individual remains largely a manual task. Existing automatic techniques for personal event identification, for example, rely on documents produced via a web search on the person's name [1, 6, 12, 31]. Web-search-based approaches not only fail to include important information neglected by the online media but, more importantly, are restricted to celebrities, whose information is collected by online media, and can never be applied to ordinary individuals.

Fortunately, Twitter (https://twitter.com/), a popular social network, serves as an alternative, and potentially very rich, source of information for this task: people usually publish tweets describing their lives in detail or chat with friends on Twitter. Figures 1 and 2 give examples of users talking about what happened to them on Twitter. The first shows an NBA basketball player, Dwight Howard (https://twitter.com/DwightHoward), tweeting about being signed by the basketball franchise Houston Rockets; the second shows an ordinary Twitter user recording her acceptance by Harvard University. We know of no previous work that exploits Twitter to track events in the lives of individuals.

Figure 1: Example of PIE for basketball star Dwight Howard. Tweet labeled in red: PIE about Dwight joining the Houston Rockets.

Figure 2: Example of PIE for an ordinary Twitter user. Tweet labeled in red: PIE about getting accepted by Harvard University.

arXiv:1309.7313v1 [cs.SI] 27 Sep 2013

To do so, we have to answer the following question: what types of events reported on Twitter by an individual should be regarded as personal, important events (PIE) and, thus, be suitable for inclusion in the event timeline of an individual? In the current work, we specify three criteria for PIE extraction. First, each PIE should be an important event, that is, an event that is referred to multiple times by an individual and his or her followers. Second, each PIE should be a time-specific event: a unique (rather than general, recurring) event that is delineated by specific start and end points. Consider, for instance, the Twitter user in Figure 2: she frequently published tweets about being accepted by Harvard, but only around the time she received the acceptance letter. As a result, these tweets refer to a time-specific PIE. In contrast, her exercise regime, about which she tweets regularly (e.g. "11.5 km bike ride", "15 mins Yoga stretch"), is not considered a PIE; it is more of a general interest.

Third, the PIEs identified for an individual should be personal events (i.e. events of interest to the individual or to his or her followers) rather than events of interest to the general public. For instance, most people pay attention to and discuss public events such as "the U.S. election". For an ordinary person, we do not want "the U.S. election" to be identified as a PIE no matter how frequently he or she tweets about it; it remains a public event, not a personal one. However, things become more complicated because of the public nature of stardom: sometimes an otherwise public event can constitute a PIE for a celebrity. For example, "the U.S. election" should not be treated as a PIE for ordinary individuals, but should be treated as a PIE for Barack Obama and Mitt Romney.

            time-specific   time-general
public      PublicTS        PublicTG
personal    PersonTS        PersonTG

Table 1: Types of tweets on Twitter

Given the above criteria, we aim to characterize tweets into one of four event types: public time-specific (PublicTS), public time-general (PublicTG), personal time-specific (PersonTS) and personal time-general (PersonTG), as shown in Table 1. In doing so, we can then identify PIE-related events based on the following criteria:
1. For an ordinary Twitter user, the PIEs are his or her PersonTS events.
2. For a celebrity, the PIEs are the PersonTS events along with his or her celebrity-related PublicTS events.
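The two selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `is_pie` and `build_timeline`, the `celebrity_related` flag, and the example tweets are hypothetical and made up for the sketch. The text above does not say how "celebrity-related" is decided, so the flag is simply taken as given.

```python
from datetime import date


def is_pie(tweet_type, is_celebrity, celebrity_related=False):
    """Apply the two PIE selection rules to one labeled tweet.

    tweet_type is one of "PublicTS", "PublicTG", "PersonTS", "PersonTG".
    celebrity_related is a hypothetical flag supplied by the caller.
    """
    if tweet_type == "PersonTS":
        return True  # rule 1 (and the first half of rule 2)
    if tweet_type == "PublicTS" and is_celebrity and celebrity_related:
        return True  # second half of rule 2, celebrities only
    return False


def build_timeline(labeled_tweets, is_celebrity):
    """Keep the PIE tweets and order them chronologically.

    labeled_tweets: iterable of (day, tweet_type, celebrity_related, text).
    """
    selected = [tw for tw in labeled_tweets if is_pie(tw[1], is_celebrity, tw[2])]
    return sorted(selected, key=lambda tw: tw[0])


# Made-up example data, for illustration only.
example = [
    (date(2013, 7, 13), "PersonTS", False, "signed with a new team"),
    (date(2012, 11, 6), "PublicTS", False, "election night"),
    (date(2013, 3, 2), "PersonTG", False, "15 mins yoga stretch"),
]
timeline = build_timeline(example, is_celebrity=False)
```

For an ordinary user only the PersonTS tweet survives; the same PublicTS tweet would be kept only for a celebrity and only when it is flagged as celebrity-related.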
Topic extraction (both local and global) on Twitter is not a new task. Among existing approaches, Bayesian topic models such as LDA [3], Labeled LDA [24] and HDP [29] have been widely used for topic mining on Twitter due to their ability to mine latent topics hidden in tweet datasets [7, 9, 13, 18, 20, 23, 25, 35]. Topic models provide a principled way to discover the topics hidden in a text collection and seem well suited to our personal information analysis task.

Based on topic approaches, in this paper we introduce a non-parametric topic model, named the multi-level Dirichlet Process Mixture Model (DPM), to identify tweets associated with the PublicTS, PublicTG, PersonTS and PersonTG events of individual Twitter users by modeling the combination of temporal information (to distinguish time-specific from time-general events) and user information (to distinguish public from private events) in the joint Twitter feed. The point of the DP mixture model is to allow components (or topics) to be shared across the corpus while information at a specific level (i.e. user and time) is emphasized. Based on the topic distribution from the DPM model, we characterize events (topics) according to the criteria mentioned above and select the tweet that best represents each PIE topic for the timeline.

To evaluate our approach, we manually generate gold-standard PIE timelines. Since the criteria for ordinary Twitter users and celebrities are slightly different (whether related PublicTS events should be considered), we generate two PIE timelines, one for ordinary Twitter users, called TwitSet_O, and the other, called TwitSet, for celebrity Twitter users, each of which includes 20 people from their respective Twitter streams. The PIE timelines cover a 21-month interval during 2011-2013. For celebrities, in addition to the Twitter stream, we also generate a gold-standard timeline, WikiSet, according to Wikipedia entries. In sum, this research makes the following main contributions:

• We create gold-standard timelines that contain PIEs for famous Twitter users and ordinary Twitter users based on the Twitter stream and Wikipedia (for details see Section 5.2). To the best of our knowledge, our dataset is the first gold standard for personal event extraction or timeline generation on Twitter.
• We introduce a non-parametric algorithm based on the Dirichlet Process for individual timeline generation on the Twitter stream. Our approach outperforms multiple baselines. It can be extended to any individual (e.g. friend, competitor or movie star), provided he or she has a Twitter account.

The remainder of this paper is organized as follows: Section 2 briefly introduces Dirichlet Processes and Hierarchical Dirichlet Processes. Section 4 presents the algorithm for timeline generation and Section 5 describes our dataset and the creation of gold-standard timelines. Section 6 presents the experimental results. Section 7 briefly discusses related work and we conclude this paper in Section 8.
2. DP AND HDP
In this section, we briefly introduce DP and HDP. A Dirichlet Process (DP) can be considered a distribution over distributions [8]. A DP, denoted by $DP(\alpha, G_0)$, is parameterized by a base measure $G_0$ and a concentration parameter $\alpha$. We write $G \sim DP(\alpha, G_0)$ for a draw of a distribution $G$ from the Dirichlet process. Sethuraman [28] showed that a measure $G$ drawn from a DP is discrete, via the following stick-breaking construction:

$$\{\phi_k\}_{k=1}^{\infty} \sim G_0, \qquad \pi \sim GEM(\alpha), \qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k} \tag{1}$$

The discrete set of atoms $\{\phi_k\}_{k=1}^{\infty}$ is drawn from the base measure $G_0$, and $\delta_{\phi_k}$ is the probability measure concentrated at $\phi_k$. $GEM(\alpha)$ refers to the following process:

$$\hat{\pi}_k \sim Beta(1, \alpha), \qquad \pi_k = \hat{\pi}_k \prod_{i=1}^{k-1} (1 - \hat{\pi}_i) \tag{2}$$

We successively draw $\theta_1, \theta_2, \ldots$ from the measure $G$. Let $m_k$ denote the number of draws that take the value $\phi_k$. After observing draws $\theta_1, \ldots, \theta_{n-1}$ from $G$, the posterior of $G$ is still a DP (here $\alpha_0$ plays the role of the concentration parameter):

$$G \mid \theta_1, \ldots, \theta_{n-1} \sim DP\left(\alpha_0 + n - 1, \; \frac{\sum_k m_k \delta_{\phi_k} + \alpha_0 G_0}{\alpha_0 + n - 1}\right) \tag{3}$$

HDP uses multiple DPs to model multiple correlated corpora. In HDP, a global measure $G_0$ is drawn from a base measure $H$. Each document $d$ is associated with a document-specific measure $G_d$, which is drawn from the global measure $G_0$. This process can be summarized as follows:

$$G_0 \sim DP(\alpha, H), \qquad G_d \mid G_0, \gamma \sim DP(\gamma, G_0) \tag{4}$$

Given $G_d$, the words $w$ within document $d$ are drawn from the following mixture model:

$$\{\theta_w\} \sim G_d, \qquad w \sim Multi(w \mid \theta_w) \tag{5}$$

Eq. (4) and Eq. (5) together define the HDP. According to Eq. (1), $G_0$ has the form $G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}$, where $\phi_k \sim H$ and $\beta \sim GEM(\alpha)$. Then $G_d$ can be constructed as

$$G_d = \sum_{k=1}^{\infty} \pi_{dk} \delta_{\phi_k}, \qquad \pi_d \mid \beta, \gamma \sim DP(\gamma, \beta) \tag{6}$$
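The stick-breaking construction of Eqs. (1) and (2) can be simulated directly once the infinite sum is truncated. The sketch below is a minimal illustration, not the authors' implementation: the truncation level, the value of `alpha`, and the choice of a standard normal as base measure $G_0$ are all assumptions made for the example.

```python
import random


def stick_breaking_weights(alpha, num_sticks, rng):
    """Return the first num_sticks weights pi_k of Eq. (2) and the unbroken remainder.

    pi_k = pihat_k * prod_{i<k} (1 - pihat_i), with pihat_k ~ Beta(1, alpha).
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(num_sticks):
        fraction = rng.betavariate(1.0, alpha)  # pihat_k
        weights.append(fraction * remaining)    # pi_k
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=50, rng=rng)

# Atoms phi_k ~ G_0; a standard normal base measure is assumed for illustration.
atoms = [rng.gauss(0.0, 1.0) for _ in weights]

# A draw theta from the truncated G picks atom k with probability pi_k.
theta = rng.choices(atoms, weights=weights, k=1)[0]
```

The weights and the leftover stick always sum to one, and a smaller `alpha` concentrates more of the mass on the first few atoms.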
Figure 3: Graphical model of (a) DP and (b) HDP
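A useful consequence of the posterior in Eq. (3) is the predictive rule for the next draw, since the base measure of the posterior DP is also the distribution of $\theta_n$ given the earlier draws. The block below states that rule and works one small case. The numerical values ($\alpha_0 = 1$, three earlier draws) are illustrative assumptions, and the last probability presumes a non-atomic $G_0$, so that a fresh draw from $G_0$ is a new atom.

```latex
% Predictive distribution implied by the posterior in Eq. (3)
\theta_n \mid \theta_1, \ldots, \theta_{n-1} \sim
  \sum_{k} \frac{m_k}{\alpha_0 + n - 1}\, \delta_{\phi_k}
  + \frac{\alpha_0}{\alpha_0 + n - 1}\, G_0

% Illustrative case: \alpha_0 = 1, n - 1 = 3 earlier draws, m_1 = 2, m_2 = 1
P(\theta_4 = \phi_1) = \tfrac{2}{4}, \qquad
P(\theta_4 = \phi_2) = \tfrac{1}{4}, \qquad
P(\theta_4 \text{ is a new atom from } G_0) = \tfrac{1}{4}
```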
3. DPM MODEL
In this section, we describe the DPM model in detail. Suppose that we collect tweets from a set of users, and that each user's tweet stream is segmented into time periods (epochs), each one week long. For user $i$ and epoch $t$, $\{v_{tij}\}_{j=1}^{n_{ti}}$ denotes the collection of tweets that user $i$ publishes, retweets, or is mentioned in (@ed) during epoch $t$. Each tweet $v$ consists of a series of words, $v = \{w_i\}_{i=1}^{n_v}$, where $n_v$ denotes the number of words in the tweet. The union of these collections over all epochs $t$ is the tweet collection published by user $i$, and the union over all users $i$ is the tweet collection published at epoch $t$. $V$ is the vocabulary size.
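The indexing scheme described above (tweets grouped per user and weekly epoch, then pooled per user and per epoch) can be sketched as follows. The `Tweet` class, the function names, the start date, and the example tweets are all hypothetical; the paper does not specify how weeks are aligned, so counting seven-day blocks from an arbitrary start date is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tweet:
    user: str
    day: date
    words: tuple


def epoch_of(day, start):
    """Index t of the one-week epoch containing `day`, counted from `start`."""
    return (day - start).days // 7


def group_tweets(tweets, start):
    """Build the three views of the corpus: per (epoch, user), per user, per epoch."""
    by_user_epoch = defaultdict(list)  # (t, i) -> tweets of user i in epoch t
    by_user = defaultdict(list)        # i -> union over all epochs t
    by_epoch = defaultdict(list)       # t -> union over all users i
    for tw in tweets:
        t = epoch_of(tw.day, start)
        by_user_epoch[(t, tw.user)].append(tw)
        by_user[tw.user].append(tw)
        by_epoch[t].append(tw)
    return by_user_epoch, by_user, by_epoch


# Made-up example data, for illustration only.
start = date(2013, 1, 1)
tweets = [
    Tweet("alice", date(2013, 1, 2), ("got", "accepted")),
    Tweet("alice", date(2013, 1, 9), ("bike", "ride")),
    Tweet("bob", date(2013, 1, 3), ("election", "night")),
]
by_user_epoch, by_user, by_epoch = group_tweets(tweets, start)
```

Keeping all three views makes it cheap to look up the tweets relevant to a user-specific, a time-specific, or a user-and-time-specific measure.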
3.1 DPM Model
In DPM, each tweet $v$ is associated with parameters $x_v$, $y_v$ and $z_v$, respectively denoting whether it is public or personal, whether it is time-general or time-specific, and its topic (see footnote 3). We use four different kinds of measures, which can be interpreted as different distributions over topics, to model the four types of tweets according to their $x$ and $y$ values. Each measure represents a unique distribution over topics.

Suppose that $v$ is published by user $i$ at time $t$. $x_v$ and $y_v$ follow binomial distributions with parameters $\pi_{ix}$ and $\pi_{ty}$ respectively, which can be interpreted as the preference for publishing tweets about personal or public information and about time-general or time-specific information. These parameters have Beta priors $\eta_x$ and $\eta_y$: $\pi_{ix} \sim Beta(\eta_x)$, $\pi_{ty} \sim Beta(\eta_y)$.

        y=0         y=1
x=0     PublicTG    PublicTS
x=1     PersonTG    PersonTS

Table 2: Tweet type according to x (public or personal) and y (time-general or time-specific)
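The mapping in Table 2 and the Beta-binomial draws for $x_v$ and $y_v$ can be sketched as below. This is a hedged illustration only: the function names are hypothetical, a single binomial trial (a Bernoulli draw) per tweet is assumed, and because the text writes $Beta(\eta)$ with one argument, a symmetric $Beta(\eta, \eta)$ prior is assumed here.

```python
import random

# Table 2: (x, y) -> tweet type
TWEET_TYPES = {
    (0, 0): "PublicTG",
    (0, 1): "PublicTS",
    (1, 0): "PersonTG",
    (1, 1): "PersonTS",
}


def draw_preference(eta, rng):
    """pi ~ Beta(eta, eta); the symmetric form is an assumption of this sketch."""
    return rng.betavariate(eta, eta)


def draw_indicator(pi, rng):
    """One binomial (Bernoulli) trial that returns 1 with probability pi."""
    return 1 if rng.random() < pi else 0


rng = random.Random(0)
pi_x = draw_preference(1.0, rng)  # preference for personal (x=1) over public (x=0)
pi_y = draw_preference(1.0, rng)  # preference for time-specific (y=1) over time-general (y=0)

labels = [
    TWEET_TYPES[(draw_indicator(pi_x, rng), draw_indicator(pi_y, rng))]
    for _ in range(5)
]
```

The preferences are drawn once and then reused for every tweet, which is what lets them act as a persistent tendency rather than per-tweet noise.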
The key to the DPM model is how to model the different measures (topic distributions) with regard to user and time information (the different values of $x$ and $y$). Our idea is illustrated in Figure 4. There is a global measure $G_0$, which is drawn from the base measure $H$. $G_0$ is the measure over topics that any user at any time can talk about. A PublicTG topic ($x=0$, $y=0$) is drawn directly from $G_0$ (also written as $G_{(0,0)}$). For each time $t$, there is a time-specific measure $G_t$ (also written as $G_{(0,1)}$) that describes topics discussed just at that time. $G_t$ is drawn from the global measure $G_0$. Similarly, for each user $i$, a user-specific measure $G_i$ (also written as $G_{(1,0)}$) is drawn from $G_0$. Further, the personal time-specific topic measure $G_{ti}$ ($G_{(1,1)}$) is drawn from $G_i$. As we can see, all tweets from all users across all time epochs share the same infinite set of mixing components (or topics). The difference lies in the mixing weights of the four types of measures $G_0$, $G_t$, $G_i$ and $G_{ti}$. The whole point of the DP mixture model is to allow components to be shared across the corpus while information at a specific level (i.e. user and time) can be emphasized. The plate diagram and generative story are illustrated in Figures 4 and 5. $\alpha$, $\gamma$, $\mu$ and $\kappa$ are hyperparameters for the Dirichlet Processes.
Figure 4: Graphical illustration of DPM model.
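The hierarchy of measures ($G_0$, then $G_t$ and $G_i$ drawn from $G_0$, then $G_{ti}$ drawn from $G_i$) can be approximated with a finite truncation, under which a DP whose base measure puts weights $\beta$ on $K$ shared atoms reduces to a Dirichlet distribution with parameters proportional to $\beta$. The sketch below is not the authors' inference procedure. The truncation level, the hyperparameter values, and in particular the pairing of $\gamma$, $\mu$ and $\kappa$ with specific levels are assumptions, since the text lists the hyperparameters without saying which DP each one governs.

```python
import random


def stick_breaking(alpha, num_atoms, rng):
    """Truncated GEM(alpha) weights; leftover mass is folded into the last atom."""
    weights, remaining = [], 1.0
    for _ in range(num_atoms - 1):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights


def child_weights(concentration, parent, rng):
    """Weights of a child measure ~ DP(concentration, parent) over the same atoms.

    Finite approximation: Dirichlet(concentration * parent), sampled by
    normalising independent Gamma draws.
    """
    draws = []
    for p in parent:
        shape = concentration * p
        draws.append(rng.gammavariate(shape, 1.0) if shape > 0.0 else 0.0)
    total = sum(draws)
    if total == 0.0:  # degenerate case: fall back to the parent weights
        return list(parent)
    return [d / total for d in draws]


rng = random.Random(0)
K = 20                                # truncation level (assumed)
alpha, gamma, mu, kappa = 1.0, 5.0, 5.0, 5.0  # illustrative values only

g0 = stick_breaking(alpha, K, rng)    # global weights (PublicTG)
g_t = child_weights(gamma, g0, rng)   # one time-specific measure (PublicTS)
g_i = child_weights(mu, g0, rng)      # one user-specific measure (PersonTG)
g_ti = child_weights(kappa, g_i, rng) # personal time-specific measure (PersonTS)
```

All four weight vectors index the same `K` topics, which mirrors the point made above: the components are shared, and only the mixing weights differ between levels.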
3. We follow the work of Gruber et al. (2007) and assume that all words in a tweet are generated by the same topic.
