Twitter Flow - Who Says What to Whom on Twitter

Published by: Samvel Martirosyan on Mar 29, 2011
Copyright:Attribution Non-commercial

Who Says What to Whom on Twitter
Shaomei Wu
Cornell University, USA
sw475@cornell.edu
Jake M. Hofman
Yahoo! Research, NY, USA
hofman@yahoo-inc.com
Winter A. Mason
Yahoo! Research, NY, USA
winteram@yahoo-inc.com
Duncan J. Watts
Yahoo! Research, NY, USA
djw@yahoo-inc.com
ABSTRACT
We study several longstanding questions in media communications research, in the context of the microblogging service Twitter, regarding the production, flow, and consumption of information. To do so, we exploit a recently introduced feature of Twitter, known as Twitter Lists, to distinguish between elite users, by which we mean specifically celebrities, bloggers, and representatives of media outlets and other formal organizations, and ordinary users. Based on this classification, we find a striking concentration of attention on Twitter (roughly 50% of tweets consumed are generated by just 20K elite users), where the media produces the most information, but celebrities are the most followed. We also find significant homophily within categories: celebrities listen to celebrities, while bloggers listen to bloggers, etc.; however, bloggers in general rebroadcast more information than the other categories. Next we re-examine the classical “two-step flow” theory of communications, finding considerable support for it on Twitter, but also some interesting differences. Third, we find that URLs broadcast by different categories of users or containing different types of content exhibit systematically different lifespans. And finally, we examine the attention paid by the different user categories to different news topics.
Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems; J.4 [Social and Behavioral Sciences]: Sociology
General Terms
two-step flow, communications, classification
Keywords
Communication networks, Twitter, information flow
Part of this research was performed while the author was visiting Yahoo! Research, New York.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WWW ’11, Hyderabad, India
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.
1. INTRODUCTION
A longstanding objective of media communications research is encapsulated by what is known as Lasswell’s maxim: “who says what to whom in what channel with what effect” [9], so named for one of the pioneers of the field, Harold Lasswell. Although simple to state, Lasswell’s maxim has proven difficult to satisfy in the more than 60 years since he stated it, in part because it is generally difficult to observe information flows in large populations, and in part because different channels have very different attributes and effects. As a result, theories of communications have tended to focus either on “mass” communication, defined as “one-way message transmissions from one source to a large, relatively undifferentiated and anonymous audience,” or on “interpersonal” communication, meaning a “two-way message exchange between two or more individuals” [13].
Correspondingly, debates among communication theorists have tended to revolve around the relative importance of these two putative modes of communication. For example, whereas early theories such as the so-called “hypodermic model” posited that mass media exerted direct and relatively strong effects on public opinion, mid-century researchers [10, 6, 11, 4] argued that the mass media influenced the public only indirectly, via what they called a two-step flow of communications, where the critical intermediate layer was occupied by a category of media-savvy individuals called opinion leaders. The resulting “limited effects” paradigm was subsequently challenged by a new generation of researchers [5], who claimed that the real importance of the mass media lay in its ability to set the agenda of public discourse. But in recent years rising public skepticism of mass media, along with changes in media and communication technology, has tilted conventional academic wisdom once more in favor of interpersonal communication, a shift that some identify as a “new era” of minimal effects [2].
Recent changes in technology, however, have increasingly undermined the validity of the mass vs. interpersonal dichotomy itself. On the one hand, over the past few decades mass communication has experienced a proliferation of new channels, including cable television, satellite radio, specialist book and magazine publishers, and of course an array of web-based media such as sponsored blogs, online communities, and social news sites. Correspondingly, the traditional mass audience once associated with, say, network television has fragmented into many smaller audiences, each of which increasingly selects the information to which it is exposed, and in some cases generates the information itself. Meanwhile, in the opposite direction, interpersonal communication has become increasingly amplified through personal blogs, email lists, and social networking sites to afford individuals ever-larger audiences. Together, these two trends have greatly obscured the historical distinction between mass and interpersonal communications, leading some scholars to refer instead to “masspersonal” communications [13].
Nowhere is the erosion of traditional categories more apparent than in the micro-blogging platform Twitter. To illustrate, the top ten most-followed users on Twitter are not corporations or media organizations, but individual people, mostly celebrities. Moreover, these individuals communicate directly with their millions of followers, often via accounts managed by themselves or by publicists, thus bypassing the traditional intermediation of the mass media between celebrities and fans. Next, in addition to conventional celebrities, a new class of “semi-public” individuals like bloggers, authors, journalists, and subject matter experts has come to occupy an important niche on Twitter, in some cases becoming more prominent than traditional public figures such as entertainers and elected officials. Third, in spite of these shifts away from centralized media power, media organizations, along with corporations, governments, and NGOs, all remain well represented among highly followed users, and are often extremely active. And finally, Twitter is primarily made up of many millions of users who seem to be ordinary individuals communicating with their friends and acquaintances in a manner largely consistent with traditional notions of interpersonal communication.
Twitter, therefore, represents the full spectrum of communications from personal and private to “masspersonal” to traditional mass media. Consequently it provides an interesting context in which to address Lasswell’s maxim, especially as Twitter, unlike television, radio, and print media, enables one to easily observe information flows among the members of its ecosystem. Unfortunately, however, the kinds of effects that are of most interest to communications theorists, such as changes in behavior, attitudes, etc., remain difficult to measure on Twitter. Therefore in this paper we limit our focus to the “who says what to whom” part of Lasswell’s maxim.
To this end, our paper makes three main contributions:
• We introduce a method for classifying users using Twitter Lists into “elite” and “ordinary” users, further classifying elite users into one of four categories of interest: media, celebrities, organizations, and bloggers.
• We investigate the flow of information among these categories, finding that although audience attention is highly concentrated on a minority of elite users, much of the information they produce reaches the masses indirectly via a large population of intermediaries.
• We find that different categories of users place slightly different emphasis on different types of content, and that different content types exhibit dramatically different characteristic lifespans, ranging from less than a day to months.
The remainder of the paper proceeds as follows. In the next section, we review related work. In section 3 we discuss our data and methods, including section 3.3 in which we describe how we use Twitter Lists to classify users, outline two different sampling methods, and show that they deliver qualitatively similar results. In section 4 we analyze the production of information on Twitter, particularly who pays attention to whom. In section 4.1, we revisit the theory of the two-step flow (arguably the dominant theory of communications for much of the past 50 years), finding considerable support for the theory as well as some interesting differences. In section 5, we consider “who listens to what”, examining first who shares what kinds of media content, and second the lifespan of URLs as a function of their origin and their content. Finally, in section 6 we conclude with a brief discussion of future work.
2. RELATED WORK
Aside from the communications literature surveyed above, a number of recent papers have examined information diffusion on Twitter. Kwak et al. [8] studied the topological features of the Twitter follower graph, concluding from the highly skewed nature of the distribution of followers and the low rate of reciprocated ties that Twitter more closely resembled an information sharing network than a social network, a conclusion that is consistent with our own view. In addition, Kwak et al. compared three different measures of influence (number of followers, page-rank, and number of retweets), finding that the ranking of the most influential users differed depending on the measure. In a similar vein, Cha et al. [3] compared three measures of influence (number of followers, number of retweets, and number of mentions) and also found that the most followed users did not necessarily score highest on the other measures. Weng et al. [15] compared number of followers and page-rank with a modified page-rank measure which accounted for topic, again finding that ranking depended on the influence measure. Finally, Bakshy et al. [1] studied the distribution of retweet cascades on Twitter, finding that although users with large follower counts and past success in triggering cascades were on average more likely to trigger large cascades in the future, these features are in general poor predictors of future cascade size.
Our paper differs from this earlier work by shifting attention from the ranking of individual users in terms of various influence measures to the flow of information among different categories of users. In particular, we are interested in identifying “elite” users, whom we differentiate from “ordinary” users in terms of their visibility, and understanding their role in introducing information into Twitter, as well as how information originating from traditional media sources reaches the masses.
3. DATA AND METHODS
3.1 Twitter Follower Graph
In order to understand how information is transmitted on Twitter, we need to know the channels by which it flows; that is, who is following whom on Twitter. To this end, we used the follower graph studied by Kwak et al. [8], which included 42M users and 1.5B edges. This data represents a crawl of the graph seeded with all users on Twitter as observed by July 31st, 2009, and is publicly available¹. As reported by Kwak et al. [8], the follower graph is a directed network characterized by highly skewed distributions both of in-degree (# followers) and out-degree (# “friends”, Twitter notation for how many others a user follows); however, the out-degree distribution is even more skewed than the in-degree distribution. In both friend and follower distributions, for example, the median is less than 100, but the maximum # friends is several hundred thousand, while a small number of users have millions of followers. In addition, the follower graph is also characterized by extremely low reciprocity (roughly 20%); in particular, the most-followed individuals typically do not follow many others. The Twitter follower graph, in other words, does not conform to the usual characteristics of social networks, which exhibit much higher reciprocity and far less skewed degree distributions [7], but instead resembles more the mixture of one-way mass communications and reciprocated interpersonal communications described above.
¹ The data is free to download from http://an.kaist.ac.kr/traces/WWW2010.html
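To make the quantities in this subsection concrete, the following is a minimal sketch of how in-degree, out-degree, and reciprocity could be computed from a directed edge list. It is an illustration only, not the procedure used by Kwak et al. [8] or in this paper: the function name `degree_stats`, the `(follower, followee)` pair format, and the definition of reciprocity as the fraction of directed edges whose reverse edge also exists are all assumptions, and the toy edge list is invented.

```python
from collections import Counter
from statistics import median


def degree_stats(edges):
    """Summarize a directed follower graph.

    `edges` is an iterable of (follower, followee) pairs; this input
    format is an assumption made for the sketch.
    """
    edge_set = set(edges)                         # drop duplicate edges
    in_deg = Counter(v for _, v in edge_set)      # number of followers
    out_deg = Counter(u for u, _ in edge_set)     # number of "friends"
    users = set(in_deg) | set(out_deg)

    # A Counter returns 0 for users with no followers or no friends.
    followers = [in_deg[u] for u in users]
    friends = [out_deg[u] for u in users]

    # Assumed definition: share of directed edges that are reciprocated.
    reciprocated = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    reciprocity = reciprocated / len(edge_set) if edge_set else 0.0

    return {
        "median_followers": median(followers),
        "max_followers": max(followers),
        "median_friends": median(friends),
        "max_friends": max(friends),
        "reciprocity": reciprocity,
    }


# Invented toy graph: user "c" is followed by three users but follows one.
toy_edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("a", "b")]
stats = degree_stats(toy_edges)
print(stats)
```

On a graph of the size described here (42M users, 1.5B edges), an in-memory set of edges would not be practical; the sketch is meant only to pin down what each statistic measures.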
3.2 Twitter Firehose
In addition to the follower graph, we are interested in the content being shared on Twitter, particularly URLs, and so we examined the corpus of all 5B tweets generated over a 223 day period from July 28, 2009 to March 8, 2010 using data from the Twitter “firehose,” the complete stream of all tweets². Because our objective is to understand the flow of information, it is useful for us to restrict attention to tweets containing URLs, for two reasons. First, URLs add easily identifiable tags to individual tweets, allowing us to observe when a particular piece of content is either retweeted or subsequently reintroduced by another user. And second, because URLs point to online content outside of Twitter, they provide a much richer source of variation than is possible in the typical 140 character tweet. Finally, we note that almost all URLs broadcast on Twitter have been shortened using one of a number of URL shorteners, of which the most popular is http://bit.ly/. From the total of 5B tweets recorded during our observation period, therefore, we focus our attention on the subset of 260M containing bit.ly URLs.
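The restriction to tweets containing bit.ly URLs can be illustrated with a short filter. This is a sketch under stated assumptions rather than the pipeline actually used: the regular expression, the function names `bitly_urls` and `filter_bitly_tweets`, and the representation of a tweet as a dictionary with a `text` field are all hypothetical, and the sample tweets are invented.

```python
import re

# Assumed pattern for a bit.ly short link: scheme, host, then a short token.
BITLY_RE = re.compile(r"https?://bit\.ly/[A-Za-z0-9_-]+")


def bitly_urls(text):
    """Return every bit.ly URL found in a tweet's text."""
    return BITLY_RE.findall(text)


def filter_bitly_tweets(tweets):
    """Yield (tweet, urls) for tweets containing at least one bit.ly URL.

    Each tweet is assumed to be a dict with a "text" field.
    """
    for tweet in tweets:
        urls = bitly_urls(tweet["text"])
        if urls:
            yield tweet, urls


# Invented sample tweets for illustration.
sample_tweets = [
    {"id": 1, "text": "Breaking news http://bit.ly/abc123 via @cnn"},
    {"id": 2, "text": "no links here"},
    {"id": 3, "text": "see http://example.com/page"},
]
kept = list(filter_bitly_tweets(sample_tweets))
print(kept)
```

Because the same short URL acts as a tag on every tweet that carries it, grouping the kept tweets by URL would then show when a piece of content is retweeted or reintroduced.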
3.3 Twitter Lists
Our method for classifying users exploits a relatively recent feature of Twitter: Twitter Lists. Since the feature launched on November 2, 2009, Twitter Lists have been welcomed by the community as a way to group people and organize one’s incoming stream of tweets by specific sets of users. To create a Twitter List, a user needs to provide a name (required) and description (optional) for the list, and decide whether the new list is public (anyone can view and subscribe to this list) or private (only the list creator can view or subscribe to this list). Once a list is created, the user can add/edit/delete list members. As the purpose of Twitter Lists is to help users organize users they follow, the name of the list can be considered a meaningful label for the listed users. List creation therefore effectively applies the “wisdom of crowds” [12] to the task of classifying users, both in terms of their importance to the community (number of lists on which they appear), and also how they are perceived (e.g. news organization vs. celebrity, etc.).
Before describing our methods for classifying users in terms of the lists on which they appear, we emphasize that we are motivated by a particular set of substantive questions arising out of communications theory. In particular, we are interested in the relative importance of mass communications, as practiced by media and other formal organizations, masspersonal communications as practiced by celebrities and prominent bloggers, and interpersonal communications, as practiced by ordinary individuals communicating with their friends. In addition, we are also interested in the relationships between these categories of users, motivated by theoretical arguments such as the theory of the two-step flow [6]. Rather than pursuing a strategy of automatic classification, therefore, our approach depends on defining and identifying certain predetermined classes of theoretical interest, where both approaches have advantages and disadvantages. In particular, we restrict our attention to four classes of what we call “elite” users: media, celebrities, organizations, and bloggers, as well as the relationships between these elite users and the much larger population of “ordinary” users.
In addition to these theoretically-imposed constraints, our proposed classification method must also satisfy a practical constraint, namely that the rate limits established by Twitter’s API effectively preclude crawling all lists for all Twitter users³. Thus we instead devised two different sampling schemes, a snowball sample and an activity sample, each with some advantages and disadvantages, discussed below.
² http://dev.twitter.com/doc/get/statuses/firehose
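The rate-limit constraint can be checked with a back-of-the-envelope calculation. The sketch below uses the figures given in the accompanying footnote (20K API calls per hour, at most 20 lists per call, roughly 40M users, at most 20 lists per user); the function name `crawl_hours` and its parameters are hypothetical and introduced only for illustration.

```python
import math


def crawl_hours(num_users, lists_per_user, lists_per_call, calls_per_hour):
    """Estimate the hours needed to fetch the list memberships of every user.

    Assumes each user needs ceil(lists_per_user / lists_per_call) API calls
    and that a single account makes all the calls.
    """
    calls_per_user = math.ceil(lists_per_user / lists_per_call)
    total_calls = num_users * calls_per_user
    return total_calls / calls_per_hour


# Figures from the footnote: 40M users, at most 20 lists each,
# 20 lists per call, 20K calls per hour.
hours = crawl_hours(40_000_000, 20, 20, 20_000)
weeks = hours / (24 * 7)
print(f"{hours:.0f} hours, about {weeks:.1f} weeks")

# A single heavily listed user (nearly 140,000 lists, per the footnote)
# would need many calls on their own.
heavy_user_calls = math.ceil(140_000 / 20)
print(heavy_user_calls)
```

The estimate is a lower bound under these assumptions, since it caps every user at 20 lists; heavily listed accounts push the real cost higher, which is why a sampling scheme is needed.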
3.3.1 Snowball sample of Twitter Lists
The first method for identifying elite users employed snowball sampling. For each category, we chose a number $u_0$ of seed users that were highly representative of the desired category and appeared on many category-related lists. For each of the four categories above, the following seeds were chosen:
• Celebrities: Barack Obama, Lady Gaga, Paris Hilton
• Media: CNN, New York Times
• Organizations: Amnesty International, World Wildlife Foundation, Yahoo! Inc., Whole Foods
• Blogs⁴: BoingBoing, FamousBloggers, problogger, mashable, Chrisbrogan, virtuosoblogger, Gizmodo, Ileane, dragonblogger, bbrian017, hishaman, copyblogger, engadget, danielscocco, BlazingMinds, bloggersblog, TycoonBlogger, shoemoney, wchingya, extremejohn, GrowMap, kikolani, smartbloggerz, Element321, brandonacox, remarkablogger, jsinkeywest, seosmarty, NotAProBlog, kbloemendaal, JimiJones, ditesco
After reviewing the lists associated with these seeds, the following keywords were hand-selected based on (a) their representativeness of the desired categories; and (b) their lack of overlap between categories:
³ The Twitter API allows only 20K calls per hour, where at most 20 lists can be retrieved for each API call. Under the modest assumption of 40M users (roughly the number in the 2009 crawl by [8]), where each user is included on at most 20 lists, this would require $4 \times 10^{7} / (2 \times 10^{4}) = 2{,}000$ hours, or more than 11 weeks. Clearly this time could be reduced by deploying multiple accounts, but it also likely underestimates the real time quite significantly, as many users appear on many more than 20 lists (e.g. Lady Gaga appears on nearly 140,000).
⁴ The blogger category required many more seeds because bloggers are in general lower profile than the seeds for the other categories.
