of in-degree (# followers) and out-degree (# “friends”, Twitter notation for how many others a user follows); however, the in-degree distribution is even more skewed than the out-degree distribution. In both the friend and follower distributions, for example, the median is less than 100, but the maximum # friends is several hundred thousand, while a small number of users have millions of followers. In addition, the follower graph is also characterized by extremely low reciprocity (roughly 20%); in particular, the most-followed individuals typically do not follow many others. The Twitter follower graph, in other words, does not conform to the usual characteristics of social networks, which exhibit much higher reciprocity and far less skewed degree distributions, but instead more closely resembles the mixture of one-way mass communications and reciprocated interpersonal communications described above.
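To make the degree and reciprocity statistics concrete, the following minimal sketch computes them for a toy directed follower graph. The edge list and the function name are hypothetical, and reciprocity is taken here to be the fraction of directed edges whose reverse edge also exists; this is one common definition and may not be exactly the one behind the 20% figure.

```python
from collections import Counter

def degree_and_reciprocity(edges):
    """edges: iterable of (follower, followed) pairs (hypothetical input format)."""
    edge_set = set(edges)
    out_degree = Counter(src for src, _ in edge_set)  # number of "friends"
    in_degree = Counter(dst for _, dst in edge_set)   # number of followers
    # An edge is reciprocated when the reverse edge is also present.
    reciprocated = sum(1 for src, dst in edge_set if (dst, src) in edge_set)
    reciprocity = reciprocated / len(edge_set) if edge_set else 0.0
    return in_degree, out_degree, reciprocity

# Toy data: two ordinary users follow each other; everyone follows "celebrity".
toy_edges = [
    ("alice", "bob"), ("bob", "alice"),
    ("alice", "celebrity"), ("bob", "celebrity"), ("carol", "celebrity"),
]
in_deg, out_deg, reciprocity = degree_and_reciprocity(toy_edges)
print(in_deg["celebrity"], out_deg["celebrity"], reciprocity)
```

In this toy graph the most-followed account follows no one, so its edges are all unreciprocated, which is the pattern the text describes for highly followed users.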
3.2 Twitter Firehose
In addition to the follower graph, we are interested in the content being shared on Twitter, particularly URLs, and so we examined the corpus of all 5B tweets generated over a 223-day period from July 28, 2009 to March 8, 2010 using data from the Twitter “firehose,” the complete stream of all tweets. Because our objective is to understand the flow of information, it is useful for us to restrict attention to tweets containing URLs, for two reasons. First, URLs add easily identifiable tags to individual tweets, allowing us to observe when a particular piece of content is either retweeted or subsequently reintroduced by another user. And second, because URLs point to online content outside of Twitter, they provide a much richer source of variation than is possible in the typical 140 character tweet. Finally, we note that almost all URLs broadcast on Twitter have been shortened using one of a number of URL shorteners, of which the most popular is http://bit.ly/. From the total of 5B tweets recorded during our observation period, therefore, we focus our attention on the subset of 260M containing bit.ly URLs.
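As an illustration of the restriction to bit.ly URLs, the sketch below filters a small set of tweets and groups users by the URL they posted. The tweet record layout (`user` and `text` keys), the regular expression, and the function names are assumptions made for this example; they do not describe the actual firehose format or the processing pipeline used in the study.

```python
import re
from collections import defaultdict

# Assumed pattern for a shortened link: http(s)://bit.ly/<token>
BITLY_URL = re.compile(r"https?://bit\.ly/[A-Za-z0-9_-]+")

def extract_bitly_urls(text):
    """Return every bit.ly URL found in a tweet's text."""
    return BITLY_URL.findall(text)

def filter_bitly_tweets(tweets):
    """Yield (tweet, urls) for tweets that contain at least one bit.ly URL."""
    for tweet in tweets:
        urls = extract_bitly_urls(tweet["text"])
        if urls:
            yield tweet, urls

sample = [
    {"user": "a", "text": "Great read http://bit.ly/abc123"},
    {"user": "b", "text": "Good morning, world"},
    {"user": "c", "text": "RT @a: Great read http://bit.ly/abc123"},
]
kept = list(filter_bitly_tweets(sample))

# The URL acts as a tag: group the users who posted each one.
users_by_url = defaultdict(list)
for tweet, urls in kept:
    for url in urls:
        users_by_url[url].append(tweet["user"])

print(len(kept), dict(users_by_url))
```

Grouping by URL is what makes it possible to see the same piece of content being retweeted or reintroduced by different users.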
3.3 Twitter Lists
Our method for classifying users exploits a relatively recent feature of Twitter: Twitter Lists. Since its launch on November 2, 2009, Twitter Lists has been welcomed by the community as a way to group people and organize one’s incoming stream of tweets by specific sets of users. To create a Twitter List, a user needs to provide a name (required) and description (optional) for the list, and decide whether the new list is public (anyone can view and subscribe to this list) or private (only the list creator can view or subscribe to this list). Once a list is created, the user can add/edit/delete list members. As the purpose of Twitter Lists is to help users organize users they follow, the name of the list can be considered a meaningful label for the listed users. List creation therefore effectively applies the “wisdom of crowds” to the task of classifying users, both in terms of their importance to the community (number of lists on which they appear), and also how they are perceived (e.g. news organization vs. celebrity, etc.).

Before describing our methods for classifying users in terms of the lists on which they appear, we emphasize that we are motivated by a particular set of substantive questions arising out of communications theory. In particular, we are interested in the relative importance of mass communications, as practiced by media and other formal organizations, masspersonal communications as practiced by celebrities and prominent bloggers, and interpersonal communications, as practiced by ordinary individuals communicating with their friends. In addition, we are also interested in the relationships between these categories of users, motivated by theoretical arguments such as the theory of the two-step flow. Rather than pursuing a strategy of automatic classification, therefore, our approach depends on defining and identifying certain predetermined classes of theoretical interest; both approaches have advantages and disadvantages. In particular, we restrict our attention to four classes of what we call “elite” users: media, celebrities, organizations, and bloggers, as well as the relationships between these elite users and the much larger population of “ordinary” users.

In addition to these theoretically-imposed constraints, our proposed classification method must also satisfy a practical constraint, namely that the rate limits established by Twitter’s API effectively preclude crawling all lists for all Twitter users. Thus we instead devised two different sampling schemes, a snowball sample and an activity sample, each with some advantages and disadvantages, discussed below.

Twitter firehose: http://dev.twitter.com/doc/get/statuses/firehose
3.3.1 Snowball sample of Twitter Lists
The first method for identifying elite users employed snowball sampling. For each category, we chose a number of seed users that were highly representative of the desired category and appeared on many category-related lists. For each of the four categories above, the following seeds were chosen:

Celebrities: Barack Obama, Lady Gaga, Paris Hilton
Media: CNN, New York Times
Organizations: Amnesty International, World Wildlife Foundation, Yahoo! Inc., Whole Foods
Bloggers: BoingBoing, FamousBloggers, problogger, mashable, Chrisbrogan, virtuosoblogger, Gizmodo, Ileane, dragonblogger, bbrian017, hishaman, copyblogger, engadget, danielscocco, BlazingMinds, bloggersblog, TycoonBlogger, shoemoney, wchingya, extremejohn, GrowMap, kikolani, smartbloggerz, Element321, brandonacox, remarkablogger, jsinkeywest, seosmarty, NotAProBlog, kbloemendaal, JimiJones, ditesco

After reviewing the lists associated with these seeds, the following keywords were hand-selected based on (a) their representativeness of the desired categories; and (b) their lack of overlap between categories:
The Twitter API allows only 20K calls per hour, where at most 20 lists can be retrieved for each API call. Under the modest assumption of 40M users (roughly the number in the 2009 crawl), where each user is included on at most 20 lists, this would require $4 \times 10^{7} / (2 \times 10^{4}) = 2{,}000$ hours, or 11 weeks. Clearly this time could be reduced by deploying multiple accounts, but it also likely underestimates the real time quite significantly, as many users appear on many more than 20 lists (e.g. Lady Gaga appears on nearly 140,000).
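The arithmetic behind this estimate can be checked with a short calculation. The sketch below uses only the figures quoted in the footnote (20K calls per hour, 20 lists per call, 40M users); the constant and function names are hypothetical, and the assumption that every user costs at least one call is ours.

```python
import math

CALLS_PER_HOUR = 20_000   # API calls allowed per hour
LISTS_PER_CALL = 20       # maximum lists returned by one call
NUM_USERS = 40_000_000    # assumed user population

def calls_for_user(num_lists):
    """API calls needed to page through one user's lists (assumed to be at least one)."""
    return max(1, math.ceil(num_lists / LISTS_PER_CALL))

def crawl_hours(total_calls):
    """Hours needed to issue total_calls at the hourly rate limit."""
    return total_calls / CALLS_PER_HOUR

# Baseline: every user appears on at most 20 lists, so one call per user suffices.
baseline_calls = NUM_USERS * calls_for_user(20)
baseline_hours = crawl_hours(baseline_calls)
baseline_weeks = baseline_hours / (24 * 7)

# A single heavily listed user (about 140,000 lists) needs many calls alone.
heavy_user_calls = calls_for_user(140_000)

print(baseline_hours, round(baseline_weeks, 1), heavy_user_calls)
```

Under these assumptions the baseline crawl takes 2,000 hours, which is just under 12 weeks (11 full weeks), and one user listed about 140,000 times would by itself require 7,000 calls, which shows why the baseline is an underestimate.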
The blogger category required many more seeds because bloggers are in general lower profile than the seeds for the other categories.