network characterized by highly skewed distributions both of in-degree (# followers) and out-degree (# “friends”, Twitter nomenclature for how many others a user follows); however, the out-degree distribution is even more skewed than the in-degree distribution. In both friend and follower distributions, for example, the median is less than 100, but the maximum # friends is several hundred thousand, while a small number of users have millions of followers. In addition, the follower graph is also characterized by extremely low reciprocity (roughly 20%); in particular, the most-followed individuals typically do not follow many others. The Twitter follower graph, in other words, does not conform to the usual characteristics of social networks, which exhibit much higher reciprocity and far less skewed degree distributions, but instead resembles more the mixture of one-way mass communications and reciprocated interpersonal communications described above.
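To make the quantities in this paragraph concrete, the following minimal sketch computes in-degree and out-degree summaries and reciprocity from a directed edge list. It rests on assumptions not stated in the text: edges are (follower, followee) pairs, and reciprocity is taken to be the fraction of directed edges whose reverse edge also exists. The function name `degree_stats` and the toy data are illustrative only.

```python
from collections import Counter
from statistics import median


def degree_stats(edges):
    """Summarize a directed follower graph.

    edges: iterable of (follower, followee) pairs (assumed representation).
    """
    edge_set = set(edges)
    if not edge_set:
        raise ValueError("edge list is empty")

    out_deg = Counter(src for src, _ in edge_set)  # number of "friends"
    in_deg = Counter(dst for _, dst in edge_set)   # number of followers
    users = set(out_deg) | set(in_deg)

    # Assumed definition: share of directed edges whose reverse edge exists.
    reciprocated = sum(1 for src, dst in edge_set if (dst, src) in edge_set)

    return {
        "median_in": median(in_deg[u] for u in users),
        "max_in": max(in_deg[u] for u in users),
        "median_out": median(out_deg[u] for u in users),
        "max_out": max(out_deg[u] for u in users),
        "reciprocity": reciprocated / len(edge_set),
    }


# Toy graph: one heavily followed account that follows nobody back,
# plus a single reciprocated pair.
toy_edges = [
    ("a", "star"), ("b", "star"), ("c", "star"), ("d", "star"),
    ("a", "b"), ("b", "a"),
]
stats = degree_stats(toy_edges)
```

Even in this toy graph the maximum in-degree sits well above the median, and most edges are unreciprocated, which is the qualitative pattern described above.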
3.2 Twitter Firehose
In addition to the follower graph, we are interested in the content being shared on Twitter, and so we examined the corpus of all 5B tweets generated over a 223 day period from July 28, 2009 to March 8, 2010 using data from the Twitter “firehose,” the complete stream of all tweets. Because our objective is to understand the flow of information, it is useful for us to restrict attention to tweets containing URLs, for two reasons. First, URLs add easily identifiable tags to individual tweets, allowing us to observe when a particular piece of content is either retweeted or subsequently reintroduced by another user. And second, because URLs point to online content outside of Twitter, they provide a much richer source of variation than is possible in the typical 140-character tweet. Finally, we note that almost all URLs broadcast on Twitter have been shortened using one of a number of URL shorteners, of which the most popular is http://bit.ly/. From the total of 5B tweets recorded during our observation period, therefore, we focus our attention on the subset of 260M containing bit.ly URLs; thus all subsequent counts are implicitly understood to be restricted to this content.
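The restriction to tweets containing bit.ly URLs can be illustrated with a short filter. This is a sketch under assumptions, not the pipeline used for the firehose data: tweets are represented as (tweet_id, text) pairs, and the regular expression is a plausible pattern for bit.ly short links rather than an official specification of them.

```python
import re

# Assumed pattern for bit.ly short links; real links may vary.
BITLY_RE = re.compile(r"https?://bit\.ly/[A-Za-z0-9_-]+")


def extract_bitly_urls(text):
    """Return all bit.ly URLs found in a tweet's text."""
    return BITLY_RE.findall(text)


def filter_bitly_tweets(tweets):
    """Keep only tweets with at least one bit.ly URL.

    tweets: iterable of (tweet_id, text) pairs (assumed representation).
    Returns a list of (tweet_id, [urls]) pairs.
    """
    kept = []
    for tweet_id, text in tweets:
        urls = extract_bitly_urls(text)
        if urls:
            kept.append((tweet_id, urls))
    return kept


sample = [
    (1, "Great read http://bit.ly/abc123"),
    (2, "No link here"),
    (3, "RT @user: Great read http://bit.ly/abc123 and http://example.com/x"),
]
kept = filter_bitly_tweets(sample)
```

Because the shortened URL acts as a tag, tweets 1 and 3 in the sample can be recognized as carrying the same piece of content.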
3.3 Twitter Lists
Our method for classifying users exploits a relatively recent feature of Twitter: Twitter Lists. Since the feature launched on November 2, 2009, Twitter Lists have been used extensively to group sets of users into topical or other categories, and thereby to better organize and/or filter incoming tweets. To create a Twitter List, a user provides a name (required) and description (optional) for the list, and decides whether the new list is public (anyone can view and subscribe to this list) or private (only the list creator can view or subscribe to this list). Once a list is created, the user can add/edit/delete list members. As the purpose of Twitter Lists is to help users organize the users they follow, the name of the list can be considered a meaningful label for the listed users. The classification of users can therefore effectively exploit the “wisdom of crowds” with these created lists, both in terms of users' importance to the community (number of lists on which they appear), and in terms of how they are perceived (e.g. news organization vs. celebrity).

Footnote (Section 3.2, on the restriction to bit.ly URLs): Naturally, this restriction also has downsides, in particular that some users may be more likely to include URLs in their tweets than others, and thus will appear to be relatively more active and/or have more impact than if we were instead to consider all tweets. For our purposes, however, we believe that the practical advantages of the restriction outweigh the potential for bias.

Before describing our methods for classifying users in terms of the lists on which they appear, we emphasize that we are motivated by a particular set of substantive questions arising out of communications theory. In particular, we are interested in the relative importance of mass communications, as practiced by media and other formal organizations; masspersonal communications, as practiced by celebrities and prominent bloggers; and interpersonal communications, as practiced by ordinary individuals communicating with their friends. In addition, we are interested in the relationships between these categories of users, motivated by theoretical arguments such as the theory of the two-step flow. Rather than pursuing a strategy of automatic classification, therefore, our approach depends on defining and identifying certain predetermined classes of theoretical interest; each approach has advantages and disadvantages. In particular, we restrict our attention to four classes of what we call “elite” users: media, celebrities, organizations, and bloggers, as well as the relationships between these elite users and the much larger population of “ordinary” users.

Analytically, our approach has some disadvantages. In particular, by determining the categories of interest in advance, we reduce the possibility of discovering unanticipated categories that may be of equal or greater relevance than those we selected. Thus although we believe that for our particular purposes, the advantages of our approach (namely conceptual clarity and ease of interpretation) outweigh the disadvantages, automated classification methods remain an interesting topic for future work. Finally, in addition to these theoretically-imposed constraints, our proposed classification method must also satisfy a practical constraint, namely that the rate limits established by Twitter's API effectively preclude crawling all lists for all Twitter users. Thus we instead devised two different sampling schemes, a snowball sample and an activity sample, each with some advantages and disadvantages, discussed below.
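The idea that list names act as crowd-sourced labels can be sketched as a simple tally. The example below is illustrative only and is not the classification procedure of this paper: the keyword sets in `CATEGORY_KEYWORDS`, the (list_name, user) input format, and the function name are all hypothetical. For each user it counts the number of lists on which the user appears (a proxy for importance) and how many of those list names match each predetermined category (a proxy for how the user is perceived).

```python
from collections import Counter, defaultdict

# Hypothetical keyword sets, chosen for illustration; not taken from the paper.
CATEGORY_KEYWORDS = {
    "media": {"news", "media"},
    "celebrities": {"celebs", "celebrity", "celebrities", "stars"},
    "organizations": {"org", "orgs", "organizations", "nonprofit"},
    "bloggers": {"blog", "blogs", "bloggers"},
}


def tally_list_labels(memberships):
    """Count list appearances and category-matching list names per user.

    memberships: iterable of (list_name, user) pairs (assumed representation).
    Returns (list_counts, label_counts).
    """
    list_counts = Counter()              # user -> number of lists
    label_counts = defaultdict(Counter)  # user -> category -> matching lists
    for list_name, user in memberships:
        list_counts[user] += 1
        # Split the list name into lowercase word tokens.
        normalized = list_name.lower().replace("-", " ").replace("_", " ")
        tokens = set(normalized.split())
        for category, keywords in CATEGORY_KEYWORDS.items():
            if tokens & keywords:
                label_counts[user][category] += 1
    return list_counts, label_counts


memberships = [
    ("world-news", "cnn"),
    ("breaking news", "cnn"),
    ("my-favs", "cnn"),
    ("celebs", "ladygaga"),
    ("pop-stars", "ladygaga"),
]
list_counts, label_counts = tally_list_labels(memberships)
```

Fixing the categories in advance, as in this sketch, keeps the output easy to interpret but cannot surface categories that were not anticipated, which is the trade-off discussed above.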
3.3.1 Snowball sample of Twitter Lists
The first method for identifying elite users employed snowball sampling. For each category, we chose a number of seed users that were highly representative of the desired category and appeared on many category-related lists. For each of the four categories above, the following seeds were chosen:
Celebrities: Barack Obama, Lady Gaga, Paris Hilton
Media: CNN, New York Times
Organizations: Amnesty International, World Wildlife Foundation, Yahoo! Inc., Whole Foods
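As a generic illustration of snowball sampling from seeds, the sketch below alternates between collecting the lists that contain the current users and collecting the members of those lists. It is a minimal sketch under stated assumptions and should not be read as the exact procedure used here: the two lookup tables (`lists_containing`, `members_of`) stand in for API calls, the `rounds` parameter is hypothetical, and the toy data is invented for the example.

```python
def snowball_sample(seeds, lists_containing, members_of, rounds=2):
    """Expand a seed set over a bipartite user-list graph.

    seeds: iterable of seed user names.
    lists_containing: dict user -> set of list ids including that user
        (hypothetical lookup standing in for an API call).
    members_of: dict list id -> set of users on that list
        (hypothetical lookup standing in for an API call).
    rounds: number of user -> lists -> users expansion steps.
    Returns (sampled_users, sampled_lists).
    """
    sampled_users = set(seeds)
    sampled_lists = set()
    frontier = set(seeds)
    for _ in range(rounds):
        # Step 1: lists on which the frontier users appear.
        new_lists = set()
        for user in frontier:
            new_lists |= lists_containing.get(user, set())
        new_lists -= sampled_lists
        sampled_lists |= new_lists

        # Step 2: users appearing on those newly found lists.
        new_users = set()
        for list_id in new_lists:
            new_users |= members_of.get(list_id, set())
        new_users -= sampled_users
        sampled_users |= new_users

        frontier = new_users
        if not frontier:
            break
    return sampled_users, sampled_lists


# Invented toy data for illustration.
lists_containing = {
    "CNN": {"news"},
    "nytimes": {"news", "papers"},
    "guardian": {"papers"},
}
members_of = {
    "news": {"CNN", "nytimes"},
    "papers": {"nytimes", "guardian"},
}
users, lists = snowball_sample(["CNN"], lists_containing, members_of, rounds=2)
```

Starting from the single seed, the first round reaches `nytimes` through the shared list, and the second round reaches `guardian` through a list the seed itself is not on.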
Footnote (on the rate limits of Twitter's API): The Twitter API allows only 20K calls per hour, where at most 20 lists can be retrieved for each API call. Under the modest assumption of 40M users, where each user is included on at most 20 lists, this would require roughly 11 weeks. Clearly this time could be reduced by deploying multiple accounts, but it also likely underestimates the real time quite significantly, as many users appear on many more than 20 lists (e.g. Lady Gaga appears on nearly 140,000).
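The crawl-time estimate above can be reproduced with a few lines of arithmetic. The sketch assumes one API call per user when a user appears on at most 20 lists, and at least one call per user even when the user appears on no lists; the function and constant names are illustrative.

```python
import math

CALLS_PER_HOUR = 20_000   # rate limit quoted in the text
LISTS_PER_CALL = 20       # lists retrievable per API call
NUM_USERS = 40_000_000    # assumed number of users
LISTS_PER_USER = 20       # assumed upper bound on lists per user


def crawl_time_weeks(num_users, lists_per_user,
                     calls_per_hour=CALLS_PER_HOUR,
                     lists_per_call=LISTS_PER_CALL):
    """Estimate weeks needed to crawl the lists of every user."""
    # Assumption: at least one call per user, more if lists span several pages.
    calls_per_user = max(1, math.ceil(lists_per_user / lists_per_call))
    total_calls = num_users * calls_per_user
    hours = total_calls / calls_per_hour
    return hours / (24 * 7)


weeks = crawl_time_weeks(NUM_USERS, LISTS_PER_USER)

# A single heavily listed user needs many calls on their own.
calls_for_heavily_listed_user = math.ceil(140_000 / LISTS_PER_CALL)
```

Under these assumptions the crawl needs 40M calls, or 2,000 hours, which is just under 12 weeks and in line with the rough figure quoted above. A user on about 140,000 lists would alone require 7,000 calls, which is why the estimate is likely a lower bound.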