on microblogging data, and the ﬁrst one that speciﬁcally examinesdifferences across different countries.The rest of the paper is organized as follows. In Section 2 wegive a brief overview of related work and in Section 3 we describeour data set. In Section 4 we describe the distribution of languagesused in each of the ten most active countries in our data set, andthe main ﬁndings of our analysis on the level of happiness in eachcountry. Section 5 focuses on the content of the tweets and network structure, and we conclude by summarizing our main ﬁndings insections 6 and 7.
2. RELATED WORK
The structure of social networks has been studied extensively be-cause structure is strongly related to the detection of communitiesand to how information propagates. Mislove et al. , for exam-ple, studied basic characteristics of the structure of Flickr, Orkut,LiveJournal, and YouTube, and found Power-law, small-world, andscale free properties. The authors argue that the ﬁndings are usefulin informing the design of social network-based systems. Kwak etal. , examined the Twitter network aiming to determine it’s ba-sic characteristics. One of their main ﬁndings is that Twitter doesnot properly exhibit a “traditional" social network structure sinceit lacks reciprocity (only 22% of all connections on Twitter werefound to be reciprocal), so it behaves more like news media, fa-cilitating quick propagation of news. Java et al. , on the otherhand, studied the topological and geographical properties of Twit-ter’s social network and observed that there is high reciprocity andthe tendency for users to participate in communities of commoninterest, and to share personal information. Onnela et al. ,present a study on a large-scale network of mobile calls and textmessages. They found no relationship between topological central-ity and physical centrality of nodes in the network, and examineddifferences among big and small communities.One of the key questions relating to communities and network structureisinﬂuence. DeChoudhuryetal. examineTwitterdataand study how different sampling methods can inﬂuence the levelof diffusion of information. They found that sampling techniquesincorporating context (activity or location) and topology have bet-ter diffusion than if only context or topology are considered. Theyalso observed the presence of homophily, showing that users gettogether with “similar" users, but that the diffusion of tweets alsodepends on topics. Cha et al.  studied the in-degree and out-degree of the Twitter network and observed that inﬂuence is in factnot related to the number of followers, but that having active follo-wers who retweet or mention the user is more important.
3. DATASET DESCRIPTION
Twitter is a platform which allows users to choose between keep-ing their proﬁles and activity (
) public or private. Users withprivate proﬁles make their information available only to a selectedgroup of friends. For obvious reasons, we limit our research onlyto information provided by users with public proﬁles
.The focus of our research is mostly on characterizing large on-linesocialnetworks, basedonusergeographicallocation, forwhichTwitterprovideslimitedinformation. Therefore, weperformanini-tial ﬁlter of users based on activity and proﬁle information. First,we choose users which we determine to be
. For this we ex-amine a 10-day continuous time window of user activity, selectingday-1 randomly from the year 2010. Then, we consider
All processing was anonymous and aggregated. No personal userinformation was usedactive, users which generated tweets during this time frame. Sec-ondly, weﬁlteredtheresultinguserstokeeponlyactiveuserswhichhad also entered a
into their proﬁles during this sametime period. We considered as a valid location any text which couldbe parsed correctly into latitude and longitude (using the Yahoo!public PlaceMakerAPI
). Itshould benoted, thatwe performed ba-sically a static analysis, so we did not consider user mobility duringthis period. We did not process location information which was au-tomatically generated for the user with a GPS device on their clientapplication, based on the fact that location changes continuouslywith the user. Since in this work we are interested more in charac-terizing geographical communities of users, we decided to use thelocation which reﬂects more accurately the user’s
.Using this criteria, we obtained a set of
active userswith valid location information, which were divided into
diffe-rent countries. For the rest of our analysis, we selected the Top-
countries with more activity and gathered all of the tweets gener-ated by these users for the entire duration of 2010. In total ourworking dataset consisted of
of the initial10-day user sample), and
tweets. Figure 1 showsthe distribution of the users in our dataset into the top-10 countries,and the activity that they generated for 2010. Note that the amountof activity registered for each country is not necessarily propor-tional to the number of users. This is explicitly shown in Figure 2,which displays the tweet/user ratio for each country. This ratio isindependent of the number of users in each local network.
4. LANGUAGES AND SENTIMENT
To analyze the language in which tweets are writ-ten, we classify each tweet using proprietary software. As a result,99.05% of the tweets were classiﬁed into 69 languages. The 10most popular languages are shown in Figure 3. English is the mostpopular language, and it corresponds to nearly 53% of the tweets.Additionally, Figure 4 shows the three most common languages foreach of the top-10 countries, as well as the percentage of tweetswhich correspond to these languages. It is worth noting that En-glish is one of the three most frequently used languages for thesecountries, and for the Netherlands, Indonesia, and Mexico morethan 10% of tweets are in English, while for Brazil it is 9%. Addi-tionally, a special consideration should be taken for the languagesof Italian and Catalan, which appear in Figure 3 and Figure 4. Thisis strange ﬁnding given the fact that Italy is not considered in thetop-10 countries of our study and the number of people who speak Catalan world-wide is very small. By sampling the tweets for Cata-lan and Italian we ﬁnd that many of them correspond to false posi-tives given by our classiﬁer, since they actually correspond to Por-tuguese and Spanish. The high resemblance of these languages,in addition to the common use of slang, along with misspellings,makes automatic language identiﬁcation particularly challenging.
We also analyzed the sentiment componentof tweets, for this we use the measure of
as coined byDodds et al. , which is also more commonly referred to as
. This value represents the psychological reaction which hu-mans have to a speciﬁc word, according to a scale which rangesfrom “happy” to “unhappy”. In particular, we analyze the happi-ness levels for each of the top-10 countries, considering only tweetsclassiﬁedasEnglishandSpanish. Toachievethis, weusedthe1999Affective Norms for English Words (ANEW) list by Bradley andLang  for English tweets, and for Spanish, we used its adapta-tion by Redondo et al. . The ANEW list contains 1,034 words