Happiness and the Patterns of Life: A Study of Geolocated Tweets

Computational Story Lab, Department of Mathematics and Statistics, Vermont Complex Systems Center, Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America ∗ chris.danforth@uvm.edu (corresponding author)

Morgan R. Frank, Lewis Mitchell, Peter S. Dodds, Christopher M. Danforth∗

The patterns of life exhibited by large populations have been described and modeled both as a basic science exercise and for a range of applied goals such as reducing automotive congestion, improving disaster response, and even predicting the location of individuals. However, these studies previously had limited access to conversation content, rendering changes in expression as a function of movement invisible. In addition, they typically use the communication between a mobile phone and its nearest antenna tower to infer position, limiting the spatial resolution of the data to the geographical region serviced by each cellphone tower. We use a collection of 37 million geolocated tweets to characterize the movement patterns of 180,000 individuals, taking advantage of several orders of magnitude of increased spatial accuracy relative to previous work. Employing the recently developed sentiment analysis instrument known as the hedonometer, we characterize changes in word usage as a function of movement, and find that expressed happiness increases logarithmically with distance from an individual’s average location. A proper characterization of human mobility patterns [1– 16] is an essential component in the development of models of urban planning [17], traffic forecasting [18], and the spread of diseases [19–21]. In the modern communication era, patterns of human movement have been revealed at an increasingly higher resolution in both space and time, with mobile phone data in particular complementing existing survey-based investigations. As is the case with each new instrument measuring macroscale sociotechnical phenomena, the task has become one of understanding what discernible patterns exist, and what meaning can be derived from those patterns [3, 22–24]. Scientists working to understand mobility have employed a diverse set of methodologies. Brockman et al. [7] used the circulation of nearly 1/2 million U.S. dollar bills whose locations were submitted by over 1 million visitors to a website [25] to demonstrate that bank note trajectories are superdiffusive in space and subdiffusive in time, i.e. moving farther and more frequently than expected. Gonzalez et al. [2] used 6 months of mobile phone data from 100,000 individuals to show that human trajectories are regular in space and time, with each individual having a high probability of returning to a few preferred locations according to Zipf’s law. Combining phone communication data with measures of community economic prosperity, Eagle et al. [3] showed that the diversity of contacts in an individual’s social network is strongly correlated to the potential for economic development exhibited by their community. Exemplifying recent work to characterize sentiment with social network communications, Mitchell et al. [26] combined 1

traditional survey data (e.g., Gallup) with millions of tweets to correlate word usage with the demographic characteristics of U.S. urban areas. Expressed happiness was shown, for example, to correlate strongly with percentage of the population married, and anti-correlate with obesity. Words such as ‘McDonalds’ and ‘hungry’ appeared far more frequently in obese cities, suggesting their instrument could be used to provide real-time feedback on social health programs such as the proposed ban on the sale of large sodas in New York City in 2013. In what follows, we characterize the pattern of life of over 180,000 individuals in the U.S. using messages sent via the social networking service Twitter, and employ our text-based hedonometer [27] to characterize sentiment as a function of movement. In the calendar year 2011, we collected roughly 4 billion messages through Twitter’s gardenhose feed, representing a random 10% of all status updates posted during this period. Along with an abundance of other metadata, location information typically accompanies each message, resulting from one of three mechanisms by which individuals can report their location when updating their status. First, when an individual registers their account with Twitter, they are presented with the opportunity to report their location in a free text box. This region will be displayed in their user profile (e.g. ‘NYC’ or ‘over the rainbow’). The metadata accompanying each tweet sent by the individual contains this self-reported location. Second, individuals submitting a message through a web browser can choose to tag their message with a ‘place’ chosen from a drop-down menu, where the first option provided is typically the city within which the computer’s IP address is found. For the purposes of accuracy, we have chosen to ignore each of these two mechanisms for reporting position when attempting to assign each tweet a geographical location, and focus instead on messages located via a third mechanism, namely the Global Positioning System (GPS). Individuals using an app provided by a mobile device may opt-in to geolocate their message, in which case the exact latitude and longitude of the mobile phone is reported. The accuracy of this information is governed by the precision of the GPS instrument embedded in the phone, which can vary depending on the surrounding topography. As a result of these factors, we are able to approximately place each geolocated message inside a 10 meter circle on the surface of the Earth, within which the tweet was sent. Approximately 1% of the status updates received through the gardenhose feed are geolocated, resulting in a total of 37 million messages, collectively representing more than 180,000 English-speaking people worldwide. Fig. 1 is representative of the geospatial resolution of the data. Results Following Gonz´ alez et al. [2], we examine the shape of human mobility using radius of gyration, hereafter gyradius, as a measure of the linear size occupied by an individual’s trajectory. In Fig. 2, we investigate the geographical distribution of movement in four urban areas by plotting a dot for each tweet, colored by the gyradius of its author. Clockwise from the top left, cities are displayed in order of their apparent aggregate

arXiv:1304.1296v1 [physics.soc-ph] 4 Apr 2013

2

Figure 1: Each point corresponds to a geolocated tweet posted in 2011. Twitter activity is most apparent in urban areas. Note that the image contains no cartographic borders, simply a small dot for each message. Legend: A (U.S.), B (Washington, D.C.), C (Los Angeles, C.A.), and D (Earth).

Figure 2: (Color online) The gyradius, calculated for each individual, is shown for each tweet authored in four example cities. Reflecting the pattern of urban life, we find messages authored by large radius individuals to be more likely to appear in the main downtown area of each city, while messages authored by small radius individuals tend to appear outside of the main urban area. Histograms of gyradii for each city are shown in Fig. S1, along with tweet locations colored by distance from expected location (Fig. S2). Note that higher resolution versions of the four panels above can be found online [28].

3

gyradius, with New York City seemingly exhibiting a smaller radius than the San Francisco Bay Area. Reflecting the pattern of urban life, we find messages authored by large radius individuals to be more likely to appear in the main downtown area of each city, while messages authored by small radius individuals tend to appear in less densely populated areas. For example, in Chicago, many individuals writing from downtown exhibit an order of magnitude greater radius than individuals posting in areas outside of the city. In the greater Los Angeles area, we see several clusters of individuals with larger radius in downtown Los Angeles, as well as Long Beach, Santa Monica, and Disneyland in Anaheim, while less densely populated areas are seen as smaller clusters exhibiting much smaller radii. The San Francisco Bay Area is clearly revealed by individuals with large radius, most notably in San Francisco, and somewhat less so in Oakland and San Jose. Outside of these cities, there are many suburban areas revealed by individuals with large radius, e.g. Palo Alto. Tweets appearing in less densely populated Bay Area locations are far more likely to be authored by large radius individuals than those appearing in lower population areas elsewhere. This observation surely reflects the socio-economic and demographic characteristics of individuals using Twitter in the Bay Area, where the social network service was founded. Additionally, it could reflect the presence of tourists who will typically have a larger radius than someone who lives and works in the Bay Area. The main observation apparent in Fig. 2, namely that individuals who move a lot tend to appear in areas of large population density, is somewhat counterintuitive. Given the apparent economies of scale offered by living in a densely populated area, one might expect to observe the inverse relationship, namely that people living in less densely populated areas travel further, by necessity, to their place of employment or grocery store, for example. Of course, these individuals with large radius could be tourists, or they could have a long commute. Looking at each point colored instead by distance from expected location (Fig. S2)) we still see more exaggerated segregation, with non-natives appearing predominantly in cities, and native individuals tweeting in the suburbs. Looking at 500 cities in the U.S., we find a moderate correlation between the mean gyradius and city land area (Pearson ρ = 0.24, p = 2 × 10−7 ); Fig. S3 and Table S1 show the top and bottom cities with respect to gyradii. To investigate the shape of human mobility, we normalize each individual’s trajectory to a common reference frame (see Methods). In Fig. 3, we plot a heat map of the probability density function of the normalized locations of all individuals. For the purposes of the discussion, we will refer to deviations from an individual’s expected location in the normalized reference frame as occurring in the directions north, south, east, and west. Several features of the map reveal interesting patterns of movement. First, the overall west-to-east teardrop shape of the contours demonstrates that people travel predominantly along their principle axis, namely heading west from the origin along y/σy = 0, with deviations in the orthogonal direction becoming shorter and less frequent as they move farther away from the 4

origin. Second, the appearance of two spatially distinct yellow regions separated by a less populated green region suggests that people spend the vast majority of their time near two locations. We refer to these locations as the work and home habitats [8], where the home habitat is centered on the dark red region roughly 1 standard deviation east of the origin, and the work habitat is centered approximately 2 standard deviations west of the origin. These locations highlight the bimodal distribution of principal axis corridor messages (Fig. 4A). Finally, a clear asymmetry is observed about the x/σx = 0 axis indicating the increasingly isotropic variation in movement surrounding the home habitat, as compared to the work habitat. We interpret this to be a reflection of the tendency to be more familiar with the surroundings of one’s home, and to explore these surroundings in a more social context (Fig. 4B). The symmetry observed when reflecting about the y/σy = 0axis is strong, demonstrating the remarkable consistency of the movement patterns revealed by the data. In an effort to characterize the temporal and spatial structure observed in Fig. 3, in Fig. 5 we examine locations frequently visited by the most prolific members of our data set, namely the roughly 300 individuals for whom we received at least 800 geolocated messages. We suspect that these individuals enabled the geolocating feature to be on by default for all messages, as implied by the roughly O(104 ) geolocated messages suggested by the gardenhose rate. The main figure shows the probability of tweeting from each habitat, with habitats ordered by rank, for each individual [8]. We find that (a) (a) P(Hi ) ∝ R(Hi )−1.3 which is approximately a Zipf distribution [29]. This finding indicates that regardless of the number of tweet habitats for a given individual, the majority of their messaging activity occurs in one of only a few habitats, with the probability decaying at a predictable rate. If the decay were Zipfian, an individual is approximately n-times as likely to tweet from their mode location than from their rank n location. With our slope being steeper, these probabilities fall at a faster rate with rank. Note that fitting the power law model to the leading 10 habitats, using only individuals who have at least 10 habitats, we also get a slope of −1.3. For roughly 95% of these individuals, each tweet has a greater than 10% chance of being authored from their mode location (Fig. 5B). Fig. 5C demonstrates each individual’s likelihood of authoring messages from their mode location (black curve) at different times of day throughout the week. A period2 cycle is observed for each day of the week. Maxima are seen in the morning (8-10am) and evening (10pm-midnight), and minima in the afternoon (2-4pm) and overnight (2-4am) hours. The peak in the morning is consistently higher than that in the evening, and the afternoon valley is consistently lower than the overnight valley. The cycle is somewhat less structured on the weekend. Also plotted are the probabilities of tweeting from locations other than the mode (red curve). While the shape is quite similar to the mode location probability, we do note that individuals tweeting at 2am are likely to be anywhere but home. In a study performed with cellphone tower data, Gonz´ alez

8 −2.5 6 −3 4 −3.5 2 −4 y/ y 0 −4.5 −2 −5 −4 −5.5 −6 −6 −8 −20 −15 −10 −5 x/ 0
x

5

10

Figure 3: (Color online) The probability density function of observing an individual in their normalized reference frame, where the origin corresponds to each individual’s expected location, and σy = 0 corresponds to their principle axis. This map shows the positions of over 37,000 individuals, each with more than 50 locations, in their intrinsic reference frame.

5

6 5 Log−10 Counts 4 3 2 1 0 −15
y

0.4

Slope=−0.071

A

0.35 0.3 0.25 0.2 0.15 0.1
−9 −3 0 x/ x 3 9 15
x

/

B
1 2 3 Log−10 Radius of Gyration (km) 4

0.05 0

y 30 x Figure 5: Looking at messages authored in the principle axis corridor, defined by | s | < 1000 , 15  s  15, we x y 30 y | < , we observe a clear separation Figure 4: Looking at messages authored in the principle axis corridor, defined by | σy 1000The distribution is skewed observe a clear separation between the most likely and second most likely position (A). left, between the most likely and second most likely position (A). The distribution is skewed left, with movement in a heading with movement in a heading opposite an individual’s work/home corridor observed to be highly unlikely. In addition, opposite annormalization, individual’s work/home corridor observed to much be highly unlikely. due toeast the normalization, we see that due to the we see that individuals are more likely In to addition, tweet slightly of their expected location individuals are much more likely to tweet slightly east of their expected location than slightly west. The isotropy ratio (B) meathan slightly west. The isotropy ratio (B) measures the change in the density’s shape as a function of gyradius, with sures the change in the density’s shape as a function of gyradius, with large radius individuals exhibiting a less circular pattern large radius individuals exhibiting a less circular pattern of life. Standard errors are plotted, but are only visible for the of life. Standard errors are plotted, but are only visible for the largest radius group. The isotropy ratio decays logarithmically largest radius group. The isotropy ratio decays logarithmically with radius. with radius.

et [2] found that people spend most smaller of their time inThe twoSan loasal. smaller clusters exhibiting much radii. cations, and a person’s probability of being found at a separate Francisco Bay Area is clearly revealed by individuals with location diminishes with visitation.and While our large radius, mostrapidly notably in rank San by Francisco, someinvestigation reveals a similar pattern, we find a larger differwhat less so in Oakland and San Jose. Outside of these ence inthere the probability an individual is tweeting the cities, are manythat suburban areas revealed byfrom individhome habitat than from the work habitat. We attribute these uals with large radius, e.g. Palo Alto. Tweets appearslight differences in our results to the different spatiotempoing in less densely populated Bay Area locations are far ral precision of location data, as well as differences in activimore likely to be large radius than ties represented byauthored the data. by Gonz´ alez et al. individuals determined each those appearing in lower population areas elsewhere. This individual’s location by continuously monitoring the nearest observation surely reflects socio-economic and democellphone tower whose rangethe they were within. As such, we graphicmore characteristics of individuals using Twitter in inthe receive precise location information, but only when dividuals performed the act of network tweeting.service was founded. Bay Area, where the social One major advantage of using Twitter data of to tourists study moveAdditionally, it could reflect the presence who ment is the additional source of information provided by the will typically have a larger radius than someone who lives messages themselves. Researchers using mobile phone data to and works in the Bay Area. Note that the relationship becharacterize mobility patterns do not have access to conversatween gyradius and average word happiness for each of tions occurring during the time period of interest. To measure these cities is presented in the Appendix, and higher resthe sentiment associated with different patterns of movement, olution versions of the four panels above can be found we use2the hedonometer introduced by Dodds et al. in [27]. online . The instrument performs a context-free measurement of the The main observation apparent Figure 2, namely happiness of a large collection of words in using the language asthat individuals who move a lot tend to appear in areas sessment by Mechanical Turk (labMT) word list, as described of [27, large population density, roughly is somewhat counterintuitive. in 30]. LabMT comprises 10,000 of the most freGiven the apparent economies oflanguage, scale offered living quently used words in the English each by of which was forpopulated happiness area, on a scale 1 (sad) to 9 (happy) by in a scored densely one of might expect to observe people using Amazon’s Mechanical Turk service [31, 32], re2 http://www.uvm.edu/storylab/share/papers/ sulting in an average happiness score for each word. Example word scores are shown in Table 1. frank2013a

Toinverse examine the relationship between movement and happithe relationship, namely that people living in less ness, we calculate expressed happiness as a function of disdensely populated areas travel further, by necessity, to tance from an individual’s expected location, as well as gytheir place of employment or grocery store, for example. radius. For the former, we grouped ten could equally Of course, these individuals withtweets large into radius be populated bins, with each group containing more than 500,000 tourists, or they could have a long commute. Looking at tweets from similar distances. The happiness of each group each point colored instead by distance from expected lowas then computed using Eqn 3, where all words written from cation (Appendix), we still see more exaggerated segregaa given distance were gathered into a single bin. For the latter, tion, withindividuals non-natives appearing predominantly in gyracities, we placed into ten equally sized groups by and native individuals tweeting in the suburbs. dius, with each group containing more than 10,000 individuals In Figure 3, we plot the mean gyradius vs land area with similar gyradii. for each city in the U.S. as happiness defined by [30]. the While there Fig. 6 plots average word against distance from expected location (A), and gyradius (B). we Starting with lo- a is considerable scatter among the cities, do observe cation, we find that tweets written close to an individual’s weak correlation indicating that individuals living in cenmore ter of mass are slightly happier than those written 1km away. populated and larger areas travel farther. Table 3 shows The happy words,cities on average, are used atmean a distance repthe least top and bottom with respect to gyradius, resentative of a short daily commute to work. Beyond this as well as the four cities discussed above. least happy distance, remarkably we find that happiness increases logarithmically with distance expected location. In Figure 4, we plot a heat mapfrom of the probability denPerhaps even more remarkably, we find an almost sity function of the normalized locations of all identical individutrend when grouping together individuals rather than tweets, als. For the purposes of the discussion, we will refer to observing that happiness also increases logarithmically with deviations from an individual’s expected location in the gyradius. Individuals with a large radius use happier words normalized reference as occurring in the than those with a smallerframe pattern of life. We find the directions trend obnorth, south, east, and west. served in Fig. 6 holds for 3 of the 4 urban areas (Los Angeles, features of the map reveal patterns SanSeveral Francisco, and Chicago), see Figs. S4,interesting S5. of movement. First, the west-to-east teardrop To explain the difference in overall expressed happiness exhibited shape of the contours demonstrates that people by different mobility groups, we turn to word shift travel graphsprein Fig. 7. Word shift graphs were introduced by Dodds and dominantly along their principle axis, namely heading

6

8

0
100

Slope =−1.269

−0.5
Frequency

80

60

B

Log−10 Probability

−1

40

20

−1.5

0 −2

−1.5 −1 −0.5 Log−10 Probability Rank 1 Tweet Hubs

0

−2

−2.5

A
1 2
−3

−3

5

Rank

10

20

30

Mode Figure 6: Representing the approximately 300 individuals for whom we have at least 800 geolocated messages, we Other 9 plot the probability of tweeting from a habitat as a function of the tweet habitat rank (A). Each dot represents a single C individual’s likelihood of tweeting from one of their habitats. The axes are logarithmic, revealing an approximate 8 Zipfian distribution with slope -1.3 [43]. (B) Distribution of the rank-1 habitat, each individual’s mode location. Probability 7 6 5

10

x 10

west from the origin along y/sy = 0, with deviations in the orthogonal direction becoming shorter and less frequent as they move farther away from the origin. 4 Second, the appearance of two spatially distinct yellow rank radius (km) city 3 regions separated by a less populated green region sug1 200.6 Martinsville, VA gests that people spend the vast majority of their time near 2 124.5 2 Middletown, OH M T W Th S Su two locations.F We refer to these locations as the work and Day 3 112.3 Elkhart, IN home habitats, where the home habitat is centered on the 4 98.8 Pottstown, PA dark red region roughly 1 standard deviation east of the 5 96.6 Decatur, Figure Above Figure 5: Representing the approximately 300IL individuals for5: whom weand have at least 800 geolocated messages, we plot the 2 origin, the work habitat is centered approximately ··· ·· probability of · tweeting from a habitat as· · a· function of the tweet habitat rank (A). Each dot represents a single individual’s standard deviations west of the origin. These locations 215 13.3 from one New York City, The NY axes are logarithmic, likelihood of tweeting of their habitats. revealing an approximate Zipfian distribution 6.15 6.15 highlight the bimodal distribution of principal axiswith corri247-1.3 [29]. 11.4 Chicago, IL slope (B) Distribution of the rank-1 habitat, each individual’s mode location. (C) A robust diurnal cycle is observed A B dor messages (Figure 5A). in 300 the hourly time statuses are updated, 8.94of day at which Los Angeles, CA with those from the mode location (black curve) occurring more often Finally, a clear asymmetry is observed about the x/sx = than other locations in the morning and evening. 387 4.33 (red curve) San Francisco & Oakland, CA Probabilities sum to 1 for each curve, with bins for each hour. 0 axis indicating the increasingly isotropic variation in 6.1 6.1 Dashed · · · vertical · ·lines · denote midnight. · · · movement surrounding the home habitat, as compared to 468 0.492 Greenville, MS the work habitat. We suspect this to be a reflection of 469 0.491 Athens, OH the tendency to be more familiar with the surroundings of 6.05 6.05 470 0.465 Key West, FL one’s home, and to explore these surroundings in a more 471 0.381 El Centro Calexico, CA social context (Figure 5B). The symmetry observed when 472 0.312 Pullman, WA reflecting about the y/sy = 0-axis is strong, demonstrat6 ing 6 the remarkable consistency of the movement patterns Table 2: Top and Bottom 5 cities with respect to mean revealed by the data. In an effort to characterize the temporal and spatial 7 gyradius, along with examples from Figure 2. 5.95 5.95 structure observed in Figure 4, in Figure 6 we examine 1 10 100 1000 1 10 100 1000 Distance (km) from Expected Location User Radiusby of Gyration (km) locations frequently visited the most prolific members of by our data set, namely the author’s roughlyexpected 300 individuals Figure 8: Tweets are grouped into ten equally populated bins the distance from their location,for and
Average Word Happiness Average Word Happiness

6.15

6.15

A
Average Word Happiness
6.1

B
Average Word Happiness
1 10 100 1000 6.1

6.05

6.05

6

6

5.95

5.95

Distance (km) from Expected Location

1

User Radius of Gyration (km)

10

100

1000

Figure 8: Tweets are grouped into ten equally populated bins by the distance from their author’s expected location, and the average happiness of words written at each distance is plotted (A). Expressed happiness grows logarithmically with Figure 6: Tweets are grouped into ten equally populated bins by the distance from their author’s expected location, and the distance from expected location. A similar trend is observed when individuals are grouped into ten equally populated average happiness of words written at each distance is plotted (A). Expressed happiness grows logarithmically with distance bins by their gyradius, and all trend words by individuals each bin are gathered (B). These observed trends from expected location. A similar is authored observed when individualsin are grouped into ten equally populated bins by their persist through variation in binning and different measures of mobility. gyradius, and all words authored by individuals in each bin are gathered (B). These observed trends persist through variation in
binning and different measures of mobility.

at as the word differences between individuals with Danforth [27, as a means for investigating elements Figure 833] plots average word happinessthe against the of dis- Looking described follows. and smallest radii ofthe gyration in Fig. 7B, see that of language responsible for happiness differences between tance from expected location (A), and gyradius (B). two Start-largest Words appearing on right increase thewe happiness large texts. As an example, consider the difference between individuals in the large radius group author the negative words ing with location, we find that tweets written close to an the 2500km distance relative 1km distance. For example, ‘hate’, ‘damn’, ‘dont’, ‘mad’, ‘never’, ‘not’ and assorted protweets authored at distances of roughly 1km and 2500km away individual’s center of mass are slightly happier than those tweets authored far from an individual’s expected location fanity less frequently, and the positive words ‘great’, ‘new’, from an individual’s expected location. The average happiness written 1km away. The least happy words, on average, are are more likely to contain the positive words ‘beach’, scores for these two distances are havg = 5.96 and havg = 6.13 ‘dinner’, ‘hahaha’, and ‘lunch’ more frequently than the small used at a distance representative of a short daily commute ‘new’, ‘great’, ‘park’, ‘restaurant’, ‘dinner’, ‘resort’, respectively. Individual word contributions to this difference radius group. Going against the trend, the large radius group to work. Beyond this least happy distance, remarkably we ‘coffee’, ‘lunch’, ‘cafe’, and ‘food’, and less likely to uses the positive words ‘me’, ‘lol’, ‘love’, ‘like’, ‘funny’, are shown in Fig. 7A, and can be described as follows. find that happiness logarithmically with distance contain the negative wordsand ‘no’, ‘not’, ‘no’, ‘hate’, Words appearing onincreases the right increase the happiness of the ‘girl’, and ‘my’ less frequently, the ‘don’t’, negative words from expected location. Perhaps even more remarkably, ‘can’t’, ‘damn’, and ‘never’ than tweets posted close 2500km distance relative 1km distance. For example, tweets and ‘last’ more frequently. Comparing with other groups, the we find an almost identical trend when grouping together to home. Words going against the trend appear on authored far from an individual’s expected location are more large radius group authors an increased frequency of words inthe individuals rather than tweets, that happiness left, decreasing the the happiness of the ‘lunch’, 2500km‘restaudistance to eating, like words ‘dinner’, likely to contain the positive wordsobserving ‘beach’, ‘new’, ‘great’, reference also ‘restaurant’, increases logarithmically with gyradius. Individuals group to the 1km group. Tweets close to home andrelative ‘food’, and make less reference to traffic conges‘park’, ‘dinner’, ‘resort’, ‘coffee’, ‘lunch’, ‘cafe’, rant’, and ‘food’, and less likely tohappier contain the negative with a large radius use words than words those ‘no’, with ation. are more likely to contain the positive words ‘me’, ‘lol’, ‘don’t’, ‘not’, ‘hate’, ‘damn’, and ‘never’ than tweets the ‘haha’, two figures, we note that with smaller pattern of ‘can’t’, life. We find the trend observed in Fig- Comparing ‘love’, ‘like’, ‘my’, ‘you’, andindividuals ‘good’. Moving radius laugh (e.g ‘hahaha’) than those with a that smallthe posted to for home. going against the trend appear ure 8close holds 3 ofWords the 4 urban areas (Los Angeles, Sanlarge clockwise, themore three insets in Figure 9A show onFrancisco, the left, decreasing the happiness of the closer to theirthe expected location laugh to and Chicago) visualized in 2500km Figure 2distance (see Ap-radius, two but textindividuals sizes are comparable, biggest contributor more than those far from home. group relative to the 1km group. Tweets close to home are pendix). the happiness difference is the decrease in negative words These word differences reveal the relationship between an more likely to contain the positive words ‘me’, ‘lol’, ‘love’, authored by individuals very far from their expected ‘like’, ‘haha’, ‘my’,the ‘you’, and ‘good’. clockwise, the individual’s pattern of movement and their experiences. It is To explain difference inMoving expressed happiness location, and the 50 words listed make up roughly 50% three insets in Fig. 7A show that thegroups, two textwe sizes areto comsurprising to observe regular international travelers tweetexhibited by different mobility turn wordnotof the total difference between the two bags of words. parable, the biggest contributor to the happiness difference is ing the food they enjoy on vacation. Indeed, we expect shift graphs in Figure 9. Word shift graphs were intro- about Note that the relatively small differences havg the decrease in negative words authored by individuals very that individuals capable of tweeting at a great distance in duced by Dodds and Danforth [18, 19] as a means for scores reflect a small signal, yet one that we havefrom shown far from their expected location, and the 50 words listed make their expected location are more likely to benefit from an adinvestigating the elements of language responsible for previously can be resolved by our hedonometer [19]. up roughly 50% of the total difference between the two bags vantaged socioeconomic status, which they happily update fredifferences between two large texts. As anquently. Additional word shift comparisons that for expressed the four hapurban of happiness words. Previous work has demonstrated example, consider the difference between tweets authored correlates strongly with many socioeconomic indicaNote that the relatively small differences in havg scores re- piness areas investigated earlier are provided in the Appendix. at a distances of roughly 1km 2500km away from flect small signal, yet one that weand have shown previously canantors [26]. Nevertheless, setting aside these luxurious words, Looking at the word differences between individuals expected location. average happiness still see a general decline in of thegyration use of negative words aswe beindividual’s resolved by our hedonometer [27]. The Additional word shift wewith large and small radii in Figure 9B, scores forfor these two distances are havg = 5.96are andindividuals farther in from their expected In fact,the comparisons the four urban areas investigated earlier see that travel individuals the large radiuslocation. group author provided Figs. S6, S7. havg =in 6.the 13 Supplemental respectively. Material, Individual word contributionsof the four contributions to the difference in happiness between

radii.

to this difference are shown in Figure 9A, and can be

8

11

T
1

comp

T : Distance from Expected Location 1.08 (km) (h =5.96) ref avg : Distance from Expected Location 2638.95 (km) (h =6.13)
avg

T
1

comp

Tref: User Radius of Gyration 3.11 (km) (havg=6.01) : User Radius of Gyration 3188.08 (km) (h =6.10)
avg

no − shit − + me + love + lol beach + don’t − new +

shit − + me + lol ass −

5

5

+ love hate − + like great + − no

10

+ haha − terminal + like not − hate − international + die − can’t − + you park + great + bitch − ass − damn − never − nigga − restaurant + con − dont − dinner + bad − hell − home + coffee + + my lunch +

10

15

15

20

A

20

B

25

25

nigga − bitch − damn − new + dont − mad − dinner + never − not − hahaha + lunch + today + hell − gone − niggas − home + die − awesome + bad − − con bitches − food + restaurant + bored − tired − ill − + funny hurt −
Text size: Tref Tcomp

Word rank r

30

Word rank r
Text size: Tref Tcomp

30

35

+ hahaha ain’t − cafe + resort + + good food + sick − bored − mad − niggas − tired − − waiting − earthquake + sleep 100 + bed hotel +

35

10

0

10

0

+ girl photo + ugly − + my − falling stupid − + sleep sick − trip + cant − mean −
0
r i=1

40

10

1

40
Balance: −80 : +180 + +

10

1

Balance: −113 : +213 + +

10

2

10

2

45
10
3

45
10
3

10

4

0

10

4

50

r i=1

100

havg, i

h

− last fuckin −

50 −20

avg, i

−10

−5 0 5 Per word average happiness shift h (%) avg, r

10

−10 0 Per word average happiness shift h

avg, r

10 (%)

20

Figure 9: (Color online) Word shift graphs comparing (A) the lowest average word happiness distance from home group to the words authored farthest from home, which also has the largest average word happiness and (B) the Figure 7: (Color online) Word shift graphs comparing (A) the lowest average word happiness distance from home group to smallest gyradius group from with home, the largest group. The words in the word shifts from to bottom appear the words authored farthest which gyradius also has the largest average word happiness and (B) the top smallest gyradius in decreasing order of ranked percentage contribution to the overall average happiness difference ( D h ) of the group with the largest gyradius group. The words in the word shifts from top to bottom appear in decreasing orderavg of ranked two texts being compared. The +/- symbols whether the word antwo average happiness scoreThe that is symbols happy or sad percentage contribution to the overall averageindicate happiness difference (∆ havg ) has of the texts being compared. +/relative to thethe entire text . The symbols / # indicate whether arelative word was used more or in Tcomp relative to indicate whether word has T an average happiness" score that is happy or sad to the entire text T . The symbols ↑/↓ re f re fless indicate word was used panel more or less in Tcomp relative to usage in Tre f . The left inset panel shows how the top four usage whether in Tre f . aThe left inset shows how the ranked top contributing words to Dhavg combine in ranked sum. The contributing words to ∆right havg combine in total sum. contribution The four circles the lower right show(+ the"total the four text word circles in the lower show the of in the four word types , "contribution , + #, #). of Relative size is types (+ ↑ , − ↑ , + ↓ , − ↓ ) . Relative text size is represented by the grey squares. See [27] for further details and examples of represented by the grey squares. See [19] for further details and examples of word shift graphs.
word shift graphs. 9 12

words authored close to home vs. far from home, this decline in negative words when far from home is the largest component (bottom right inset, Fig. 7). Discussion Using 37 million geolocated tweets authored in 2011, this study characterizes the pattern of life of over 180,000 individuals in the United States. While observed mobility patterns agree qualitatively with previous work investigating cellphone data [2], we are able to connect movement patterns to changes in word usage for the first time. Our main finding is that expressed happiness increases logarithmically with both distance from expected location and gyradius, largely because individuals who travel farther use positive, food related words more frequently, and negative words and profanity less frequently. Several methodological issues are raised by the use of Twitter messages to characterize mobility and happiness. Considering Twitter as a source, we note that according to the Pew Internet & American Life Project, roughly 15% of adults in the U.S. were actively using Twitter at the end of 2011 [34]. While this fraction represents a substantial group of Americans, we have no data to quantify the demographic group represented by the subset of these 15% who specifically choose to geotag a large percentage of their messages. Nevertheless, since we threshold the sample to include individuals who have geolocated more than approximately 300 of their messages in 2011, we suspect that the large majority of individuals represented in our study regularly do so as a matter of daily life, as opposed to geolocating messages only when encountering a novel experience such as a vacation. Regarding word usage as a proxy for happiness, accessing the internal emotional state of individuals is beyond the scope of our instrument. We do believe however, that when aggregated, the words used by large groups of individuals reflect their culture in ways not captured by surveys or self-report. Indeed, we see the hedonometer as complementing more traditional economic methods for characterizing economic and societal health, such as the Gross Domestic Product or Consumer Confidence Index. Using the same collection of geolocated messages explored here, the hedonometer was recently employed by Mitchell et al. [26] to characterize trends in word usage for cities. Expressed happiness was shown to correlate to hundreds of demographic, socio-economic, and health measures, with interactive evidence available in the article’s online Appendix [35]. This work contributes to a growing body of literature aimed at observing, describing, modeling, and ultimately explaining the spatiotemporal dynamics of large-scale socio-technical systems. Natural extensions of this work might combine topological measures of network interactions with geospatial data to predict the likelihood of new links appearing in a social network [36], or to measure the spread of emotions through geographical and topological space [37]. The mobility patterns described here could also be combined with more traditional surveys (e.g. census data) to inform public policy regarding many important issues, for example relating to the ‘obesity epidemic’ and changes in word usage at the level of individual neighborhoods targeted by public health campaigns.

Methods In an effort at quality control for the geolocated messages, we identified and removed messages posted by robotic accounts and programmed tweeting services designed to automatically send tweets typically not reflecting information about human activity. Preliminary analyses revealed a noticeable presence of bots posting geolocated messages referring to weather, earthquakes, traffic, and coupons. We identified and ignored tweets collected from individuals for whom at least half of their tweets contained any of the words ‘pressure’, ‘humid’, ‘humidity’, ‘earthquake’, ‘traffic’ or ‘coupon’. Messages referencing Foursquare check-ins (typically of the form ‘I’m at starbucks http://4sq.com/qrel9g’) were retained for the purpose of characterizing the mobility profile of each individual. However, for results involving happiness, we ignored Foursquare check-in tweets as their content is unlikely to directly reflect sentiment. Finally, to ensure that individual movement profiles are based on a reasonably sized collection of locations, for this study we focus on individuals for whom we have at least 30 geolocated tweets. Given the uniformity of the random sample provided by the gardenhose, we can assume these individuals geolocated a minimum of approximately 300 status updates in 2011. For reasons of privacy, we ignored all user specific information including individual names. In addition, where the trajectories traced out by specific individuals are visualized, we obscured the coordinate system of reference. Tweets were assigned to urban areas as defined by the 2010 United States Census Bureaus MAF/TIGER (Master Address File/Topologically Integrated Geographic Encoding and Referencing) database [38]. The gyradius for individual a is defined as r(a) = 1 N (a) ( pi − p(a) )2 ∑ ( a ) N i=1
(a)
(a)

(1)

where the two-dimensional vector pi is the ith position in the trajectory of individual a, given by the geolocation of that individual’s ith tweet, as observed in our database. N (a) is the total number of tweets from individual a, and p(a) = (a) (a) 1/N (a) ∑N i=1 pi is the center of mass of their trajectory, which we denote their expected location. Note that if we consider each message to be a prediction of an individual’s location, then the gyradius is in fact the root mean square error (RMSE) of that prediction. Fig. S8 plots the Complementary Cumulative Distribution Function (CCDF) of the gyradii of all individuals. To compare the shape of individual trajectories, we normalize for both differences in gyradius and direction of trajectory. Considering each individual’s trajectory as a set of (x, y)pairs {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}, we calculate the two dimensional matrix known as the tensor of inertia, considering each point in a individual’s trajectory as an equally weighted mass at location (xi , yi ). We then find this tensor’s eigenvectors and eigenvalues. The eigenvector corresponding to the

10

largest eigenvalue represents the axis along which most of the individual’s trajectory occurs (hereafter called the individual’s principal axis). Previous work has demonstrated that for most individuals, this axis is parallel to the corridor between their work location and home [2, 4]. To normalize the different compass orientations of individual trajectories, we rotate the coordinate system of each individual so that their principal axis points due west. The expected location for each individual (x ¯, y ¯) is then used to translate their position vector, i.e. (xi − x ¯, yi − y ¯), to ensure that the shape of each individual’s trajectory is in a common frame of reference. However, the distances travelled by each individual vary widely despite their shared orientation (e.g. pedestrian vs. airline commute). In order to compare these trajectories, we calculate the standard deviation σx , σy for a given individual’s trajectory, and divide their x- and y-coordinates by σx and σy , respectively. For more information about this process, including a pair of example trajectory normalizations, see Figs. S9-S13. In an attempt to characterize time spent in each location, we (a) define the ith tweet habitat for individual a, denoted Hi , to be a circle within which individual a posted at least 10 messages [8]. The center of the circle is defined by the average position of all messages appearing in the habitat, and the radius of the circle is chosen such that each tweet posted within a habitat is at most 100 meters away from the center, and no habitats overlap. To measure the importance of habitat i to individual a, we count the number of messages appearing in (a) each tweet habitat and produce the ranking R(Hi ) for individual a. The probability that individual a tweets from habitat (a) Hi is P(Hi ) =
(a)

word ‘happy’ ‘hahaha’ ‘fresh’ ‘cherry’ ‘pancake’ ‘piano’ ‘and’ ‘the’ ‘of’ ‘down’ ‘worse’ ‘crash’ ‘:(’ ‘war’ ‘jail’

havg (wi ) 8.30 7.94 7.26 7.04 6.96 6.94 5.22 4.98 4.94 3.66 2.70 2.60 2.36 1.80 1.76

Table 1: Example language assessment by Mechanical Turk (labMT) [27, 30] words and scores. Words with neutral scores 4 < havg (wi ) < 6 are colored gray and ignored when assigning the happiness score to a large text.

References
[1] Simini, F., Gonz´ alez, M. C., Maritan, A., Barab´ asi, A.-L. A universal model for mobility and migration patterns. Nature. Vol. 484, No. 7392. (2012) [2] Gonz´ alez, M. C., Hidalgo, C. A., Barab´ asi, A. L. Understanding individual human mobility patterns. Nature. Vol. 453, pp. 779-782, (2008) doi:10.1038/nature06958 [3] Eagle, N., Macy, M., Claxton, R. Network Diversity and Economic Development. Science. Vol 328, No. 5981, pg. 1029-1031 (2010) [4] Song, C., Qu, Z., Blumm, N., Barabsi, A.-L. Limits of Predictability in Human Mobility. Science. Vol 327, No. 5968, pg. 1018-1021 (2010)

|Hi | N (a)

(a)

(2)

(a) where |Hi | is the number of tweet locations contained in [5] de Montjoye, Y.-A., Hidalgo, C.A., Verleysen, M., Blondel, V.D. Unique in the Crowd: The privacy bounds of human mobility. Nature Scientific (a) Reports. Vol 3, No. 1376. (2013) Hi . Notice that the habitat probabilities for individual a may not sum to one since it may be the case that individual a has [6] Wang, D., Pedreschi, D., Song, C., Gionatti, F., Barab´ asi, A.-L. Human mobility, social ties, and link prediction. Proceedings of the 17th ACM tweet locations that are not contained in a tweet habitat. HereSIGKDD international conference on Knowledge discovery and data minafter, we will refer to an individual’s most frequently visited, ing, pp 1100-1108 (2011) or rank-1 habitat, as their mode location. Using the labMT scores [27], we determine the average hap- [7] Brockmann, D. D., Hufnagel, L. & Geisel, T. The scaling laws of human travel. Nature 439, 462-465 (2006). piness (havg ) of a given text T containing N unique words by

havg (T ) =

∑N i=1 havg (wi ) · f i ∑N i=1 f i

= ∑ havg (wi ) · pi
i=1

N

(3)

[8] Bagrow, J. P., Lin Y-R. Mesoscopic Structure and Social Aspects of Human Mobility. PLoS ONE 7(5): e37676. (2012) doi:10.1371/journal.pone.0037676

where fi is the frequency with which the ith word wi , for which we have an average word happiness score havg (wi ), occurred in text T . The normalized frequency of wi is then given by pi = fi / ∑N i=1 f i . The hedonometer instrument can be tuned to emphasize the most emotionally charged words by removing words within ∆havg of the neutral score of havg = 5. It has been shown that ignoring these neutral words with 4 < havg (wi ) < 6 provides a good balance of sensitivity and robustness, and thus we chose ∆havg = 1 for this study [27].

[9] Ramos-Fernandez, G. et al. L´ evy walk patterns in the foraging movements of spider monkeys (Ateles geoffroyi). Behav. Ecol. Sociobiol. 273, 1743-1750 (2004) [10] Palla, G., Barab´ asi, A.-L. & Vicsek, T. Quantifying social group evolution. Nature 446, 664-667 (2007). [11] Hidalgo, C. A. & Rodriguez-Sickert, C. The dynamics of a mobile phone network. Physica A 387, 3017-3024 (2008). [12] Barab´ asi, A.-L. The origin of bursts and heavy tails in human dynamics. Nature 435, 207-211 (2005). [13] Schlich, R. & Axhausen, K.W. Habitual travel behaviour: Evidence from a six-week travel diary. Transportation 30, 13-36 (2003). [14] Eagle, N. & Pentland, A. Eigenbehaviours: identifying structure in routine. Behav. Ecol. Sociobiol. (in the press).

11

[15] Klafter, J., Shlesinger, M. F. & Zumofen, G. Beyond Brownian motion. Phys. Today 49, 33-39 (1996). [16] Gonzalez, M. C., Lind, P. G. & Herrmann, H. J. A system of mobile agents to model social networks. Phys. Rev. Lett. 96, 088702 (2006). [17] Horner, M. W. & O-Kelly, M. E. S Embedding economies of scale concepts for hub networks design. J. Transp. Geogr. 9, 255-265 (2001). [18] Kitamura, R., Chen, C., Pendyala, R. M. & Narayaran, R. Microsimulation of daily activity-travelpatterns for travel demand forecasting. Transportation 27,25-51 (2000). [19] Colizza, V., Barrat, A., Barthelemy, M., Valleron, A.-J. & Vespignani, A. Modeling the worldwide spread of pandemic influenza: Baseline case and containment interventions. PLoS Medicine 4, 95-110 (2007). [20] Eubank, S. et al. Controlling epidemics in realistic urban social networks. Nature 429, 180-184 (2004). [21] Hufnagel, L., Brockmann, D. & Geisel, T. Forecast and control of epidemics in a globalized world. Proc. Natl Acad. Sci. USA 101, 1512415129 (2004). [22] Hedstrom, P. Experimental Macro Sociology: Predicting the Next Best Seller. Science 311 (5762), 786-787. (2006). [DOI:10.1126/science.1124707] [23] Tumasjan, A., et al. Predicting elections with Twitter: What 140 characters reveal about political sentiment. Proceedings of the fourth international aaai conference on weblogs and social media. (2010). [24] Kirilenko, A., et al. The flash crash: The impact of high frequency trading on an electronic market. Available at SSRN 1686004 (2011). [25] http://wheresgeorge.com [26] Mitchell, L., Frank, M. R., Harris, K. D., Dodds, P. S., Danforth, C. M. The Geography of Happiness: Connecting Twitter sentiment and expression, demographics, and objective characteristics of place. arXiv. (2013). http://arxiv.org/abs/1302.3299 [27] Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., Danforth, C. M. Temporal Patterns of Happiness and Information in a Global-Scale Social Network: Hedonometrics and Twitter. PLoS ONE. 6(12): e26752. (2011). doi:10.1371/journal.pone.0026752 [28] http://www.uvm.edu/storylab/share/papers/frank2013a [29] Zipf, G. Relative frequency as a determinant of phonetic change, Harvard Studies in Classical Philology. (1929). [30] Kloumann, I. M., Danforth, C. M., Harris, K. D., Bliss, C. A., Dodds, P. S. Positivity of the English Language. PLoS ONE 7(1): e29484. (2012). doi:10.1371/journal.pone.0029484 [31] Amazon’s Mechanical Turk service. https://www.mturk.com/. Accessed October 24, 2011. Available at

Acknowledgements We would like to thank for their helpful comments and input Brian Tivnan, James Bagrow, Yu-Ru Lin, Taylor Ricketts, Austin Troy, and Lisa Aultman-Hall. In addition, we would like to thank MITRE Corporation, the UVM Transportation Research Center, the Vermont Complex Systems Center, and the Vermont Advance Computing Core for funding. Author Contributions C.M.D. and P.S.D. designed the research, M.R.F. prepared the figures, and C.M.D. and M.R.F. wrote the manuscript. M.R.F., L.M., P.S.D., and C.M.D analyzed the data and reviewed the manuscript. Competing Financial Interests The authors declare no competing financial interests.

[32] Rand, D. G. The promise of Mechanical Turk: How online labor markets can help theorists run behavioral experiments. J Theor Biol. (2011). [33] Dodds, P. S., Danforth, C. M. Measuring the Happiness of Large-Scale Written Expression: Songs, Blogs, and Presidents. Journal of Happiness Studies. (2009). dpi: 10.1007/s10902-009-9150-9 [34] Aaron S., Joanna B. Twitter Use 2012. Technical report, Pew Research Institute, 2012. [35] http://www.uvm.edu/storylab/share/papers/mitchell2013a [36] Onnela, J.-P., Arbesman, S., Gonzlez, M. C., Barabsi, A.-L., Christakis,N. A. Geographic Constraints on Social Network Groups. PLoS ONE 6(4): e16939. (2011) doi:10.1371/journal.pone.0016939 [37] Bliss, C. A., Kloumann, I. M., Harris, K. D., Danforth, C. M., Dodds, P. S. Twitter reciprocal reply networks exhibit assortativity with respect to happiness. Journal of Computational Science 3(5), pp. 388-397, (2012). http://dx.doi.org/10.1016/j.jocs.2012.05.001 [38] U.S. Census Bureau Geography Division. 2010 Census TIGER/Line Shapefiles. http://www.census.gov/geo/www/tiger/ tgrshp2010/tgrshp2010.html, accessed February 2013.

12

Figure Captions Figure 1: Each point corresponds to a geolocated tweet posted in 2011. Twitter activity is most apparent in urban areas. Note that the image contains no cartographic borders, simply a small dot for each message. Legend: A (U.S.), B (Washington, D.C.), C (Los Angeles, C.A.), and D (Earth). Figure 5: Representing the approximately 300 individuals for whom we have at least 800 geolocated messages, we plot the probability of tweeting from a habitat as a function of the tweet habitat rank (A). Each dot represents a single individual’s likelihood of tweeting from one of their habitats. The axes are logarithmic, revealing an approximate Zipfian distribution with slope -1.3 [29]. (B) Distribution of the rank-1 habitat, each individual’s mode location. (C) A robust diurnal cycle is observed in the hourly time of day at which statuses are updated, with those from the mode location (black curve) occurring more often than other locations (red curve) in the morning and evening. Probabilities sum to 1 for each curve, with bins for each hour. Dashed vertical lines denote midnight.

Figure 2: (Color online) The gyradius, calculated for each individual, is shown for each tweet authored in four example cities. Reflecting the pattern of urban life, we find messages authored by large radius individuals to be more likely to appear in the main downtown area of each city, while messages authored by small radius individuals tend to appear outside of the main urban area. Histograms of gyradii for each city are shown in Fig. S1, along with tweet locations colored by distance from expected location (Fig. S2).

Figure 3: (Color online) The probability density function of observing an individual in their normalized reference frame, where the origin corresponds to each individual’s expected location, and σy = 0 corresponds to their principle axis. This map shows the positions of over 37,000 individuals, each with more than 50 locations, in their intrinsic reference frame.

Figure 6: Tweets are grouped into ten equally populated bins by the distance from their author’s expected location, and the average happiness of words written at each distance is plotted (A). Expressed happiness grows logarithmically with distance from expected location. A similar trend is observed when individuals are grouped into ten equally populated bins by their gyradius, and all words authored by individuals in each bin are gathered (B). These observed trends persist through variation Figure 4: Looking at messages authored in the principle axis in binning and different measures of mobility. y 30 | < 1000 , we observe a clear separation corridor, defined by | σ y between the most likely and second most likely position (A). The distribution is skewed left, with movement in a heading opposite an individual’s work/home corridor observed to be highly unlikely. In addition, due to the normalization, we see that individuals are much more likely to tweet slightly east of their expected location than slightly west. The isotropy ratio Figure 7: (Color online) Word shift graphs comparing (A) the (B) measures the change in the density’s shape as a function lowest average word happiness distance from home group to of gyradius, with large radius individuals exhibiting a less cir- the words authored farthest from home, which also has the cular pattern of life. Standard errors are plotted, but are only largest average word happiness and (B) the smallest gyradius visible for the largest radius group. The isotropy ratio decays group with the largest gyradius group. The words in the word logarithmically with radius. shifts from top to bottom appear in decreasing order of ranked

percentage contribution to the overall average happiness difference (∆havg ) of the two texts being compared. The +/- symbols indicate whether the word has an average happiness score that is happy or sad relative to the entire text Tre f . The symbols ↑ / ↓ indicate whether a word was used more or less in Tcomp relative to usage in Tre f . The left inset panel shows how the ranked top contributing words to ∆havg combine in sum. The four circles in the lower right show the total contribution of the four word types (+ ↑, − ↑, + ↓, − ↓). Relative text size is represented by the grey squares. See [27] for further details and examples of word shift graphs.

13

Supplementary Material: Happiness and the Patterns of Life: A Study of Geolocated Tweets Morgan R. Frank, Lewis Mitchell, Peter S. Dodds, Christopher M. Danforth
Computational Story Lab, Department of Mathematics and Statistics, Vermont Complex Systems Center, Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America

New York City

Los Angeles

0.3 0.25 Probability 0.2 0.15 0.1 0.05 0 −2 −1 0 1 2 Log−10 User Radius of Gyration Chicago 3 Probability

0.3 0.25 0.2 0.15 0.1 0.05 0 −2 −1 0 1 2 Log−10 User Radius of Gyration San Francisco Bay 3

0.3 0.25 Probability 0.2 0.15 0.1 0.05 0 −2 −1 0 1 2 Log−10 User Radius of Gyration 3 Probability

0.3 0.25 0.2 0.15 0.1 0.05 0 −2 −1 0 1 2 Log−10 User Radius of Gyration 3

Figure A7: The distributions of gyradius (km) for four cities appear to be log-normal. The mode distance (binned) is Figure S1: The distributions of gyradius (km) for four cities larger for Los Angeles and San Francisco than for Chicago and New York City. We note that these distributions were appear to be log-normal. The mode distance (binned) is larger calculated for all individuals whose expected location fell within the latitude and longitude bounds of Figure 2, and for Los Angeles and San Francisco than for Chicago and New thus reflect a modified set of individuals than those identified with cities in Figure 3. York City. We note that these distributions were calculated for all individuals whose expected location fell within the latitude and longitude bounds of main text Fig. 2, and thus reflect a modified set of individuals than those identified with cities in Fig. S3.

19

14

Figure S2: (Color online) The distance from expected location, calculated for each individual, is shown for each tweet authored in four example cities in 2011. Messages authored far from this location are more likely to appear in the main downtown area of each city (tourists and long-distance commuters), while messages authored close to the expected location tend to appear outside of the main urban area.

15

2.5 2 1.5 1 0.5 0 −0.5 −1 2.5 Chicago San Francisco Los Angeles New York City 3 3.5 4 4.5 5 Log−10 City Population 5.5 6 6.5

2.5

Log−10 Mean Radius of Gyration (km)

Log−10 Mean Radius of Gyration (km)

A

2 1.5 1 0.5 0 −0.5 −1 4

B

Chicago San Francisco Los Angeles New York City 4.5 5 5.5 6 Log−10 City Land Area (km) 6.5 7

Figure 3: (Color online) The mean gyradius of individuals whose expected location falls within each city is plotted Figure online) The gyradius of individuagainst theS3: city’s(Color population (A) and land mean area (B). Shown are cities containing at least 50 individuals with a nonzero als whose location falls at within each city istweets. plotted gyradius, each expected individual having authored least 30 geolocated City boundaries are defined by [30] which encompasses a city’s smallerpopulation area for the four cities in(B). Figure 2. Generally, against the (A) andillustrated land area Shown are gyradius increases with city population and land containing area, with no large cities exhibiting a small meanaradius. Pearson correlations: Population r = 0.10, p = 0.03, cities at least 50 individuals with nonzero gyraLand Area r = 0.24, p = 2 ⇥ 10 7 .

dius, each individual having authored at least 30 geolocated tweets. City boundaries are defined by [38] which encomthe Pew Internet & American roughly 15% to in hundreds of demographic, socio-economic, and health passes a smaller area forLife the Project, four cities illustrated the main of adults in the U.S. were actively using Twitter at the end measures, with interactive evidence available in the artitext Fig. 2. Generally, gyradius increases with city populaof the year during which we collected data [42]. While cle’s online Appendix1 . tion and land area, awith no large cities exhibiting a small mean this fraction represents substantial group of Americans, radius. correlations: Population = 0.10, p = 0.03, we have noPearson data to quantify the demographic group ρ repre−7 . sented by the subset of these 15% who specifically choose 3 Results Land Area ρ=0 .24 , p =2 × 10

to geotag a large percentage of their messages. Nevertheless, since we threshold the sample to include individu- In Figure 2, we investigate the geographical distribution In Fig. we plot the mean gyradius vs area for each als who have S3, geolocated more than approximately 300land of of movement in four urban areas by plotting a dot for city in the U.S. as defined bythat [38]. there iseach considerable their messages in 2011, we suspect the While large majority tweet, colored by the gyradius of its author. Clockof individuals represented in our study regularly do so as correlation scatter among the cities, we do observe a weak inwise from the top left, cities are displayed in order of their a dicating matter of daily as opposedliving to geolocating messages apparent aggregate gyradius, with New York City seemthat life, individuals in more populated and larger only when encountering a novel experience such as a va- ingly exhibiting a smaller radius than the San Francisco areas travel farther. Table S1 shows the top and bottom cities cation. Bay Area. Reflecting the pattern of urban life, we find with respect to usage meanas gyradius, ashappiness, well as the citiesauthored dis- by large radius individuals to be more Regarding word a proxy for ac- four messages cussed above. cessing the internal emotional state of individuals is be- likely to appear in the main downtown area of each city, yond the scope of our instrument. We do believe however, while messages authored by small radius individuals tend that when aggregated, words used by large groups rank radiusthe (km) city of to appear in less densely populated areas. For example, individuals culture in ways not captured by VA in Chicago, many individuals writing from downtown ex1 reflect their 200.6 Martinsville, surveys or self-report. Indeed, we see the hedonometer hibit an order of magnitude greater radius than individuals 2 124.5 Middletown, OH as complementing more traditional economic methods for posting in areas outside of the city. 3 112.3 IN characterizing economic and societal health, Elkhart, such as the In the greater Los Angeles area, we see several clusters 4 98.8 Pottstown, Gross Domestic Product or Consumer Confidence Index.PA of individuals with larger radius in downtown Los AngeUsing the ex5 same collection 96.6 of geolocated messages Decatur, IL les, as well Long Beach, Santa Monica, and Disneyland plored here, was recently employed · · · the hedonometer ··· · · · by in Anaheim, while less densely populated areas are seen Mitchell et al. [17] to characterize trends in word usage 215 13.3 New York City, NY 1 http://www.uvm.edu/storylab/share/papers/ for cities. Expressed happiness was shown to correlate

247 300 387 ··· 468 469 470 471 472

11.4 8.94 4.33 ··· 0.492 0.491 0.465 0.381 0.312

Chicago, IL mitchell2013a Los Angeles, CA San Francisco & Oakland, CA 6 ··· Greenville, MS Athens, OH Key West, FL El Centro Calexico, CA Pullman, WA

Table S1: Top and Bottom 5 cities with respect to mean gyradius, along with the four cities investigated in main text Fig 2.

16

New York City

Los Angeles

Average Word Happiness

6

Average Word Happiness
1 10 100

6.05

6.05

6

5.95

5.95

5.9

5.9

5.85

5.85

Distance (km) from Expected Location

Distance (km) from Expected Location

1

10

100

Chicago

San Francisco Bay Area

Average Word Happiness

6

Average Word Happiness
1 10 100

6.05

6.05

6

5.95

5.95

5.9

5.9

5.85

5.85

Distance (km) from Expected Location

Distance (km) from Expected Location

1

10

100

Figure A8: For New York City, Los Angeles, Chicago, and the San Francisco Bay Area, we group messages into Figure S4: For sized New York City, Angeles, Chicago, and the San Bay Area, we group messages into equally sized equally bins by theLos distance from expected location ofFrancisco their author, and measure the average word happiness of bins byeach the distance from expected location of their author, and measure average wordthe happiness of of each group. group. These plots exhibit similar trends to that observed inthe Figure 8A with exception New YorkThese City. plots exhibit similar trends to that observed in main text Fig. 6A with the exception of New York City.

20 17

New York City

Los Angeles

Average Word Happiness

6

Average Word Happiness
1 10 100

6.05

6.05

6

5.95

5.95

5.9

5.9

5.85

5.85

User Radius of Gyration (km)

User Radius of Gyration (km)

1

10

100

Chicago

San Francisco Bay Area

Average Word Happiness

6

Average Word Happiness
1 10 100

6.05

6.05

6

5.95

5.95

5.9

5.9

5.85

5.85

User Radius of Gyration (km)

User Radius of Gyration (km)

1

10

100

Figure A9: For New York City (130 individuals/bin), Los Angeles (175 individuals/bin), Chicago (125 individuFigure S5: For New York City (130 individuals/bin), Los Angeles (175 individuals/bin), Chicago (125 individuals/bin), and als/bin), and the San Francisco Bay Area (63 individuals/bin), we group individuals into equally sized bins by their the San gyradius Francisco Bay Area (63 individuals/bin), we group individuals into equally sized binssimilar by their gyradius and measure and measure the average word happiness of each group. These plots exhibit trends to that observed the average word 8B happiness ofexception each group. These plots exhibit similar trends that observed in main Fig. 6B with in Figure with the of the largest radius group in the San to Francisco Bay Area, and text New York City asthe a exception of the largest radius group in the San Francisco Bay Area, and New York City as a whole. whole.

21 18

T
1

Tref: User Radius of Gyration 6.55 (km) (havg=6.01) : User Radius of Gyration 292.03 (km) (havg=6.06) comp
shit − car + + lol − no 1

T

Tref: User Radius of Gyration 13.26 (km) (havg=6.02) : User Radius of Gyration 292.03 (km) (havg=6.06) comp
car + shit − me + die − − no traffic − + haha + love + lol

5 + love

me + ass − hate − bitch − nigga − + like dont − + haha damn − mad − never − − con niggas − hell − great + tired − gone − bitches − google + bored − hurt − cant − sick − − last weekend + + sleep stupid − ill − home + awesome + + my + girl + funny ugly − cry − lie − + baby
10
2

5

10

10

hate − ass − bitch − dont − + like − con rent − accident − not − nigga − google + mad − never − damn − hell − bad − home + − miss + good tired −

15

15

20

20

Word rank r

25

Word rank r

25

30

30

+ bed + you awesome + dinner + lunch + ill − bitches − bored − delay − zit − − last sorry − great + sick − cant − fuckin − + sleep niggas − cry − stop − stupid −
Balance: −119 : +219 + +

35

35
Text size: Tref Tcomp

10

0

10

0

Text size: Tref Tcomp

40

10

1

40
Balance:

10

1

45
10
3

fuckin − mean − fun + dinner + + you lunch + bad − hungry −

−125 : +225 + +

10

2

45
10
3

10

4

0

50 −20

r i=1

100

10

4

0

havg, i

50

r i=1

100

havg, i

−10 0 10 Per word average happiness shift h (%) avg, r

20

−15 −10 −5 0 5 Per word average happiness shift h

avg, r

10 (%)

15

Figure A10: (Color online) We compare thekm 6.55 km gyradius the 292.03 km gyradius (Left). We that Figure S6: (Color online) We compare the 6.55 gyradius group group versusversus the 292.03 km gyradius groupgroup (Left). We find find that the 292.03 km group has relatively frequent use of the words ‘car’ and ‘weekend’ suggesting that this group the 292.03 km group has relatively frequent use of the words ‘car’ and ‘weekend’ suggesting that this group travels on the travels on the to weekends perhaps vacation home use of(Right) the word ‘home’. (Right) We km compare weekends perhaps a vacation home to as asuggested by useas ofsuggested the word by ‘home’. We compare the 13.26 gyradius the 13.26 km gyradius group versus the 292.03 km gyradius group. We find that the 292.03 km group uses the group versus the 292.03 km gyradius group. We find that the 292.03 km group uses the word ‘car’ more frequentlyword than the more frequently than the 13.26 which, interestingly, usesAgain the word more frequently. Again 13.26‘car’ km group which, interestingly, uses km the group word ‘traffic’ more frequently. the ‘traffic’ increased relative usage of thesethe words increased relative usage of these words seems fitting for a groups with these patterns of movement. seems fitting for a groups with these patterns of movement. 22 19

T
1

comp

Tref: User Radius of Gyration 4.79 (km) (havg=5.87) : User Radius of Gyration 223.64 (km) (havg=6.06)
shit − ass − + lol hell − nigga − − last fun + cant − great + gone − + me wait − − sick + life 1

T

comp

Tref: User Radius of Gyration 0.87 (km) (havg=5.98) : User Radius of Gyration 123.54 (km) (havg=6.07)

+ me shit − no − problem −

5

5

− bitch − lost haha + + sleep scared −

10

10

− last win + − ill − sorry

15 − die

wrong − not − ill −

15 − dont − kill

ugh − hate −

20

− down + like + my + sleep + cute

20

damn − bored − nothing − weekend + cheated − + hope funny + lol + cough − pain − + starting faggot − ugly − fell − bitches − dead − − poor injury − − bomb − mean − mad + super family + + playing fuckin −
10
2

Word rank r

25

30

weekend + death − won + song + game + thanks + mad − coffee + + money worst −

Word rank r
Text size: Tref Tcomp

weak −

25

30

35

+ sex shooting − fuckin −
10
0

35

− killing new + − fight good + lunch + + girl bitch − win + best + war − problem − evil − idiot −

10

0

Text size: Tref Tcomp

40

10

1

40
Balance:

10

1

Balance: −430 : +530 + +

10

2

− blocked lies − like + surgery − shame − cool +
0
r i=1

45
10
3

−152 : +252 + +

45
10
3

10

4

0

50

r i=1

100

10

4

100

havg, i

50 −30

havg, i

+ girls bad −

−10 −5 0 5 Per word average happiness shift h

10 (%)

avg, r

−20 −10 0 10 Per word average happiness shift h

avg, r

20 (%)

30

Figure A12: (Color online) (Left) A word shift comparing the 4.79 km gyradius group to the 223 km gyradius for Figure S7: (Color online) (Left) A word comparing the 4.79 km gyradius group to the 223 gyradius for Chicago. Chicago. We observe the first groupshift is less happy because of increased usage of profanity and km negative words like We observe the first group is less happy because of increased usage of profanity and negative words like ‘can’t’, ‘gone’, and ‘can’t’, ‘gone’, and ’wrong’. (Right) A word shift comparing the .87 km gyradius group to the 123.54 km gyradius ’wrong’. (Right) word comparing the .87 km gyradius group to the km gyradius for the San Francisco Bay group for A the San shift Francisco Bay Area. We find the second group to123.54 be happier because group of an increase in positive Area. We findlike the ‘haha’, second‘win’, group‘weekend’, to be happier because an along increase inapositive words like ‘haha’, ‘win’, ‘weekend’, ‘funny’, words ‘funny’, and of ‘lol’, with decrease in negative words like ‘no’, ‘problem’, and ‘lol’, along with a decrease in negative words like ‘no’, ‘problem’, and ‘hate’. and ‘hate’. 24

20

10

0

10 CCDF 10

−2

−4

10

−3

10 10 10 Radius of Gyration (km)

−1

1

3

Figure S8: Complementary Cumulative Distribution Function (CCDF) for the gyradii of all users with at least 30 geotagged messages. Gonzalez [2] found this distribution to be well modeled by a truncated power law, exponential tail.

21

Latitudinal Distance (km) from Expected Location

1 0 −1 −2 −3 −4 −5 −6 −7 −8 −15 −10 −5 0 5 Longitudinal Distance (km) from Expected Location

Latitudinal Distance (km) from Expected Location

Normalizing Human Trajectory To compare the shape of trajectories of individuals traveling in different directions and over different distances, we use the methods introduced by Gonz´ alez et al. [2]. We will examine the normalization steps for two individuals we will call user A and user B. We have 768 geolocated tweets for user A and 1,882 geolocated tweets for user B. User A has gyradius rA = 463.61 km and user B has gyradius rB = 54.28 km. Fig. S9 represents the geospatial tweet locations for user A and user B, but we shifted their coordinate system to maintain their anonymity. We have also allowed for a slight spatial separation between the locations for user A and the locations of user B for clarity.
4
2.07 2.065

User A

User B 1.2 1 0.8 0.6 0.4 0.2 0 −0.2 −1 −0.5 0 0.5 Longitudinal Distance (km) from Expected Location

aa aa

User A User B

2

2.06 2.055 2.05 2.045 2.04

a a ⌦ ⌦ B B

0 Latitude Degrees

2.035 2.54 2.56 2.58 2.6 2.62 2.64

−2


−4


2

B

B

B

B

B

B

B

Figure S10: (Color online) Tweet locations for User A and User B transformed to the distance in kilometers from their expected locations, respectively.

1.5 1 0.5

−6

0 −0.5 6

7

8

9

10

−8 −8

−6

−4

−2

0 2 Longitude Degrees

4

6

8

10

Figure 8: Tweet locations for User A and User B.

Figure S9: (Color online) Tweet locations for User A and User B.
8 4 User A User B User A User B

5 User A

User B In Fig. S10, we apply the linear transformation shifting Principal Axis A Principal Axis B each location for the user to the distance in kilometers from their center of mass, i.e. the expected location of the user. The difference in gyradius between user A and user B is still 0 very apparent in the axis ranges for this plot. Notice that the directional relationships between the tweet locations for each user have still been preserved. We can see that user A travels predominantly in a southwest direction, while user B travels −5 primarily in a northwest direction. Figure 12: The rotated tweet User of A and User To normalize direction of travel, let locations the of set Figure 11: The results after rotating the for locations of User B after normalizing for radius of gyration. The origin repA and User B. We see that they now both principal tweet locations forhave user i be represented by the set of resents the center of mass of the respective individuals’ axes of trajectory pointing due west. <~ pa tweet > from equation (2). equally weighted masses at trajectory, each namely of the locations −10 {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}. Now we calculate the tensor −16 −14 −12 −10 −8 −6 −4 −2 0 2 4 Longitudinal Distance (km) from Expected Location of inertia (I ) for each set of weighted (x, y)-points as  n 15  n 2 − x y x ∑ j j  ∑ j Figure S11: (Color online) The tweet locations for User A and   j=1 j=1 I= n  n User B along with a line representing the principal axis for that   − ∑ x jy j ∑ y2j user.
6 3 Latitudinal Distance (km) from Expected Location 2 2 1 0 0 −2 −1 −4 −2 −6 −3.5 −3 −2.5 −2 −16 −14 −12 −10 −8 −6 −4 −2 Longitudinal Distance (km) from Expected Location 0 2 −1.5 x/σx −1 −0.5 0 0.5

j=1

j=1

The eigenvector of I corresponding to the largest eigenvalue of I represents the direction along which most of user i’s trajectory occurs; we call this the principal axis for user i (see Fig. S11). 22

Latitudinal Distance (km) from Expected Location

4

y/σy

Now we can determine the angle necessary to rotate the set of their normalized tweet locations in two main clusters. of points for user i so that the the resulting principal axis is the x-axis. Fig. S12 shows the results of this step. We see that the principal axis for user A and user B is now the x-axis.
4 User A User B 3 Latitudinal Distance (km) from Expected Location

2

1

0

−1

−2

−16

−14

−12 −10 −8 −6 −4 −2 Longitudinal Distance (km) from Expected Location

0

2

Figure S12: (Color online) The results after rotating the locations of User A and User B. We see that they now both have principal axes of trajectory pointing due west. The final step is to normalize for individuals with different gyradius. We accomplish this by dividing the x-coordinate of each rotated tweet location for user i by σx , where σx is the standard deviation of the x-coordinates of the rotated tweet locations for user i, and similarly dividing by σy for the ycoordinates. The final result is shown in Fig. S13.
8 User A User B 6

4

2 y/σy 0 −2 −4 −6 −3.5

−3

−2.5

−2

−1.5 x/σx

−1

−0.5

0

0.5

Figure S13: (Color online) The rotated tweet locations of User A and User B after normalizing for gyradius. The origin represents the center of mass of the respective individuals’ trajectory, namely pa from equation (2). As a result, we can compare the shape of the trajectories for User A and User B having normalized for direction and gyradius. We can see that both User A and User B have most 23

Sign up to vote on this title
UsefulNot useful