The Geography of Happiness: Connecting Twitter sentiment and expression, demographics, and objective characteristics of place

Published by Sarah Blaskovich
Lewis Mitchell, Morgan R. Frank, Kameron Decker Harris, Peter Sheridan Dodds, and Christopher M. Danforth
Published by: Sarah Blaskovich on Feb 20, 2013
Copyright:Attribution Non-commercial

Lewis Mitchell,¹ Morgan R. Frank,¹ Kameron Decker Harris,¹ Peter Sheridan Dodds,¹ and Christopher M. Danforth¹
¹ Department of Mathematics & Statistics, Vermont Complex Systems Center, Computational Story Lab, & the Vermont Advanced Computing Core, The University of Vermont, Burlington, VT 05401.
(Dated: February 20, 2013)

We conduct a detailed investigation of correlations between real-time expressions of individuals made across the United States and a wide range of emotional, geographic, demographic, and health characteristics. We do so by combining (1) a massive, geo-tagged data set comprising over 80 million words generated over the course of several recent years on the social network service Twitter and (2) annually-surveyed characteristics of all 50 states and close to 400 urban populations. Among many results, we generate taxonomies of states and cities based on their similarities in word use; estimate the happiness levels of states and cities; correlate highly-resolved demographic characteristics with happiness levels; and connect word choice and message length with urban characteristics such as education levels and obesity rates. Our results show how social media may potentially be used to estimate real-time levels and changes in population-level measures such as obesity rates.
Electronic addresses: lewis.mitchell@uvm.edu (corresponding author), morgan.frank@uvm.edu, kamerondeckerharris@gmail.com, peter.dodds@uvm.edu, chris.danforth@uvm.edu
arXiv:1302.3299v2 [physics.soc-ph] 19 Feb 2013

I. INTRODUCTION

With vast quantities of real-time, fine-grained data describing everything from transportation dynamics and resource usage to social interactions, the science of cities has entered the realm of the data-rich fields. While much work and development lies ahead, the opportunity to scientifically engage with urban phenomena has now become broadly available to quantitatively-minded researchers [5]. And with over half the world’s population now living in urban areas, and this proportion continuing to grow, cities, long central to human society, will only become more so [22]. Our focus here concerns one of the many important questions we are led to continuously address about cities: how does living in urban areas relate to well-being? Such an undertaking is part of a general program seeking to quantify and explain the evolving cultural character (the stories) of cities, as well as geographic places of larger and smaller scales.

Numerous studies on well-being are published every year. The UN’s 2012 World Happiness Report attempts to quantify happiness on a global scale using a ‘Gross National Happiness’ index which uses data on rural-urban residence and other factors [28]. In the US, Gallup and Healthways produce a yearly report on the well-being of different cities, states and congressional districts [19], and maintain a well-being index based on continual polling and survey data [3]. Other countries are beginning to produce metrics measuring well-being: in 2012, surveys measuring national well-being and how it relates to both health and where people live were conducted in the United Kingdom by the Office of National Statistics [4,27] and in Australia by Fairfax Media and Lateral Economics [16].

While these and other approaches to quantifying the sentiment of a city as a whole rely almost exclusively on survey data, there is now a range of complementary, remote-sensing methods available to researchers. The explosion of the amount and availability of data relating to social network use in the past 15 years has driven a rapid increase in the application of data-driven techniques to the social sciences and sentiment analysis of large-scale populations.

Our overall aim in this paper is to investigate how geographic place correlates with and potentially influences societal levels of happiness. In particular, after first examining happiness dynamics at the level of states, we will explore urban areas in the United States in depth, and ask if it is possible to (a) measure the overall average happiness of people located in cities; and (b) explain the variation in happiness across different cities. Our methodology for answering the first question uses word frequency distributions collected from a large corpus of geolocated messages or ‘tweets’ collected from Twitter, with individual words scored for their happiness independently by users of Amazon’s Mechanical Turk service [2]. This technique was introduced by Dodds and Danforth (2009) [10] and greatly expanded upon in Dodds et al. (2011) [11], as well as tested for robustness and sensitivity. In attempting to answer the second question of happiness variability, we examine how individual word usage correlates with happiness and various social and economic factors. To do this we use the ‘word shift graph’ technique developed in [10,11], as well as correlate word usage frequencies with traditional city-level census survey data. As we will show, the combination of these techniques produces significant insights into the character of different cities and places.

We structure our paper as follows. In Section II, we describe the data sets and our methodology for measuring happiness. In Section III we measure the happiness of different states and cities and determine the happiest and saddest states and cities in the US, with some analysis of why places vary with respect to this measure. In Section IV we compare our results for cities with census data, correlating happiness and word usage with common economic and social measures. We also use the word frequency distributions to group cities by their similarities in observed word use. We conclude with a discussion in Section V.

II. DATA AND METHODOLOGY

We examine a corpus of over 10 million geotagged tweets gathered from 373 urban areas in the contiguous United States during the calendar year 2011. This corpus is a subset of Twitter’s garden hose feed, and represents roughly 10% of all geotagged tweets posted in 2011. Urban areas are defined by the 2010 United States Census Bureau’s MAF/TIGER (Master Address File/Topologically Integrated Geographic Encoding and Referencing) database [9]. See Appendix A for a more detailed description of the data set as well as an exploration of the relationship between area and perimeter, or fractal dimension, of these cities.

To measure sentiment (hereafter happiness) in these areas from the corpus of words collected, we use the Language Assessment by Mechanical Turk (LabMT) word list (available online in the supplementary material of [11]), assembled by combining the 5,000 most frequent words occurring in each of four text sources: Google Books (English), music lyrics, the New York Times and Twitter. A total of roughly 10,000 of these individual words have been scored by users of Amazon’s Mechanical Turk service on a scale of 1 (sad) to 9 (happy), resulting in a measure of average happiness for each given word [23]. For example, ‘rainbow’ is one of the happiest words in the list with a score of $h_{\text{avg}} = 8.1$, while ‘earthquake’ is one of the saddest, with $h_{\text{avg}} = 1.9$. Neutral words like ‘the’ or ‘thereof’ tend to score in the middle of the scale, with $h_{\text{avg}} = 4.98$ and $5$ respectively.

For a given text $T$ containing $N$ unique words, we calculate the average happiness $h_{\text{avg}}$ by

$$h_{\text{avg}}(T) = \frac{\sum_{i=1}^{N} h_{\text{avg}}(w_i)\, f_i}{\sum_{i=1}^{N} f_i} = \sum_{i=1}^{N} h_{\text{avg}}(w_i)\, p_i \qquad (1)$$

where $f_i$ is the frequency of the $i$th word $w_i$ in $T$ for which we have a happiness value $h_{\text{avg}}(w_i)$, and $p_i = f_i / \sum_{i=1}^{N} f_i$ is the normalized frequency of word $w_i$.

Importantly, with this method we make no attempt to take the context of words or the meaning of a text into account. While this may lead to difficulties in accurately determining the emotional content of small texts, we find that for sufficiently large texts this approach nonetheless gives reliable (if eventually improvable) results. An analogy is that of temperature: while the motion of a small number of particles cannot be expected to accurately characterize the temperature of a room, an average over a sufficiently large collection of such particles defines a durable quantity. Furthermore, by ignoring the context of words we gain both a computational advantage and a degree of impartiality; we do not need to decide a priori whether a given word has emotional content, thereby reducing the number of steps in the algorithm and hopefully reducing experimental bias.

Following Dodds et al. (2011), for the remainder of this paper, we remove all words $w_i$ for which the happiness score falls in the range $4 < h_{\text{avg}}(w_i) < 6$ when calculating $h_{\text{avg}}(T)$. Removal of these neutral or ‘stop’ words has been demonstrated to provide a suitable balance between sensitivity and robustness for our ‘hedonometer’ [11]. Further details on how we preprocessed the Twitter data set can be found in Appendix A.

We will correlate our happiness results with census data taken from the American Community Survey 1-year estimates for 2011, accessed online.

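The neutral-word exclusion described above can be sketched as follows. This is a minimal illustration under assumptions: the helper name `hedonometer`, its `lo`/`hi` parameters, and the toy word counts are hypothetical; only the $4 < h_{\text{avg}} < 6$ exclusion band and the four example scores come from the text.

```python
def hedonometer(freqs, scores, lo=4.0, hi=6.0):
    """Average happiness per Eq. (1), skipping words with lo < h_avg < hi."""
    kept = {
        w: f for w, f in freqs.items()
        if w in scores and not (lo < scores[w] < hi)
    }
    total = sum(kept.values())
    if total == 0:
        return None  # nothing left after removing neutral words
    return sum(scores[w] * f for w, f in kept.items()) / total


# Scores quoted in the text; the word counts are made up for illustration.
scores = {"rainbow": 8.1, "earthquake": 1.9, "the": 4.98, "thereof": 5.0}
freqs = {"the": 90, "thereof": 2, "rainbow": 6, "earthquake": 2}

with_neutral = hedonometer(freqs, scores, lo=5.0, hi=5.0)  # empty band: keep all words
without_neutral = hedonometer(freqs, scores)               # the 4 < h_avg < 6 band
print(round(with_neutral, 2), round(without_neutral, 2))   # 5.11 6.55
```

In this toy case the frequent neutral words pull the unfiltered score toward the middle of the scale; once they are removed, the score is determined by the emotionally charged words alone.
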
III. HAPPINESS ACROSS STATES AND URBAN AREAS

We first examine how happiness varies on a somewhat coarser scale than we will focus on for the majority of this paper, by plotting the average happiness of all states in the US in figure 1. To avoid the problem that some states have happier names than others (for example, Hawaii), we removed each state name from the calculation for $h_{\text{avg}}$.

We remark first that at such a coarse resolution there is little variation between states, which all lie within ±0.15 of the mean value for the entire United States of $h_{\text{avg}} = 6.01$. The happiest state is Hawaii with a score of $h_{\text{avg}} = 6.17$ and the saddest state is Louisiana with a score of $h_{\text{avg}} = 5.88$. Hawaii emerges as the happiest state due to an abundance of relatively happy words such as ‘beach’ and food-related words, but also because of the presence of the word ‘hi’. This is most likely because of an increased use of Hawaii’s state code ‘HI’ in geotagged tweets, and will somewhat bias the results. However, we chose not to remove this word from the data set because its use in place of ‘hello’ will contribute to the happiness score in other states, and the rich variety of happy words occurring in Hawaii paints a convincing picture of it as a happy state regardless of this small bias. A similar result showing greater happiness and a relative abundance of food-related words in tweets made by users who regularly travel large distances (as would be the case for many of the tweets emanating from Hawaii) has been reported in [18]. Louisiana is revealed as the saddest state primarily as a result of an abundance of profanity relative to the other states, in stark contrast with the findings of Oswald and Wu [25] that Louisiana exhibited the highest score on an alternate measure of life satisfaction.

FIG. 1: Choropleth showing average word happiness for geotagged tweets in all US states collected during the calendar year 2011. The happiest 5 states, in order, are: Hawaii, Maine, Nevada, Utah and Vermont. The saddest 5 states, in order, are: Louisiana, Mississippi, Maryland, Delaware and Georgia. Word shift plots describing how differences in word usage contribute to variation in happiness between states are presented in Appendix B (online).

We can further use this data on word frequencies to characterize similarities between states based on word usage. Figure 2 shows the linear correlation between word frequency vectors $\mathbf{f} = \{f_i,\ i = 1 : 50000\}$ for each pair of states, with red entries in the matrix indicating states with similar word use. We see some clusters which might be explained by geographical proximity, such as Vermont and New Hampshire or Louisiana and Mississippi, and some outliers such as the state of Nevada, which correlates the lowest on average with all other states. Additional details on this state-level dataset, including plots of raw number of tweets and number of tweets per head of population for each state, can be found in Appendix A. Word shift graphs showing which words contribute most to the variation in happiness across states can be found in Appendix B (online) [1].

We now change our resolution to a finer scale by focussing on cities rather than states. As an illustration of the resolution of the data set as well as our technique, we plot a tweet-generated map of a city, showing how average word happiness varies with location. In figure 3 we plot tweets collected from the New York City area during 2011. Each point represents an individual tweet, and is colored by the happiness $h_{\text{avg}}$ of the text $T$ consisting of the $N = 200$ closest LabMT words to the location of that tweet. We set a maximum threshold radius of $r = 500$ meters around each tweet location; if 200 LabMT words cannot be found within that radius then the point is colored black. Several features can immediately be discerned in this purely tweet-generated map. Firstly, the spatial resolution reveals the outline of Manhattan, as well as Central Park, individual streets and bridges, and even airport terminals such as those at JFK and Newark airports at the lower right and center left of the figure respectively. Secondly, we can discern
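
The coloring rule used for the New York City map (the 200 closest LabMT words within 500 meters of a tweet) can be sketched as below. This is only an illustration under stated assumptions: the haversine distance, the brute-force search, the function names, and the tuple layout of a scored word are choices made for this example, not the authors' implementation. Each scored word is assumed to carry the coordinates of the tweet it came from, and the sample points are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def local_happiness(lat, lon, scored_words, n=200, r=500.0):
    """Mean score of the n scored words closest to (lat, lon) within r meters.

    scored_words: iterable of (lat, lon, h_avg) tuples, one per word occurrence.
    Returns None (the point would be drawn black) if fewer than n words
    lie within the radius.
    """
    nearby = []
    for wlat, wlon, h in scored_words:
        d = haversine_m(lat, lon, wlat, wlon)
        if d <= r:
            nearby.append((d, h))
    if len(nearby) < n:
        return None
    nearby.sort()  # closest first
    return sum(h for _, h in nearby[:n]) / n


# Hypothetical scored words near Times Square: (lat, lon, h_avg).
words = [
    (40.7580, -73.9855, 8.1),
    (40.7590, -73.9855, 1.9),  # roughly 111 m north
    (40.7610, -73.9855, 7.0),  # roughly 334 m north
    (40.7680, -73.9855, 2.0),  # roughly 1.1 km north, outside r
]
print(local_happiness(40.7580, -73.9855, words, n=3))  # mean of the three nearby scores
print(local_happiness(40.7580, -73.9855, words, n=4))  # None: only three words within 500 m
```

Because every word occurrence contributes one entry, the plain mean over the selected entries is the same frequency-weighted average as Eq. (1). A spatial index would be needed at the scale of millions of tweets; the brute-force loop is kept here for clarity.
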
