The Lexicocalorimeter:
Gauging public health through caloric input and output on social media
Sharon E. Alajajian,1 Jake Ryland Williams,2 Andrew J. Reagan,1 Stephen C. Alajajian,3 Morgan R. Frank,4 Lewis Mitchell,5 Jacob Lahne,6 Christopher M. Danforth,1 and Peter Sheridan Dodds1
1 Department of Mathematics & Statistics, Vermont Complex Systems Center, Computational Story Lab, & the Vermont Advanced Computing Core, The University of Vermont, Burlington, VT 05401.
2 School of Information, University of California Berkeley, 102 South Hall #4600, Berkeley, CA 94720-4600.
3 Women, Infants, and Children, East Boston, MA 02128.
4 Media Lab, Massachusetts Institute of Technology, Cambridge, MA, 02139
5 School of Mathematical Sciences, North Terrace Campus, The University of Adelaide, SA 5005, Australia
6 Culinary Arts and Food Science, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104.
arXiv:1507.05098v4 [physics.soc-ph] 10 Jan 2017
(Dated: January 11, 2017)
We propose and develop a Lexicocalorimeter: an online, interactive instrument for measuring the
caloric content of social media and other large-scale texts. We do so by constructing extensive yet
improvable tables of food- and activity-related phrases, and assigning them sourced
estimates of caloric intake and expenditure, respectively. We show that for Twitter, our naive measures of caloric
input, caloric output, and the ratio of these measures are all strong correlates with health and
well-being measures for the contiguous United States. Our caloric balance measure in many cases
outperforms both its constituent quantities; is tunable to specific health and well-being measures
such as diabetes rates; has the capability of providing a real-time signal reflecting a population's
health; and has the potential to be used alongside traditional survey data in the development of
public policy and collective self-awareness. Because our Lexicocalorimeter is a linear superposition of
principled phrase scores, we also show we can move beyond correlations to explore what people talk
about in collective detail, and assist in the understanding and explanation of how population-scale
conditions vary, a capacity unavailable to black-box type methods.
salajajian@asc.upenn.edu, jake.williams@drexel.edu, andrew.reagan@uvm.edu, stephenalajajian@gmail.com, mrfrank@mit.edu, lewis.mitchell@adelaide.edu.au, jl3542@drexel.edu, chris.danforth@uvm.edu, peter.dodds@uvm.edu.

I. INTRODUCTION

Online instruments designed to measure social, psychological, and physical well-being at a population level are becoming essential for public policy purposes and public health monitoring [1, 2]. These data-centric gauges both empower the general public with information to allow comparisons of communities at all scales, and naturally complement the broad, established set of more readily measurable socioeconomic indicators such as wage growth, crime rates, and housing prices.

Overall well-being, or quality of life, depends on many factors and is complex to measure [3]. Existing techniques for estimating population well-being range from traditional surveys [1, 4] to estimates of smile-to-frown ratios captured automatically on camera in public spaces [5], and vary widely in the types of data they amass, collection methods, cost, time scales involved, and degree of intrusion. Partly in response to policy makers' desire for simple "one number" quantification of complex systems (arguably a general human proclivity), many measures are composite in nature. Two examples are (1) the Gallup Well-Being Index, which is based on factors such as life evaluation, emotional health, physical health, healthy behavior, work environment, and basic access to necessary resources [4]; and (2) the Living Conditions measure developed by the United States Census Bureau, which is derived from housing conditions, neighborhood conditions, basic needs met, a full set of appliances, and access to help if needed [6].

While such measures will always have their place, we venture that we must resist oversimplification. The dashboard of society should be just that: a rich set of incompatible instruments whose informational content may be observed individually and in total, not unlike the required input needed for flying a plane where knowledge of just a single number representing "things are going well" would be untenable. The construction of data-centric instruments for social systems that deliver more direct, interpretable measures is therefore of great importance as we move forward into the age of ubiquitous (but not complete) measurement.

With the explosive growth of online activity and social media around the world, the massive amount of real-time data created directly by populations of interest has become an increasingly attractive and fruitful source for analysis. Despite the limitation that social media users in the United States are not a random sample of the US population [7], there is a wealth of information in these data sets and uneven sampling can often be accommodated.
Indeed, online activity is now considered by many to be a promising data source for detecting health conditions [8, 9] and gathering public-health information [10, 11], and within the last decade, researchers have constructed a range of online public-health instruments with varying degrees of success. The maturing of these and related instruments along with theoretical models will ultimately fundamentally inform the limits of characterization and predictability of social systems.

In the next two subsections, we cover related research and then describe our approach to measuring the caloric content of text.

A. Previous work

For a general overview of work relevant to our present effort, we briefly summarize related research concerning public health and well-being in connection with a range of social media and online data sets.

In the difficult realm of predicting pandemics [12], Google Flu Trends [13] enjoyed early success and acclaim. Initially based very simply on search terms, the instrument proved unsurprisingly to be imperfect and in need of a more sophisticated approach [14].

In work by several of the current authors and colleagues, Mitchell et al. measured the happiness of tweets across the US and found strong correlations with other indices of well-being at city and state level, such as the Gallup Well-being Index; the Peace Index; the America's Health Ranking composite index of Behavior, Community and Environment, Policy and Clinical Care metrics; and gun violence (negative correlation) [15]. Using the same instrument in 10 languages, the Hedonometer, we have also shown that the emotional content of tweets tracks major world events [2, 16].

Paul and Dredze found that states with higher obesity rates have more tweets about obesity, and states with higher smoking rates have more tweets about cancer [11]. They also found a negative correlation between exercise and frequency of tweeting about ailments, suggesting Twitter users are less likely to become sick in states where people exercise. They further found health care coverage rates to be negatively correlated with likelihood of posting tweets about diseases.

Chunara et al. recently found that activity-related interests on Facebook are negatively correlated with being overweight and obese, while interest in television is positively correlated with the same [17].

In an analysis of online recipe queries, West et al. found that the number of patients admitted to the emergency room of a major urban hospital in Washington, DC for congestive heart failure (CHF) each month was significantly correlated with average sodium per recipe searched for on the Web in the same month [18].

Eichstaedt and colleagues [19] have demonstrated that psychological language on Twitter outperforms certain composite socioeconomic indices in predicting heart disease at the county level. They were able to show in particular that the expression of negative emotions such as anger on Twitter could be taken as a kind of risk factor at the population scale.

On a US county level, Culotta [20] found that Twitter activity provided a more fine-grained representation of community health than demographics alone, through the prevalence of particular words that indicate, for example, television habits or negative engagement.

Finally, in work directly related to our present study, Abbar et al. [21] have recently performed a similar analysis of translating food terms used on Twitter into calories. They found a correlation between Twitter calories and obesity and diabetes rates for the US, and explored how food-themed interactions over social networks vary with connectedness, finding suggestions of social contagion. While our approach and results are largely sympathetic, our work incorporates estimates of physical activity which we will show provides essential extra information regarding health; introduces a phrase extraction method we call "serial partitioning"; and leads to an online implementation, paving the way for a real-time instrument as part of our proposed "panometer". We also note that we carried out our work concurrently and independently.

B. Lexicocalometrics

From the preceding list of studies, it has become clear that we can estimate population-scale levels of health and well-being through social media. Here, we examine the words and phrases people post publicly about food and physical activity on Twitter on a statewide level for the contiguous United States (48 states along with the District of Columbia). As we explain fully below in Sec. II A and Methods and Materials, Sec. IV, we group categorically similar words and phrases into lemmas, and we then assign caloric values to these lemmas using the terms and notation "caloric input" for food, Cin, and "caloric output" for activity, Cout. We define the ratio of caloric output to caloric input to be a third quantity, "caloric ratio":

\[ C_{\text{rat}} = \frac{C_{\text{out}}}{C_{\text{in}}}. \tag{1} \]

While we will focus largely on the three quantities Cin, Cout, and Crat, we will also explore "caloric difference", an alternate combination of Cin and Cout involving a single parameter α:

\[ C_{\text{diff}}(\alpha) = \alpha\, C_{\text{out}} - (1 - \alpha)\, C_{\text{in}}, \tag{2} \]

where 0 ≤ α ≤ 1. We use phrase shifts [2] to show how specific lemmas (e.g., "apples", "cake with frosting", "white water rafting", "knitting", and "watching tv or movie") contribute to the caloric texture of states across the contiguous US. We then correlate all three values with 37 measures relating to health and well-being,
and we find statistically strong correlations with quantities such as high blood pressure, inactivity, diabetes levels, and obesity rates. For ease of language, we will generally speak of phrases rather than lemmas.

We have also generated an accompanying online, interactive instrument for exploring health patterns through the lens of Twitter calories: the Lexicocalorimeter. An initial, fixed version of the instrument may be accessed at this paper's Online Appendices, http://compstorylab.org/share/papers/alajajian2015a/, with an evolvable, production version housed within our larger measurement platform http://panometer.org at http://panometer.org/instruments/lexicocalorimeter (all code for these sites can be found at https://github.com/andyreagan/lexicocalorimeter-appendix). We note that while our online instrument is based on Twitter, it may in principle be used on any sufficiently large text source, social media or otherwise, such as Facebook.

From this point, we structure the core of our paper as follows. In Sec. II, we establish and discuss our findings in depth. Specifically, we: (1) Outline our text analysis of a Twitter corpus from 2011–2012 (Sec. II A), reserving full details for Methods and Materials in Sec. IV; (2) Present caloric maps of the contiguous US contrasting the 48 states and DC through histograms and phrase shifts (Sec. II B); and (3) Examine how Cin, Cout, Crat, and Cdiff(α) correlate with a suite of measures relating to health and well-being. In the Supporting Information, we provide a sample of confirmatory figures as well as all shareable data sets (e.g., IDs for all tweets). We offer concluding thoughts in Sec. III.

II. ANALYSIS AND RESULTS

A. Estimating calories from phrases

We used all available geotagged tweets from 2011 and 2012 (around 50 million) from a bounding box of the contiguous US, using Twitter's garden hose sample (which is a sample of approximately 10% of all tweets, including those that are not geotagged) and the geotag feature to determine from which of the 48 continental states and the District of Columbia each tweet came. From this sample, we counted the total number of times each food and physical activity phrase in our database was tweeted about in each of the 48 continental states and the District of Columbia (see Sec. IV and Dataset S1 at https://dx.doi.org/10.6084/m9.figshare.4530965.v1 for all tweet IDs). We then used these counts to determine the average caloric input Cin from food phrase tweets and the average caloric output Cout from physical activity phrase tweets as follows.

First, we equate each food phrase s with the calories per 100 grams of that food, using the notation Cin(s). (We also explored serving sizes but the databases available proved far from complete.) We then compute the caloric input for a given text T as:

\[ C_{\text{in}}(T) = \frac{\sum_{s \in S_{\text{in}}} C_{\text{in}}(s)\, f(s|T)}{\sum_{s \in S_{\text{in}}} f(s|T)} = \sum_{s \in S_{\text{in}}} C_{\text{in}}(s)\, p(s|T), \tag{3} \]

where f(s|T) is the frequency of phrase s in text T, p(s|T) is the normalized version, and Sin is the set of all food phrases in our database.

Second, for each tweeted physical activity phrase, we use an estimate of the Metabolic Equivalent of Tasks, or METs, which we then converted to calories expended per hour, assuming a weight of 80.7 kilograms, the average weight of a North American adult [22]. Analogous to Cin(T) above, we then have

\[ C_{\text{out}}(T) = \sum_{s \in S_{\text{out}}} C_{\text{out}}(s)\, p(s|T), \tag{4} \]

where now Sout is the set of all phrases in our activity database.

We emphasize that both our food and exercise phrase data sets and Twitter databases are necessarily incomplete in nature. The values of Cin and Cout are thus not meaningful as absolute numbers but rather have power for comparisons. We also acknowledge that our equivalences are crude (e.g., each mention of a specific food is naively turned into the calories associated with 100 grams of that food), and later on we address our choices in more depth. Nevertheless, our method is pragmatic yet, as we will show, effective, and offers clear directions for future improvement.

For simplicity and ultimately because the results are sufficiently strong, we did not filter tweets beyond their geographic location. Tweets may thus come from individuals, restaurants, sports stores, resorts, news outlets, marketers, fitness apps, tourists, and so on, and further improvements and refinements may be achieved by appropriately constraining the Twitter corpus.

Finally, we take the ratio of Cout(T) to Cin(T) to obtain the text's caloric ratio Crat(T). In general, we observe that a higher value of Crat(T) at the population scale would appear to be intuitively "better", up to some limit indicating negative energy balance. We note that Crat = 1 is not salient and should not be taken to mean a population is "balanced" calorically. As we discuss later, using the difference, what we call "Caloric Difference", a generalization of Cout − Cin, generates similar results but, from a framing perspective, we have reservations in creating a scale with a 0 point given the approximate nature of our measures.

B. Caloric maps of the contiguous US

We now move to our central analysis and exploration of how our lexicocalorimetric measure varies geographically. We start with visual representations and then continue on to more detailed comparisons.
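For concreteness, the estimates of Sec. II A, Eqs. (1)-(4), can be sketched in a few lines of Python. This is an illustrative sketch only and not the code behind the online instrument: the function names, phrase tables, calorie and MET values, and counts below are hypothetical placeholders, and the MET conversion assumes the common approximation that 1 MET corresponds to roughly 1 kcal per kilogram of body weight per hour.

```python
from typing import Dict

BODY_WEIGHT_KG = 80.7  # adult body weight assumed in Sec. II A


def met_to_kcal_per_hour(met: float, weight_kg: float = BODY_WEIGHT_KG) -> float:
    """Assumption: 1 MET is roughly 1 kcal per kg of body weight per hour."""
    return met * weight_kg


def average_calories(calories: Dict[str, float], counts: Dict[str, int]) -> float:
    """Frequency-weighted mean of per-phrase calories, as in Eqs. (3) and (4).

    Only phrases present in the calorie table contribute to the average.
    """
    total = sum(counts.get(s, 0) for s in calories)
    if total == 0:
        raise ValueError("no database phrases occur in the text")
    return sum(c * counts.get(s, 0) for s, c in calories.items()) / total


def caloric_ratio(c_in: float, c_out: float) -> float:
    """Caloric ratio of Eq. (1)."""
    return c_out / c_in


def caloric_difference(c_in: float, c_out: float, alpha: float) -> float:
    """Caloric difference of Eq. (2), with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * c_out - (1.0 - alpha) * c_in


# Hypothetical phrase tables (illustrative values, not database entries).
food_kcal_per_100g = {"pizza": 266.0, "apples": 52.0, "bacon": 541.0}
activity_mets = {"running": 8.0, "watching tv or movie": 1.0}
activity_kcal_per_hour = {
    s: met_to_kcal_per_hour(m) for s, m in activity_mets.items()
}

# Hypothetical phrase counts for one text T (e.g., one state's tweets).
food_counts = {"pizza": 6, "apples": 3, "bacon": 1}
activity_counts = {"running": 1, "watching tv or movie": 3}

c_in = average_calories(food_kcal_per_100g, food_counts)
c_out = average_calories(activity_kcal_per_hour, activity_counts)
c_rat = caloric_ratio(c_in, c_out)
c_diff = caloric_difference(c_in, c_out, alpha=0.5)
print(c_in, c_out, c_rat, c_diff)
```

Because the absolute values depend on the 100 gram and one hour conventions, such numbers are only useful for comparing texts with one another, as emphasized in Sec. II A.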
[FIG. 1 graphic: two choropleth maps of the contiguous US with the most dominant phrase overlaid on each state. Panel A, "Calories in": deviation from national food-caloric average. Panel B, "Calories out": deviation from national activity-caloric average.]
FIG. 1. Choropleth maps indicating (A) caloric input Cin and (B) caloric output Cout in the contiguous United States (including the District of Columbia) based on 50 million geotagged tweets taken from 2011–2012. For both maps, darker means higher values as per the color bars on the right. The histograms in Figs. 5, S2, and S3 show the specific rankings according to these two variables and also Crat (see Fig. 3). The overlaid phrase lemmas are the most dominant contributors to Cin and Cout, almost universally "pizza" and "watching tv or movie".

In Fig. 1, we show two choropleth maps of our overall 2011–2012 measures of Twitter's caloric input Cin and caloric output Cout. For both maps and those that follow, quantities increase as colors move from light to dark green.

These maps immediately allow for some basic observations which we will delve into and harden up as our analysis proceeds. For the food calories map, we see Cin
[FIG. 2 graphic: two choropleth maps of the contiguous US with the most distinguishing phrase overlaid on each state. Panel A, "Calories in": deviation from national food-caloric average. Panel B, "Calories out": deviation from national activity-caloric average.]
FIG. 2. The same choropleth maps for Cin and Cout presented in Fig. 1 but now with phrases whose increased usage contributes the most to a population's Cin and Cout differing from the overall averages of these measures. See Sec. II D. For example, tweets from Vermont, which was above average for both Cin and Cout for 2011–2012, disproportionately contain "bacon" and "skiing". Michigan was above average for Cin and below for Cout in 2011–2012, and the most distinguishing phrases are "chocolate candy" and "laying down". See Figs. 5, S2, and S3 for ordered rankings.
is generally largest in the Midwest and the south while Colorado and Maine stand out as states with the lowest calories.

We see a different texture in the activity calories map with the highest caloric output according to our measure appearing in the three-state block of Wyoming, Colorado, and Utah, as well as Vermont. Tweet-based caloric output drops to a low in Mississippi and the surrounding states, while Michigan also appears to have a low value of Cout.

For the food and activities maps in Fig. 1, we also show the most dominant phrase for each population's Cin and Cout scores. Almost uniformly, "pizza" (high calorie food) and "watching tv or movie" (low calorie activity) are the lemmas with the largest contributions, a function of both volume and caloric scores. Only Mississippi ("ice cream") and Wyoming ("cookies") are exceptions, though pizza is still near the top for both.

In Fig. 2, we present the same choropleth maps from Fig. 1, but now with the phrase most distinguishing a population. Specifically, we show phrases whose increased prevalence most contributes to moving a population's Twitter calorie scores away from the overall average for the contiguous US. For example, if a population's Cin is above average, we find the food phrase whose frequency coupled with its caloric content most strongly moves the population's Cin up from the average. (We explain in full how we determine these phrases later with phrase shifts in Sec. II D.) We now see a diverse spread of terms. We find a number of phrases make for reasonable representations:

• "lobster" in Maine and Massachusetts;
• "grits" in Georgia;
• "skiing" in Vermont, New Hampshire, and Utah;
• and "running" in Colorado and a number of other locations.

Prototypical "unhealthy" foods rise to the top in various states:

• "donuts" in Texas;
• "cake" in Mississippi;
• "chocolate candy" in Louisiana;
• and "cookies" in Indiana.

By contrast, a few virtuous foodstuffs appear such as "green beans" in Oregon and "tomato" in California.

Our activity list also includes some rather low intensity ones and we see:

• "eating" rising to the top in Texas, the south, and a number of other states;
• "watching tv or movie" in Pennsylvania and elsewhere;
• "sitting" in Tennessee;
• "talking on the phone" in Delaware;
• "getting my nails done" in New Jersey;
• and simply "lying down" in Michigan.

[FIG. 3 graphic: choropleth map of the contiguous US. Color bar: deviation from national average caloric ratio.]

FIG. 3. Choropleth for caloric ratio Crat = Cout/Cin. See Figs. 5, S2, and S3 for ordered rankings.

Now, we do not pretend that these phrases all come from individuals diligently recording their present meals or activities. Apart from tweets from individuals, our database contains tweets from companies, advertisers, resorts, and so on. And some phrases are problematic in their generality of meaning, most especially "running" (the word "run" currently has the most meanings in the Oxford English Dictionary). Nevertheless, as we dig deeper into all the phrases found for a particular state, we will continue to find commonsensical lexical patterns.

In Fig. 3, we show a choropleth map for caloric ratio, Crat. We see that the highest values of Crat are found in Colorado, Wyoming, and Vermont, and secondarily for Maine, Minnesota, Oregon, and Utah. Low values of Crat appear in the region comprising Mississippi, Louisiana, Alabama, and Arkansas, as well as West Virginia.

An initial visual comparison of Figs. 1 and 3 suggests that Cout is better aligned with Crat than is Cin. The reason is that for the present version of the Lexicocalorimeter, Cout has a larger dynamic range than Cin, roughly 160 to 210 versus 250 to 285, giving ratios of 210/160 ≃ 1.31 and 285/250 = 1.14. We could assert that Cin is fundamentally less informative but:

1. In Sec. II E, we will find that some measures relating to health and well-being correlate more strongly with Cin and some with Cout;
2. We may adjust the dynamic range of either measure by rescaling, introducing a kind of tunability [2] to the instrument (a feature we will reserve for future iterations); and
3. Because our food phrase database is a factor of 10 smaller than our activity phrase one, revisions of our instrument may elevate the power of Cin.

To provide some support for point 1, we compare Cout and Cin in Fig. 4 (see also Fig. S1). Importantly, we see that the two measures are indeed not well correlated, indicating they contain different kinds of information (Pearson correlation coefficient ρ̂ ≃ −0.13, p-value
= 0.39). This demonstrates why we might expect Cin or Cout to separately correlate more strongly with other population-level measures, and justifies forming a dashboard using both Cin and Cout as well as the composite measure of Crat.

[FIG. 4 graphic: scatter plot of Cout (vertical axis, roughly 160 to 210) against Cin (horizontal axis, roughly 255 to 285), with one labeled point per state and DC. Annotations: ρ̂ = −0.13, p-value = 0.39, m = −1.64.]

FIG. 4. Plots for the contiguous US showing the lack of correlation between caloric input Cin and caloric output Cout, demonstrating their separate value as they bear different kinds of information. The Pearson correlation coefficient ρ̂ is −0.13 and the best line of fit slope is m = −1.64. Fig. S1 adds plots of Crat as a function of Cin and Cout.

Regarding point 2 above, we have evidently made a number of choices in computing Cin and Cout that mean we have already introduced an arbitrary tuning of the ratio Crat (e.g., assuming 100 grams of a food and an hour's worth of activity). Having no principled way of rescaling (i.e., one that is not a function of the data set being studied), we have chosen to leave the measures as computed. As we discuss later, in future iterations we envisage for the Caloric Difference version that introducing tunability of the dynamic ranges of Cin and Cout (altering the bias of the measure toward food or activity) will allow the Lexicocalorimeter to be refined for a range of purposes such as estimating correlates of diabetes levels versus cancer rates (see Sec. II E).

C. Rankings for the contiguous US

Having taken in the maps of our three measures Cin, Cout, and Crat, we now explore the rankings quantitatively, first through the histograms shown in Fig. 5. We order the 48 states and DC by Crat (rightmost plot) and all bars are relative to the overall average of the specific measure. Numeric rankings for each measure are given next to each bar. In Figs. S2 and S3, we present the same histograms re-sorted respectively by Cin and Cout.

As was indicated by our inspection of the choropleth maps, we do indeed see that Crat is more strongly driven by Cout than Cin due to the former's larger dynamic range. The states with the highest values of Crat achieve their scores through high levels of Cout but more variable levels of Cin. Wyoming (23), Vermont (21), and Utah (25) are all middling in Cin while Colorado (48) and Maine (49) have the lowest ranks for caloric intake. At the trailing end, we see by contrast that low activity ranks are coupled with high ranks for caloric intake.

A few of the more anomalous states are evident both in the Cin and Cout histograms and as those appearing farthest away from the best line of fit in the scatter plot of Fig. 4. South Dakota has high values of both Cin and Cout (ranks of 1 and 7) that combine to give it a ranking of 25 for Crat. Maryland, ranking 42nd and 45th in Cin and Cout, is the only state in the bottom 10 of both measures.

D. Phrase shifts

In our work on measuring happiness, we have developed and extensively used word shifts to show which words make a given text appear more positive than another text in aggregate (see [2] and [16]). Such visualizations not only provide our necessary test, but also allow us to draw insight from the lexical tapestry of texts. Here, we will explain and use analogously constructed phrase shifts for both Cin and Cout to examine the states at the extremes of our Crat rankings, Colorado and Mississippi. Interactive food and activity phrase shifts for the 49 regions of the contiguous US form a central part of our online Lexicocalorimeter: http://panometer.org/instruments/lexicocalorimeter.

We start with two texts: a base reference text Tref, and a comparison text Tcomp which we wish to compare to Tref. In this paper, we will use the Contiguous US as the reference text (weighting the phrase distributions of each state equally), but in principle any text can be used (e.g., in comparing two states, one would be selected as a reference). Our interest is in determining which words or phrases most contribute to or go against the difference in estimated calories, Ci/o(Tcomp) − Ci/o(Tref), where i/o stands for in or out. Following [2] and using Eq. (3), we can express the difference as

\[
\begin{aligned}
C_{i/o}(T_{\text{comp}}) - C_{i/o}(T_{\text{ref}})
&= \sum_{s \in S_{i/o}} C_{i/o}(s) \left[ p(s|T_{\text{comp}}) - p(s|T_{\text{ref}}) \right] \\
&= \sum_{s \in S_{i/o}} \left[ C_{i/o}(s) - C_{i/o}^{(\text{ref})} \right] \left[ p(s|T_{\text{comp}}) - p(s|T_{\text{ref}}) \right].
\end{aligned}
\tag{5}
\]
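The decomposition in Eq. (5) can be checked numerically. The second equality holds because both normalized frequency distributions sum to one, so subtracting the constant reference average from every phrase's calories leaves the total unchanged. The sketch below is illustrative only: the calorie values and counts are hypothetical, and the function names are introduced here purely for the example. It computes the per-phrase terms of Eq. (5) and orders them by magnitude.

```python
from typing import Dict, List, Tuple


def normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw phrase counts f(s|T) into normalized frequencies p(s|T)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no database phrases occur in the text")
    return {s: n / total for s, n in counts.items()}


def phrase_contributions(
    calories: Dict[str, float],
    counts_ref: Dict[str, int],
    counts_comp: Dict[str, int],
) -> List[Tuple[str, float]]:
    """Per-phrase terms of Eq. (5), sorted by decreasing absolute value."""
    p_ref = normalize({s: counts_ref.get(s, 0) for s in calories})
    p_comp = normalize({s: counts_comp.get(s, 0) for s in calories})
    # Average calories of the reference text, as in Eq. (3).
    c_ref = sum(calories[s] * p_ref[s] for s in calories)
    terms = [
        (s, (calories[s] - c_ref) * (p_comp[s] - p_ref[s])) for s in calories
    ]
    return sorted(terms, key=lambda t: abs(t[1]), reverse=True)


# Hypothetical calorie table and counts (not taken from the database).
calories = {"pizza": 266.0, "apples": 52.0, "bacon": 541.0}
counts_ref = {"pizza": 5, "apples": 4, "bacon": 1}
counts_comp = {"pizza": 4, "apples": 3, "bacon": 3}

shift = phrase_contributions(calories, counts_ref, counts_comp)
for phrase, term in shift:
    print(f"{phrase:>8s} {term:+.2f}")
print("sum of terms:", sum(term for _, term in shift))
```

In this toy case the comparison text mentions "bacon" more often than the reference, and since bacon's calories exceed the reference average, that phrase supplies the largest positive term.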
[FIG. 5 graphic: three bar charts (Food, Activity, Ratio) of deviations from national averages. The rank annotations beside the bars are tabulated below.]

State                  Cin rank  Cout rank  Crat rank
Colorado               48        2          1
Wyoming                23        1          2
Vermont                21        3          3
Utah                   25        4          4
Maine                  49        14         5
Minnesota              44        8          6
Oregon                 37        6          7
New Hampshire          47        15         8
Montana                3         5          9
New York               36        11         10
Washington             33        13         11
California             29        10         12
Wisconsin              45        21         13
Arizona                38        18         14
Nebraska               17        12         15
Nevada                 43        25         16
District of Columbia   46        29         17
Idaho                  30        19         18
Florida                41        26         19
Massachusetts          39        24         20
Iowa                   7         9          21
New Mexico             18        17         22
Rhode Island           27        23         23
Missouri               40        28         24
South Dakota           1         7          25
Oklahoma               8         20         26
North Dakota           4         16         27
Illinois               34        31         28
Virginia               32        30         29
Kansas                 10        22         30
Connecticut            26        33         31
Pennsylvania           24        34         32
Tennessee              28        38         33
Indiana                9         27         34
New Jersey             20        36         35
Texas                  13        35         36
South Carolina         22        40         37
North Carolina         19        39         38
Maryland               42        45         39
Georgia                35        43         40
Ohio                   5         32         41
Michigan               16        42         42
Kentucky               6         37         43
Delaware               31        47         44
Arkansas               15        46         45
West Virginia          2         41         46
Alabama                11        44         47
Louisiana              14        48         48
Mississippi            12        49         49
FIG. 5. Histograms of caloric intake Cin (food), caloric output Cout (activity), and caloric ratio Crat for the states of the
contiguous US, all ranked by decreasing Crat . Bars indicate the difference in the three quantities from the overall average with
colors corresponding to those used in Figs. 1, 2, and 3. We provide the same set of histograms re-sorted by Cin and Cout in
Figs. S2 and S3.
We now have a sum of contributions due to all phrases. We normalize these contributions as percentages and annotate their structure as follows:

\[
\delta C_{i/o}(s) = \frac{100}{C_{i/o}^{(\text{comp})} - C_{i/o}^{(\text{ref})}}
\underbrace{\left[ C_{i/o}(s) - C_{i/o}^{(\text{ref})} \right]}_{+/-}
\underbrace{\left[ p_s^{(\text{comp})} - p_s^{(\text{ref})} \right]}_{\uparrow/\downarrow},
\tag{6}
\]

where \(\sum_{s \in S_{i/o}} \delta C_{i/o}(s) = 100\). We use the symbols +/− and ↑/↓ to respectively encode whether the calories of a phrase exceed the average of the reference text, and whether a phrase is being used more or less in the comparison text. We call δCi/o(s) the per food/activity phrase caloric expenditure shift. Finally, we sort phrases by the absolute value of δCi/o(s) to create each phrase shift.

In Fig. 6, we present food phrase shifts which help to illustrate why:

• Colorado ranks 48/49 for caloric input Cin (Fig. 6A),
• Mississippi ranks 12/49 for caloric input Cin (Fig. 6B),
• Colorado ranks 2/49 for caloric output Cout (Fig. 6C),
• and Mississippi ranks 49/49 for caloric output Cout (Fig. 6D).

These shifts display phrases that fall into four categories:

• +↑, yellow: Phrases representing above average quantities (here calories) being used more
[Figure 6: four phrase-shift panels. Food panels (A, B): Average US calories = 267.25; horizontal axis: per food phrase caloric shift. Activity panels (C, D): Average US caloric expenditure = 176.60; horizontal axis: per activity phrase caloric expenditure shift.]

A. Colorado, food. "Why Colorado consumes less calories on average": Colorado calories = 256.58 (Rank 2 out of 49). Phrases by food rank: 1. noodles-, 2. chocolate candy+, 3. bacon+, 4. cake+, 5. cookies+, 6. chicken-, 7. olive oil+, 8. pasta-, 9. shrimp-, 10. apples-, 11. cucumber-, 12. egg-, 13. crab-, 14. tomato-, 15. ice cream-, 16. peaches-, 17. turkey-, 18. pineapple-, 19. onion-, 20. cabbage-, 21. pear-, 22. donuts+, 23. almonds+, 24. pistachios+, 25. cheese+, 26. grapes-.

B. Mississippi, food. "Why Mississippi consumes more calories on average": Mississippi calories = 271.37 (Rank 34 out of 49). Phrases by food rank: 1. cake+, 2. cookies+, 3. shrimp-, 4. pineapple-, 5. pasta-, 6. banana-, 7. catfish-, 8. mashed potatoes-, 9. grits-, 10. chicken-, 11. sausage+, 12. crab-, 13. olive oil+, 14. peaches-, 15. bacon+, 16. apples-, 17. cabbage-, 18. mango-, 19. sweet potato-, 20. onion-, 21. mayonnaise+, 22. banana pudding-, 23. pear-, 24. turkey-, 25. ice cream-, 26. broccoli-.

C. Colorado, activity. "Why Colorado expends more calories on average": Colorado caloric expenditure = 203.48 (Rank 2 out of 49). Phrases by activity rank: 1. running+, 2. skiing+, 3. hiking+, 4. snowboarding+, 5. biking+, 6. eating-, 7. mountain biking+, 8. laying down-, 9. white water rafting+, 10. rock climbing+, 11. watching tv/movies-, 12. talking on phone-, 13. sledding+, 14. ice skating+, 15. reading-, 16. playing video games-, 17. walking+, 18. showering-, 19. jazzercise+, 20. using treadmill+, 21. scuba diving+, 22. getting my hair done-, 23. bowling+, 24. mountain climbing+, 25. cooking+, 26. writing-.

D. Mississippi, activity. "Why Mississippi expends fewer calories on average": Mississippi caloric expenditure = 161.26 (Rank 49 out of 49). Phrases by activity rank: 1. running+, 2. dancing+, 3. eating-, 4. cooking+, 5. watching tv/movies-, 6. laying down-, 7. walking+, 8. biking+, 9. ice skating+, 10. using treadmill+, 11. swimming+, 12. hiking+, 13. attending church-, 14. talking on phone-, 15. sitting-, 16. getting my hair done-, 17. boxing+, 18. bowling+, 19. playing football+, 20. golfing+, 21. sledding+, 22. getting my nails done-, 23. cleaning+, 24. skiing+, 25. snowboarding+, 26. reading-.
FIG. 6. Phrase shifts showing which food phrases and physical activity phrases have the most influence on Colorado and
Mississippi's top and bottom rankings for caloric ratio, when compared with the average for the contiguous United States.
Note that phrases are lemmas representing phrase categories. Overall, Colorado scores lower on Twitter food calories (257.4
versus 271.7) and higher on physical activity calories (203.5 versus 161.3) than Mississippi. We provide interactive phrase
shifts as part of the paper's Online Appendices at http://compstorylab.org/share/papers/alajajian2015a/ and at
http://panometer.org/instruments/lexicocalorimeter. We explain phrase (word) shifts in the main text (see Eqs. 5 and 6), and
in full depth in [2] and [16] and online at http://hedonometer.org [23].
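
As a rough illustration of how the bars in a phrase shift are computed, the following Python sketch evaluates the normalized per-phrase shift of Eq. 6 for a toy example. It assumes that the caloric score of a text is the frequency-weighted average of its phrase calories; the function names, phrase counts, and calorie values are hypothetical and are not taken from our data set.

```python
def normalized_freqs(counts):
    """Convert raw phrase counts into normalized frequencies p_s."""
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}


def average_calories(freqs, calories):
    """Frequency-weighted average calories of a text (assumed score)."""
    return sum(p * calories[s] for s, p in freqs.items())


def phrase_shift(ref_counts, comp_counts, calories):
    """Return (phrase, percent shift, label) sorted by absolute shift.

    Assumes the two texts have different average calories; otherwise
    the normalization in Eq. 6 is undefined (division by zero).
    """
    p_ref = normalized_freqs(ref_counts)
    p_comp = normalized_freqs(comp_counts)
    c_ref = average_calories(p_ref, calories)
    c_comp = average_calories(p_comp, calories)
    scale = 100.0 / abs(c_comp - c_ref)
    shifts = []
    for s in set(p_ref) | set(p_comp):
        dp = p_comp.get(s, 0.0) - p_ref.get(s, 0.0)
        delta = scale * (calories[s] - c_ref) * dp
        # "+"/"-": phrase is above/below the reference average;
        # up/down arrow: used more/less in the comparison text
        # (a phrase with unchanged frequency is labeled with a down arrow).
        label = ("+" if calories[s] > c_ref else "-") + ("↑" if dp > 0 else "↓")
        shifts.append((s, delta, label))
    return sorted(shifts, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical calories per phrase and phrase counts for two texts.
calories = {"cake": 400.0, "noodles": 140.0, "bacon": 540.0}
ref_counts = {"cake": 10, "noodles": 10, "bacon": 10}
comp_counts = {"cake": 5, "noodles": 20, "bacon": 5}

shifts = phrase_shift(ref_counts, comp_counts, calories)
for phrase, delta, label in shifts:
    print(f"{phrase}{label}: {delta:.2f}")
```

In this toy example the comparison text scores below the reference, so the shifts sum to -100, and "noodles" (a below-average phrase used more often) makes the largest contribution.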
often. Examples: "cookies" for Mississippi in Fig. 6B and "rock climbing" for Colorado in Fig. 6C.

-↓, pale blue: Phrases representing below average quantities being used less often. Examples: "watching tv or movie" for Mississippi in Fig. 6D and "laying down" for Colorado in Fig. 6C.

+↓, pale yellow: Phrases representing above average quantities being used less often. Examples: "chocolate candy" for Colorado in Fig. 6A and "running" for Mississippi in Fig. 6D.

-↑, blue: Phrases representing below average quantities being used more often. Examples: "reading" for Colorado in Fig. 6C and "catfish" for Mississippi in Fig. 6B.

Note that depending on the quantity, higher or lower may be better and the four categories flip signs in their support. For example, Cin and Cout increase with +↑ phrases; after we examine correlations with health and well-being measures in Sec. II E, we will be able to interpret this as bad for Cin and good for Cout.

At the top of each phrase shift, the bars indicate the total contribution of each of the four types of phrases, and the black bar the net change. We see that the four net changes arise in different ways.

Fig. 6A: Colorado is lower than average for Cin largely due to tweeting more about relatively low calorie (per 100 grams) foods: "noodles", "egg", "pasta", and "turkey". We also find fewer tweets about high calorie foods such as "candy", "cake", and "cookies". Going against these phrases, we see Colorado does tweet relatively more about "bacon" and "olive oil", and less about some relatively lower calorie foods: "chicken", "ice cream", "shrimp", and "corn". We note that this does not mean these foods are low calorie in absolute terms ("ice cream" is a good example), just that 100 grams of them are low calorie in comparison to the US baseline.

Fig. 6B: Mississippi almost equally tweets less about a variety of low calorie foods, e.g., "pasta", "banana", and "crab" (pale blue bar) while also tweeting more about the complementary range of such foods including "shrimp", "peaches", and "pineapple" (dark blue bar). The modest net gain is mostly due to a small increase in tweeting about high calorie foods such as "cake", "cookies", and "sausage".

Fig. 6C: For physical activity, tweets from Colorado show a preponderance of relatively high caloric expenditure phrases (+↑, yellow) including "running", "skiing", "hiking", "snowboarding" and so on. Tweeting less about low effort activities is the only other contribution of any substance: Colorado tweets less about "eating", "laying down", and "watching tv or movie".

Fig. 6D: Mississippi's low ranking in activity is largely due to tweeting less about high output activities (+↓, pale yellow): less "running", "dancing", "walking", and "biking". The second most important category is an increase in low output activity phrases such as "eating", "attending church", and "talking on the phone".

In Figs. S4, S5, S6, and S7, we complement the four phrase shifts of Fig. 6 by showing the top 23 phrases for each of four ways phrases may contribute. Interactive phrase shifts for all of the contiguous US are housed at http://panometer.org/instruments/lexicocalorimeter.

Overall, we find the lexical texture afforded by our phrase shifts is generally convincing, but we expect future improvements in our food and activity data sets will iron out some oddities (we again use the example of "ice cream"). We also note that phrase shifts are very sensitive and that terms that seem to be evaluated incorrectly may easily be removed from the phrase set, and that doing so will minimally change the overall score for sufficiently large texts.

E. Correlations with other health and well-being measures

We now turn to a suite of statistical comparisons between our three measures (caloric input, caloric output, and caloric ratio) and a collection of demographic, behavioral, health, and psychological quantities.

We use Spearman's correlation coefficient $\hat{\rho}_s$ to examine relationships between Cin, Cout, and Crat and 37 variables variously relating to food and physical activity, Big Five personality traits, and health and well-being rankings (a total of 111 comparisons) [4, 6, 24–33]. To correct for multiple comparisons, we calculate the q-value for each correlation coefficient using the Benjamini-Hochberg step-up procedure [34] (the q-value is to be interpreted in the same way as a p-value). We then consider correlations in reference to the standard significance levels of 0.01 and 0.05.

We must first acknowledge that many of the variables we test against our measures are highly correlated with each other. The food and physical activity-related variables are in the areas of physical activity levels, produce intake and availability rates (including trends in public schools), chronic disease rates, and rates of unhealthy habits. Many of these variables are well known to be influenced by diet and physical activity (e.g., obesity rates [25]), and others may be less directly related (e.g., percent of cropland in each state harvested for fruits and vegetables [28]).

To give some grounding for the full set of comparisons, we show in Fig. 7 how six demographic quantities
[Figure 7: six scatter plots of state-level quantities (vertical axes) against Caloric Ratio (horizontal axes), with each state marked by its postal abbreviation. Panels and inset values: Physical Inactivity ($\hat{\rho}_s = -0.78$, $q = 2.7 \times 10^{-9}$); High Blood Pressure ($\hat{\rho}_s = -0.77$, $q = 2.7 \times 10^{-9}$); Diabetes ($\hat{\rho}_s = -0.76$, $q = 5.4 \times 10^{-9}$); Overweight ($\hat{\rho}_s = -0.73$, $q = 3.2 \times 10^{-8}$); Heart Disease Death ($\hat{\rho}_s = -0.73$, $q = 2.5 \times 10^{-8}$); Life Expectancy ($\hat{\rho}_s = 0.68$, $q = 5.8 \times 10^{-7}$).]

FIG. 7. Six demographic quantities compared with caloric ratio Crat for the contiguous US. The inset values are the Spearman
correlation coefficient $\hat{\rho}_s$, and the Benjamini-Hochberg q-value. See Tab. I for a full summary of the 37 demographic quantities
studied here.
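
The two inset statistics can be computed along the following lines. This is a minimal pure-Python sketch rather than the code used for the study: Spearman's coefficient is obtained as the Pearson correlation of the rank vectors, and the Benjamini-Hochberg q-values are the usual step-up adjusted p-values. The p-values are assumed to come from a separate significance test, and all function names and numbers are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Step up from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Made-up example: a caloric ratio and a health rate for five regions.
ratio = [0.58, 0.61, 0.64, 0.66, 0.70]
rate = [35.0, 31.0, 32.0, 27.0, 24.0]
rho = spearman(ratio, rate)

# Made-up p-values from several comparisons, corrected together.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A correlation would then be counted as significant at a given level when its q-value, rather than its raw p-value, falls below that level.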
vary with caloric ratio Crat. We see strong correlations with $|\hat{\rho}_s| \ge 0.68$, and the highest value for the Benjamini-Hochberg q-value is $5.8 \times 10^{-7}$.

We present a summary of all results in Tab. I where we have ordered and numbered demographic quantities in terms of ascending Benjamini-Hochberg q-values for Crat. For comparison and to further demonstrate the robustness of our approach, in Tabs. S1, S2, and S3, we reproduce the same analysis with the inclusion of liquids and for a differential measure $C_{\mathrm{diff}}(\alpha) = \alpha C_{\mathrm{out}} - (1-\alpha) C_{\mathrm{in}}$, both with and without liquids. Here, we choose $\alpha$ to set the effective means of Cout and Cin equal across the statewide averages (i.e., $\alpha \langle C_{\mathrm{out}} \rangle = (1-\alpha) \langle C_{\mathrm{in}} \rangle$), resulting in $\alpha = 0.598$. Overall, we find little variation in our results whether we use Crat or $C_{\mathrm{diff}}(0.598)$.

Surveying the health-based demographics, we found Crat was significantly correlated with all chronic disease-related rates we tested against (high blood pressure (#3), adult diabetes (#4), adult overweight and obesity (#6), heart disease deaths (#7), adult obesity (#8), childhood overweight and obesity (#13), high cholesterol (#19), and colorectal cancer (#22)). All of these but colorectal cancer rate were also significantly correlated with Cout.

Caloric input Cin results were more mixed. Chronic disease-related rates were also significantly correlated with Cin, with the exception of adult diabetes, childhood overweight and obesity, and high cholesterol, after correcting for multiple comparisons.

The variables relating to unhealthy habits (smoking (#16) and binge drinking rates (#26)) both correlated significantly with all three of our measures with the one exception of binge drinking and caloric input. The directions of the correlations for these two habits are opposite each other (e.g., negative for smoking and Crat, positive for binge drinking and Crat), consistent with recent work on alcohol consumption [35].

The two variables relating to physical activity rates (percent of population that has had no physical activity in past 30 days (#1), and percent of population that has been physically active in past 30 days (#2)) correlated significantly with all three of our measures. The two measures relating to rates of physical and mental health (average number of poor mental health days in past 30 days (#24), and average number of poor physical health days in past 30 days (#27)) correlated significantly with both Cout and Crat, but did not correlate significantly with Cin.

The four variables relating to fruit and vegetable con-
Health and/or well-being quantity | $\hat{\rho}_s$ for Crat | q-val | $\hat{\rho}_s$ for Cin | q-val | $\hat{\rho}_s$ for Cout | q-val
1. % no physical activity in past 30 days [24] | -0.78 | 2.73 × 10^-9 | 0.58 | 5.67 × 10^-5 | -0.66 | 1.51 × 10^-6
2. % have been physically active in past 30 days [24] | 0.78 | 2.73 × 10^-9 | -0.57 | 6.53 × 10^-5 | 0.67 | 1.24 × 10^-6
3. % high blood pressure [24] | -0.77 | 2.73 × 10^-9 | 0.32 | 4.05 × 10^-2 | -0.78 | 2.73 × 10^-9
4. Adult diabetes rate [25] | -0.76 | 5.44 × 10^-9 | 0.29 | 6.09 × 10^-2 | -0.77 | 2.73 × 10^-9
5. CNBC quality of life ranking [26] | -0.76 | 6.75 × 10^-9 | 0.28 | 7.34 × 10^-2 | -0.77 | 3.60 × 10^-9
6. % adult overweight/obesity [27] | -0.73 | 3.16 × 10^-8 | 0.55 | 1.41 × 10^-4 | -0.59 | 3.07 × 10^-5
7. Heart disease death rate [27] | -0.73 | 2.50 × 10^-8 | 0.34 | 2.80 × 10^-2 | -0.73 | 2.30 × 10^-8
8. % adult obesity [25] | -0.72 | 4.30 × 10^-8 | 0.53 | 2.26 × 10^-4 | -0.59 | 2.96 × 10^-5
9. Gallup Wellbeing score [4] | 0.72 | 4.69 × 10^-8 | -0.31 | 4.43 × 10^-2 | 0.73 | 3.99 × 10^-8
10. America's Health Rankings, overall [24] | -0.72 | 4.10 × 10^-7 | 0.43 | 4.74 × 10^-3 | -0.67 | 2.77 × 10^-6
11. Life expectancy at birth [27] | 0.68 | 5.81 × 10^-7 | -0.4 | 6.91 × 10^-3 | 0.65 | 2.64 × 10^-6
12. % who eat fruit less than once a day [28] | -0.67 | 1.20 × 10^-6 | 0.61 | 1.39 × 10^-5 | -0.51 | 5.35 × 10^-4
13. % child overweight/obesity [27] | -0.64 | 3.53 × 10^-6 | 0.27 | 7.55 × 10^-2 | -0.64 | 3.20 × 10^-6
14. % who eat vegetables less than once a day [28] | -0.61 | 1.39 × 10^-5 | 0.51 | 5.33 × 10^-4 | -0.46 | 1.57 × 10^-3
15. Median daily intake of fruits [28] | 0.6 | 1.98 × 10^-5 | -0.62 | 8.33 × 10^-6 | 0.41 | 5.37 × 10^-3
16. Smoking rate [27] | -0.59 | 2.96 × 10^-5 | 0.51 | 5.26 × 10^-4 | -0.48 | 1.08 × 10^-3
17. Median household income [27] | 0.51 | 5.55 × 10^-4 | -0.53 | 3.27 × 10^-4 | 0.4 | 8.38 × 10^-3
18. Median daily intake of vegetables [28] | 0.5 | 6.10 × 10^-4 | -0.56 | 7.44 × 10^-5 | 0.31 | 4.36 × 10^-2
19. % high cholesterol [24] | -0.49 | 8.11 × 10^-4 | 0.23 | 1.45 × 10^-1 | -0.48 | 9.05 × 10^-4
20. Brain health ranking [29] (lower is better) | -0.49 | 8.11 × 10^-4 | 0.62 | 1.39 × 10^-5 | -0.29 | 5.70 × 10^-2
21. % with bachelor's degree or higher [6] | 0.46 | 1.57 × 10^-3 | -0.54 | 1.66 × 10^-4 | 0.33 | 2.82 × 10^-2
22. Colorectal cancer rate [25] | -0.44 | 4.09 × 10^-3 | 0.53 | 3.59 × 10^-4 | -0.27 | 8.25 × 10^-2
23. US Census Gini index score [30] (lower is better) | -0.42 | 5.37 × 10^-3 | -0.03 | 8.42 × 10^-1 | -0.5 | 5.55 × 10^-4
24. Avg # poor mental health days, past 30 days [24] | -0.42 | 5.37 × 10^-3 | 0.12 | 4.80 × 10^-1 | -0.48 | 1.06 × 10^-3

25. Neuroticism Big Five personality trait [31] | -0.38 | 1.09 × 10^-2 | 0.2 | 2.03 × 10^-1 | -0.37 | 1.44 × 10^-2
26. Binge drinking rate [24] | 0.37 | 1.46 × 10^-2 | -0.15 | 3.56 × 10^-1 | 0.41 | 5.84 × 10^-3
27. Avg # poor physical health days, past 30 days [24] | -0.35 | 2.34 × 10^-2 | 0.19 | 2.19 × 10^-1 | -0.38 | 1.13 × 10^-2
28. Farmers markets per 100,000 in pop. [28] | 0.34 | 2.72 × 10^-2 | 0.06 | 7.17 × 10^-1 | 0.42 | 5.14 × 10^-3

29. Strolling of the Heifers locavore score (lower is better) [32] | -0.29 | 5.86 × 10^-2 | -0.3 | 5.41 × 10^-2 | -0.45 | 2.94 × 10^-3
30. Extraversion Big Five personality trait [31] | -0.28 | 6.94 × 10^-2 | 0.03 | 8.42 × 10^-1 | -0.29 | 5.63 × 10^-2
31. % schools offering fruit/veg at celebrations [28] | 0.24 | 1.31 × 10^-1 | -0.46 | 1.96 × 10^-3 | 0.05 | 7.90 × 10^-1
32. Openness Big Five personality trait [31] | 0.23 | 1.31 × 10^-1 | -0.5 | 6.11 × 10^-4 | 0.04 | 8.10 × 10^-1
33. % cropland harvested for fruits/veg [28] | 0.19 | 2.34 × 10^-1 | -0.62 | 1.37 × 10^-5 | -0.04 | 8.10 × 10^-1
34. Conscientiousness Big Five personality trait [31] | -0.12 | 4.81 × 10^-1 | 0.2 | 2.10 × 10^-1 | -0.05 | 7.93 × 10^-1
35. % census tracts, healthy food retailer within 1/2 mile [28] | -0.03 | 8.44 × 10^-1 | -0.52 | 3.68 × 10^-4 | -0.24 | 1.31 × 10^-1
36. George Mason overall freedom ranking [33] (lower is freer) | -0.03 | 8.42 × 10^-1 | -0.11 | 5.15 × 10^-1 | -0.1 | 5.64 × 10^-1
37. Agreeableness Big Five personality trait [31] | -0.01 | 9.61 × 10^-1 | 0.22 | 1.50 × 10^-1 | 0.08 | 6.47 × 10^-1

TABLE I. Spearman correlation coefficients, $\hat{\rho}_s$, and Benjamini-Hochberg q-values for correlations of caloric input Cin, caloric output Cout,
and caloric ratio Crat = Cout/Cin with demographic data related to food and physical activity, Big Five personality traits [31],
health and well-being rankings by state, and socioeconomic status, ordered from strongest to weakest Spearman
correlations with caloric ratio. The two breaks in the table indicate significance levels of 0.01 and 0.05 for the Benjamini-
Hochberg q of Crat, corresponding to the first 24 health and/or well-being quantities and then the next four, numbers 25 to 28.
The bottom 9 quantities were not significantly correlated with Crat according to our tests. Tabs. S1, S2, and S3 present the
same analysis for caloric measures including phrases representing liquids, and for the difference $C_{\mathrm{diff}}(\alpha) = \alpha C_{\mathrm{out}} - (1-\alpha) C_{\mathrm{in}}$,
both without and with liquids included.
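
The balance point for the caloric difference $C_{\mathrm{diff}}(\alpha) = \alpha C_{\mathrm{out}} - (1-\alpha) C_{\mathrm{in}}$ follows from requiring $\alpha \langle C_{\mathrm{out}} \rangle = (1-\alpha) \langle C_{\mathrm{in}} \rangle$, which rearranges to $\alpha = \langle C_{\mathrm{in}} \rangle / (\langle C_{\mathrm{in}} \rangle + \langle C_{\mathrm{out}} \rangle)$. The short Python sketch below applies this to made-up statewide values (not the study's data); the function names are hypothetical.

```python
def balance_alpha(c_in, c_out):
    """alpha that solves alpha * mean(c_out) = (1 - alpha) * mean(c_in)."""
    mean_in = sum(c_in) / len(c_in)
    mean_out = sum(c_out) / len(c_out)
    return mean_in / (mean_in + mean_out)


def caloric_difference(c_in, c_out, alpha):
    """C_diff(alpha) = alpha * C_out - (1 - alpha) * C_in for each region."""
    return [alpha * o - (1.0 - alpha) * i for i, o in zip(c_in, c_out)]


# Made-up caloric input and output values for three regions.
c_in = [250.0, 270.0, 280.0]
c_out = [200.0, 180.0, 160.0]

alpha = balance_alpha(c_in, c_out)
c_diff = caloric_difference(c_in, c_out, alpha)
mean_diff = sum(c_diff) / len(c_diff)
```

With this choice of $\alpha$ the average of $C_{\mathrm{diff}}$ across regions is zero, so positive and negative values mark regions above and below the overall balance.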
sumption rates all correlated significantly with all three of our measures. The variables relating to presence of produce in the state (percent of cropland in each state harvested for fruits and vegetables (#33), percent of census tracts with a healthy food retailer within one-half mile (#35), and percent of schools offering fruits and vegetables at celebrations (#31)) were significantly correlated with Cin but were not correlated with Cout or Crat. Variables relating to local food (number of farmers markets per 100,000 people (#28) and Strolling of the Heifers locavore score (#29)) were not significantly correlated with Cin, but were significantly correlated with Cout.

Our health and well-being ranking variables included the CNBC quality of life ranking (#5), Gallup Wellbeing ranking (#9), America's Health Ranking overall state rank (#10), life expectancy ranking (#11), Brain Health ranking (#20), Gini index score (#23), and George Mason's overall freedom ranking (#36). Caloric ratio correlated with all of these variables except for George Mason's freedom ranking (which did not correlate with any of our three measures). Cout correlated significantly with all of these measures except for the Brain Health ranking and the freedom ranking. Caloric input Cin did not correlate significantly with the CNBC quality of life ranking, Gini index score, or freedom ranking.

Regarding correlations with the Big Five personality traits, Pesta et al. noted that "Neuroticism...emerged as the only consistent Big Five predictor of epidemiologic outcomes (e.g., rates of heart disease or high blood pressure) and health-related behaviors (e.g., rates of smoking or exercise)" [36]. Additionally, neuroticism correlates with many health-related variables, including depression and anxiety disorders, mortality, coping skill, death from cardiovascular disease, and whether one smokes tobacco [36]. Here, in keeping with these observations, we found that neuroticism (#25) was indeed the only Big Five personality trait that correlated significantly and negatively with caloric ratio.

We also tested our three measures against two measures of socioeconomic status, median income (#17) and percent of state with a bachelor's degree or higher level of education (#21), and found these correlations were significant for all three of our measures.

III. CONCLUDING REMARKS

Our Lexicocalorimeter has thus, when applied to Twitter, proved to find and demonstrate a range of strong, commonsensical patterns and correlations for the contiguous US. We invite the reader to explore our online instrument, a screenshot of which is shown in Fig. 8.

Given the complex relationships between health, well-being, happiness, and various measures of socioeconomic status, it is rather difficult to say that we are only measuring health or only measuring well-being. We are also measuring socioeconomic status to some extent. However, the correlations between caloric ratio and measures of socioeconomic status are not as strong as the correlation of caloric ratio with many of the other measures. Given the above, we believe that the caloric content of tweets can be used successfully, along with other well-being and quality of life measures, to help gauge overall well-being in a population.

There are many potential forward directions. A promising avenue is to incorporate tunability to the Lexicocalorimeter by manipulating its dynamic range. While we chose the caloric ratio Crat for its generality in the main body of this work, there is more flexibility in the measurement of caloric difference: $C_{\mathrm{diff}}(\alpha) = \alpha C_{\mathrm{out}} - (1-\alpha) C_{\mathrm{in}}$. Though a universal approach is unclear ($\alpha$ should be independent of the particular data set being studied), we may profit from the versatility of $C_{\mathrm{diff}}(\alpha)$ when focusing on a single demographic. For example, if we are interested in diabetes rates, we could tune the instrument to obtain the best correlation with known levels, and thereby create a real-time estimator. To do so, we would tune $\alpha$ and find the value that gives the highest correlation between $C_{\mathrm{diff}}(\alpha)$ and diabetes rates for a given set of populations. Of course, we could use a black box method to generate a more optimal fit, but in basing our instrument on food and activity words, we have a far more principled approach that grants us the opportunity not just to mimic but to understand and explain patterns that we find. In particular, our word shifts will be of great use in showing why our hypothetical estimate of diabetes is varying across populations.

We fully recognize that the Twitter population is not the same as the general population; Twitter users differ from the general population in terms of race, age, and urbanity [7]. However, we currently have no reliable way to know, for example, the true age, race, gender, and education level of individual users and as such, are not able to adjust for these factors. While we were able to vet our food and physical activity lists to some extent (as described in Methods and Materials), we could not realistically go through every tweet to be certain that the phrase was being used in the way that we thought. We realize that even if the phrases are being used as we imagine, it does not necessarily mean that the person who tweeted actually performed the physical activity or ate the tweeted-about food (West et al. address a similar issue in inferring food consumption from accessing recipes online [18]).

We also currently do not know at what point our metric breaks down at smaller time scales (e.g., months or weeks) or for smaller spatial regions (e.g., at the city or county level). Our preliminary research shows that the physical activity metric on its own may be quite effective at the city level, but the food measure may not be accurate on a smaller scale. We have also found the physical activity list to be robust to random partitioning [37], whereas the food list was not. We believe that these preliminary findings may be due to several factors: (a) the size of the food list (just over 1400 phrases) is much smaller than the physical activity phrase list (just over 13,400 phrases); (b) there are generally more tweets about physical activities in our list than the foods in our food list; and (c) the amount of data within a city may not be a large enough sample for any food-based Twitter metric. We note that we have not tried using the metric on counties or Census block or tract groups, and it may be that these are more conducive to the metric.

We propose to use crowdsourcing as a way to build a more comprehensive food phrase list that includes commonly eaten foods with brand names as well as food slang that we did not capture here. Ideally, we would arrive at a food phrase database similar in scale to that of our existing physical activity phrase list. However we move
[Figure 8: dashboard screenshot titled "How do I look in these tweets? Gauging well-being through "caloric content" of tweets", showing a Caloric Balance map of the contiguous US (states marked by postal abbreviation), a bar chart of Caloric Balance against State Rank, and food and activity phrase shifts for Vermont. Food panel: "Why Vermont consumes more calories on average"; Average US calories = 267.92; Vermont calories = 268.66 (Rank 29 out of 49); horizontal axis: per food phrase caloric shift. Activity panel: "Why Vermont expends more calories on average"; Average US caloric expenditure = 176.60; Vermont caloric expenditure = 203.22 (Rank 3 out of 49); horizontal axis: per activity phrase caloric expenditure shift.]

FIG. 8. Screenshot of the interactive dashboard for our prototype Lexicocalorimeter site (taken 2015/07/03). An archived
development version can be found as part of our paper's Online Appendices at http://compstorylab.org/share/papers/alajajian2015a/maps.html,
and a full dynamic implementation will be part of our Panometer project at http://panometer.org/instruments/lexicocalorimeter.
See https://github.com/andyreagan/lexicocalorimeter-appendix for source code.
forward, we believe it is clear that the Lexicocalorimeter we have designed and implemented is already of some potency and may be improved substantively in the future.

IV. METHODS AND MATERIALS

To estimate the caloric content of text-extracted phrases [37] relating to food (caloric input) and physical activity (caloric output), we needed comprehensive lists of foods and physical activities and their respective caloric content and expenditure information. Here, we explain in detail how we constructed these phrase lists and assigned calories to each phrase.

In dataset S1 (https://dx.doi.org/10.6084/m9.figshare.4530965.v1), we provide message IDs for all tweets that are part of our study, and we make both this dataset and other material and visualizations available at the paper's Online Appendices (http://compstorylab.org/share/papers/alajajian2015a/), and as part of our Panometer project at http://panometer.org/instruments/lexicocalorimeter. We have drawn on Twitter's Gardenhose API which has been provided to the Computational Story Lab by Twitter.

A. Calorie estimates for phrases

We used the USDA National Nutrient Database [38] to approximate the caloric content of foods, and the Compendium of Physical Activities from Arizona State University and the National Cancer Institute [39] to approximate average Metabolic Equivalent of Tasks (METs) for physical activities, which we converted to calories expended per hour of activity [39]. Because the foods listed in the USDA National Nutrient Database are not described in the way that people talk about food, we created a list of food phrases used on Twitter by starting with a kernel of basic food terms from the USDA's MyPlate website's food group pages [40]. If the food phrase was not specific, such as "cereal", we chose the most popular version of that food in the United States via an informal Google search at the time of the study (in this instance,
Cheerios). If a brand name food was not in the USDA National Nutrient Database, we chose the closest match we could find. (Please note that this means that data in the appendix may be inaccurate when searching brand name items.)

This approach yielded examples of foods in the food groups of fruits, vegetables, grains, proteins, dairy, oils, solid fats, and empty calories (e.g., junk food), and built up a list of nearly 1400 food phrases used on Twitter. For the main results we present in this study, we did not include drinks or soups (liquids) in our list. We found there is very little change in our findings when liquids are included, as we discuss below, and we have omitted them at present both for simplicity and because we were not satisfied with a straightforward way of balancing liquid and solid nutrition estimates. Note that we have included ice creams, oils, and some other items that may act as liquids, and these could be separated out for future versions of our instrument.

For physical activity, we used the physical activities listed in the Compendium to build up a list of nearly 14,000 physical activity phrases used on Twitter. The order-of-magnitude difference between the lengths of the two lists arises from the difference in the number of terms that went into creating each list and the rates at which people tweet about foods vs. physical activities.

B. Phrase extraction

A major obstacle to the development of the food and physical activity lists is the determination of those phrases used by individuals that most accurately represent a food or physical activity. Various methods exist which may help one ascertain information about the frequency of usage of higher-order lexical units [37]. However, we require one that not only determines reasonable estimates of frequency of usage, but further, does so with nuance regarding context. For example, one should not count the phrase "apple" as having occurred if it appeared within a larger phrase that was recognized as meaningful, such as "you're the apple of my eye." To accomplish these goals, we define a low-assumption text segmentation algorithm, which we refer to as serial partitioning.

Serial text partitioning is a greedy algorithm (see Alg. 1) for finding distinct, coherent subsequences (phrases) within a sequence (clause). It relies on the directionality of a sequence, and so is particularly well suited to processing text into multi-word expressions for many modern languages. The algorithm relies on an objective function, which we will generally refer to as $L$. At a high level, the algorithm seeks to find the largest subsequences possible, following a chain of optimizing, growing subsequences.

In the context of this article, we define $L$ relative to a text $T$ as follows, providing pseudocode below. First, let $f : S \rightarrow \mathbb{R}_{\geq 0}$ be the random partition frequency function [37] under the pure random partition probability ($q = 1/2$) for the text $T$. We then apply the model of context developed in [41] under the parameterization $q = 1$, so that a given phrase $s$ is a member of $\ell(s)$ contexts $\mathcal{C}_s$ (e.g., the phrase $s = (New, York, City)$ is a member of three contexts, labeled $\mathcal{C}_s = \{(\ast, York, City), (New, \ast, City), (New, York, \ast)\}$). Then for $C \in \mathcal{C}_s$, we consider the context-local likelihood probabilities:

$$P(s \mid C) = \frac{f(s)}{\sum_{t \in C} f(t)}, \qquad (7)$$

and prescribe to $s$ the likelihood-minimizing context

$$C_s = \underset{C \in \mathcal{C}_s}{\operatorname{argmin}} \big( P(s \mid C) \big), \qquad (8)$$

which chooses the context-pattern that is most prevalent in $T$. The objective function for this instantiation of serial partitioning is then defined as

$$L(s) = P(s \mid C_s), \qquad (9)$$

and referred to as the local likelihood of a phrase $s$.

Algorithm 1. Serial text partitioning of a (left-to-right) directional clause, given an objective function $L : S \rightarrow \mathbb{R}_{\geq 0}$ (whose maximization is desired, in this case) that is zero on the empty phrase $()$, and a clause $t = (t_1, \cdots, t_{\ell(t)})$, consisting of $\ell(t)$ words. Note that for any $a, b \in S$, $a^\frown b = (a_1, \cdots, a_{\ell(a)}, b_1, \cdots, b_{\ell(b)})$ denotes the concatenation of phrases, and that for convenience, a single sequence element, $a_i$, may be treated as a sequence of one term, $(a_i)$.

    procedure SerialTextPartitioning(t)
        $\mathcal{P} \leftarrow ()$                          (init. the partition)
        $s \leftarrow ()$                                    (init. the phrase)
        for $i \in (1, \cdots, \ell(t))$ do
            if $L(s^\frown t_i) > L(s)$ then
                $s \leftarrow s^\frown t_i$
            else
                $\mathcal{P} \leftarrow \mathcal{P}^\frown s$
                $s \leftarrow t_i$
        return $\mathcal{P}$

We manually applied the following criteria for constructing both food and exercise phrase lists. For a phrase to be included, it had to be a phrase that used the food or physical activity word(s) in a way that pertained to eating or physical activity; we excluded phrases that were part of hashtags, Twitter user names, song lyrics, or names of organizations or businesses, and phrases that appeared four or fewer times were not included. Misspellings and alternate spellings were included if we happened upon them (for example, "mash potatoes" instead of "mashed potatoes"), but we did not go out of our way to search for them. We queried questionable phrases to be sure that the majority of their uses were referring to the item of interest. Because we were building up from a small list, some specific versions of foods were included
while more general forms were not. For example, because we built phrases up from "strawberry," "strawberry jam" was included while we did not conduct a larger search for "jam." In another example, in building phrases up from "bacon," "bacon wrapped dates" turned up so we included those dates but did not conduct a larger search for all possible dates. (Note: We removed the physical activities category "sexual activity" from the study because the task of determining meaning and context was too difficult.)

We searched for phrases containing the physical activities in multiple tenses in order to capture as much information as possible. For example, for the activity type "shoveling snow," we searched for the forms of shovel, shoveling, and shoveled. Tweets were initially converted to all lowercase text, so we were assured that we were not missing data due to capitalization. To match each food phrase with its closest caloric data, we found the most closely corresponding food from the USDA National Nutrient Database, counting all vegetables and fruits in their raw form unless the phrase indicated otherwise. Similarly, we entered meats as roasted or cooked with dry heat, not fried, unless the phrase indicated otherwise or there was no homemade option. We used the nutrition content of homemade versions of foods (for example, baked goods) rather than store-bought foods unless the phrase indicated otherwise. Our approach, while systematic, was not exhaustive, nor is it the only way of taking on this challenge; there are certainly other methods that we expect to yield similar results.

Finally, we lemmatized the food phrases by their code in the USDA National Nutrient Database. If there were food phrases that were more general in each set of phrases that held the same code, we used the more general phrase as the lemma.

We lemmatized the activity phrases by their METs and activity category. Activity categories were largely the same as listed in the Compendium with slight changes due to items in the Compendium being listed in a Miscellaneous category, etc. This yielded instances of physical activity phrases that were in the same activity category but were very different with the same METs being included in the same lemma. From this level of lemmatization, we then used our best judgement to break these lemmas down further until proper phrases were included in each lemma.

ACKNOWLEDGMENTS

We thank Slack.com and the Vermont Advanced Computing Core for greatly facilitating our work. PSD was supported by NSF CAREER Grant No. 0846668, and CMD and PSD were supported by NSF BIGDATA Grant No. 1447634.
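
The serial text partitioning procedure of Alg. 1, together with the local likelihood of Eqs. (7)-(9), can be illustrated with a small sketch. The Python below is a minimal illustration under stated assumptions, not the code used in the study: the names `contexts`, `make_local_likelihood`, and `serial_partition` are hypothetical, a plain dictionary of toy counts stands in for the random partition frequency function f of [37], and the final flush of the pending phrase is our addition, since the pseudocode as printed returns the partition directly after the loop.

```python
from collections import defaultdict

WILDCARD = "*"  # assumed placeholder symbol for the masked word in a context


def contexts(phrase):
    """Return the len(phrase) contexts formed by masking one word at a time."""
    return [phrase[:i] + (WILDCARD,) + phrase[i + 1:] for i in range(len(phrase))]


def make_local_likelihood(freq):
    """Build L(s) = P(s | C_s) from a phrase-frequency table.

    `freq` maps phrases (tuples of words) to non-negative frequencies. It is a
    toy stand-in for the random partition frequency function f (assumption).
    """
    # Denominator of Eq. (7): total frequency of all phrases matching a context.
    context_mass = defaultdict(float)
    for phrase, count in freq.items():
        for c in contexts(phrase):
            context_mass[c] += count

    def L(phrase):
        f_s = freq.get(phrase, 0.0)
        if not phrase or f_s == 0:
            return 0.0  # zero on the empty phrase and on unseen phrases
        # Eqs. (8)-(9): evaluate P(s | C) at the likelihood-minimizing context.
        return min(f_s / context_mass[c] for c in contexts(phrase))

    return L


def serial_partition(clause, L):
    """Greedy left-to-right partition of a clause (tuple of words) into phrases."""
    partition, phrase = [], ()
    for word in clause:
        candidate = phrase + (word,)
        if L(candidate) > L(phrase):
            phrase = candidate          # keep growing the current phrase
        else:
            if phrase:                  # skip the initial empty phrase
                partition.append(phrase)
            phrase = (word,)            # start a new phrase at this word
    if phrase:                          # assumption: flush the last pending phrase
        partition.append(phrase)
    return partition


# Toy counts, invented purely for illustration.
freq = {
    ("new",): 40, ("old",): 60,
    ("new", "york"): 10, ("new", "jersey"): 10,
    ("new", "york", "city"): 8, ("new", "york", "times"): 2,
}
L = make_local_likelihood(freq)
print(serial_partition(("new", "york", "city", "is", "big"), L))
# -> [('new', 'york', 'city'), ('is',), ('big',)]
```

With these toy counts the local likelihood rises from 0.4 for "new" to 0.5 for "new york" and 0.8 for "new york city", so the phrase keeps growing. The unseen words score zero and are emitted as single-word phrases because the comparison in Alg. 1 is strict.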
[1] Health-related quality of life: Well-being concepts (2013). http://www.cdc.gov/hrqol/wellbeing.htm; Accessed March 29, 2014.
[2] P. S. Dodds, K. D. Harris, I. M. Kloumann, C. A. Bliss, and C. M. Danforth, PLoS ONE 6, e26752 (2011), draft version available at http://arxiv.org/abs/1101.5120v4. Accessed November 15, 2014.
[3] E. Diener, M. Diener, and C. Diener, Journal of Personality and Social Psychology 69, 851 (1995).
[4] State of the States. http://www.gallup.com/poll/125066/State-States.aspx; Accessed March 29, 2014.
[5] Stimmungsgasometer. http://xn--fhlometer-q9a.de/; Accessed March 29, 2014.
[6] J. Siebens, Extended measures of well-being: Living conditions in the United States: 2011, (2013), accessed on March 15, 2014.
[7] M. Duggan and J. Brenner, The demographics of social media users - 2012, (2013), accessed on March 15, 2014.
[8] A. Signorini, A. M. Segre, and P. M. Polgreen, PLoS ONE 6, e19467 (2011).
[9] V. M. Prieto, S. Matos, M. Alvarez, F. Cacheda, and J. L. Oliveira, PLoS ONE 9, e86191 (2014).
[10] C. Chew and G. Eysenbach, PLoS ONE 5, e14118 (2010).
[11] M. J. Paul and M. Dredze, ICWSM 20, 265 (2011).
[12] D. J. Watts, R. Muhamad, D. Medina, and P. S. Dodds, Proc. Natl. Acad. Sci. 102, 11157 (2005).
[13] Google Flu Trends, https://www.google.org/flutrends/; accessed March 1, 2015.
[14] D. Lazer, R. Kennedy, G. King, and A. Vespignani, Science Magazine 343, 1203 (2014).
[15] L. Mitchell, M. R. Frank, P. S. Dodds, and C. M. Danforth, PLoS ONE 8, e64417 (2013).
[16] P. S. Dodds, E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, K. Megerdoomian, M. T. McMahon, B. F. Tivnan, and C. M. Danforth, Proc. Natl. Acad. Sci. 112, 2389 (2015).
[17] R. Chunara, L. Bouton, J. W. Ayers, and J. S. Brownstein, PLoS ONE 8, e61373 (2013).
[18] R. West, R. W. White, and E. Horvitz, in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013) pp. 1399-1410.
[19] J. C. Eichstaedt, H. A. Schwartz, M. L. Kern, G. Park, D. R. Labarthe, R. M. Merchant, S. Jha, M. Agrawal, L. A. Dziurzynski, M. Sap, C. Weeg, E. E. Larson, L. H. Ungar, and M. E. P. Seligman, Psychological Science (2015).
[20] A. Culotta, in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems, CHI '14 (ACM, New York, NY, USA, 2014) pp. 1335-1344.
[21] S. Abbar, Y. Mejova, and I. Weber, You tweet what you eat: Studying food consumption through Twitter, (2015).
[22] S. C. Walpole, D. Prieto-Merino, P. Edwards, J. Cleland, G. Stevens, and I. Roberts, BMC Public Health 12, 439 (2012).
[23] Hedonometer 2.0: Measuring happiness and using word shifts; Computational Story Lab blog; October 6, 2014; http://compstorylab.org/2014/10/06/hedonometer-2-0-measuring-happiness-and-using-word-shifts/; Accessed on March 1, 2015.
[24] America's Health Rankings report - State Health Statistics; http://AmericasHealthRankings.org, Accessed March 15, 2014.
[25] Centers for Disease Control and Prevention; http://www.cdc.gov, Accessed March 15, 2014.
[26] CNBC overall rankings 2012; http://www.cnbc.com/id/100016697, Accessed March 15, 2014.
[27] State Health Facts - The Henry J. Kaiser Family Foundation; http://kff.org/statedata, Accessed March 15, 2014.
[28] State indicator report on fruits and vegetables. National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity. Centers for Disease Control and Prevention, US Department of Health and Human Services, 2013; http://www.cdc.gov/nutrition/downloads/State-Indicator-Report-Fruits-Vegetables-2013.pdf, Accessed March 15, 2014.
[29] America's Brain Health Index; http://www.beautiful-minds.com/AmericasBrainHealthIndex, Accessed March 15, 2014.
[30] US Census American FactFinder; http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml,
[31] P. J. Rentfrow, S. D. Gosling, M. Jokela, D. Stillwell, M. Kosinski, and J. Potter, Journal of Personality and Social Psychology 105(6), 996 (2013).
[32] Strolling of the Heifers Locavore Index; http://www.strollingoftheheifers.com/locavoreindex/,
[33] Freedom in the 50 states, Mercatus Center, George Mason University; http://freedominthe50states.org/, Accessed March 15, 2014.
[34] Y. Benjamini and Y. Hochberg, Journal of the Royal Statistical Society. Series B (Methodological) 57, 289 (1995).
[35] M. T. French, I. Popovici, and J. C. Maclean, Am J Health Promot 24, 2 (2009).
[36] B. J. Pesta, S. Bertsch, M. A. McDaniel, C. B. Mahoney, and P. J. Poznanski, Intelligence 40, 107 (2012).
[37] J. R. Williams, P. R. Lessard, S. Desu, E. M. Clark, J. P. Bagrow, C. M. Danforth, and P. S. Dodds, Nature Scientific Reports 5, 12209 (2015), available online at http://arxiv.org/abs/1406.5181.
[38] U.S. Department of Agriculture, Agricultural Research Service, USDA National Nutrient Database for Standard Reference, release 25; 2013; http://www.ars.usda.gov/ba/bhnrc/ndl; Accessed March 15, 2014.
[39] B. E. Ainsworth, W. L. Haskell, S. D. Herrmann, N. Meckes, D. R. Bassett Jr., C. Tudor-Locke, J. L. Greer, J. Vezina, M. C. Whitt-Glover, and A. S. Leon, The Compendium of Physical Activities Tracking Guide. Healthy Lifestyles Research Center, College of Nursing & Health Innovation, Arizona State University, (2013).
[40] USDA MyPlate food groups; http://www.choosemyplate.gov/food-groups/; Accessed May 15, 2015.
[41] J. R. Williams, E. M. Clark, J. P. Bagrow, C. M. Danforth, and P. S. Dodds, Physical Review E 92, 042808 (2015).
[Figure S1: two scatter plots of the contiguous US states, labeled by state abbreviation. Vertical axis: Crat (0.60 to 0.80). Horizontal axes: Cin (255 to 285, left panel) and Cout (160 to 210, right panel).]
FIG. S1. Plots for the contiguous US showing the relationships Crat versus Cin (left), and Crat versus Cout (right). With its larger range, caloric output Cout is more tightly coupled with the ratio Crat.
FIG. S2. Histograms as per Fig. 5 with states sorted by food rank. The bar colors correspond to those used for the choropleth maps in Figs. 1, 2, and 3.
FIG. S3. Histograms as per Fig. 5 with states sorted by activity rank. The bar colors correspond to those used for the choropleth maps in Figs. 1, 2, and 3.
[Figure S4: "Four views of food phrase shifts for Colorado." Panel heading: "Why Colorado consumes less calories on average:" Average US calories = 267.25; Colorado calories = 256.58 (Rank 2 out of 49). Panels: A. High calorie foods mentioned more; B. Low calorie foods mentioned less; C. High calorie foods mentioned less; D. Low calorie foods mentioned more. Vertical axis: food rank. Horizontal axis: per food phrase caloric shift.]
FIG. S4. Food phrase shifts for Colorado, broken down into the four ways phrases may contribute to a shift. See Fig. 6A for the combined shift. See Subsec. Phrase Shifts in Sec. Analysis and Results for an explanation of phrase shifts.
[Figure S5: "Four views of food phrase shifts for Mississippi." Panel heading: "Why Mississippi consumes more calories on average:" Average US calories = 267.25; Mississippi calories = 271.37 (Rank 34 out of 49). Panels: A. High calorie foods mentioned more; B. Low calorie foods mentioned less; C. High calorie foods mentioned less; D. Low calorie foods mentioned more. Vertical axis: food rank. Horizontal axis: per food phrase caloric shift.]
FIG. S5. Food phrase shifts for Mississippi, broken down into the four ways phrases may contribute to a shift. See Fig. 6B for the combined shift. See Subsec. Phrase Shifts in Sec. Analysis and Results for an explanation of phrase shifts.
[Figure S6: "Four views of activity phrase shifts for Colorado." Panel heading: "Why Colorado expends more calories on average:" Average US caloric expenditure = 176.60; Colorado caloric expenditure = 203.48 (Rank 2 out of 49). Panels: A. High calorie activities mentioned more; B. Low calorie activities mentioned less; C. High calorie activities mentioned less; D. Low calorie activities mentioned more. Vertical axis: activity rank. Horizontal axis: per activity phrase caloric expenditure shift.]
FIG. S6. Activity phrase shifts for Colorado, broken down into the four ways phrases may contribute to a shift. See Fig. 6C
[Figure S7: "Four views of activity phrase shifts for Mississippi." Panel heading: "Why Mississippi expends fewer calories on average:" Average US caloric expenditure = 176.60; Mississippi caloric expenditure = 161.26 (Rank 49 out of 49). Panels: A. High calorie activities mentioned more; B. Low calorie activities mentioned less; C. High calorie activities mentioned less; D. Low calorie activities mentioned more. Vertical axis: activity rank. Horizontal axis: per activity phrase caloric expenditure shift.]
FIG. S7. Activity phrase shifts for Mississippi, broken down into the four ways phrases may contribute to a shift. See Fig. 6D
Measure                                                 $\rho_s$ for Crat   q-value          $\rho_s$ for Cin   q-value          $\rho_s$ for Cout   q-value
9.  % adult obesity [25]                                -0.69               3.10 × 10^-07    0.52               4.11 × 10^-04    -0.59               3.56 × 10^-05
16. Smoking rate [27]                                   -0.59               3.81 × 10^-05    0.47               1.60 × 10^-03    -0.48               1.24 × 10^-03
22. US Census Gini index score [30] (lower is better)   -0.44               3.60 × 10^-03    0.11               5.12 × 10^-01    -0.5                6.22 × 10^-04
27. Farmers markets per 100,000 in pop. [28]            0.33                2.96 × 10^-02    -0.01              9.59 × 10^-01    0.42                5.41 × 10^-03
37. Agreeableness Big Five personality trait [31]       0                   9.95 × 10^-01    0.24               1.26 × 10^-01    0.08                6.41 × 10^-01

TABLE S1. Identical to Tab. I but with liquids included. Spearman correlation coefficients, $\rho_s$, and Benjamini-Hochberg q-values for caloric input $C_{\rm in}$, caloric output $C_{\rm out}$, and caloric ratio $C_{\rm rat} = C_{\rm out}/C_{\rm in}$ and demographic data related to food and physical activity, Big Five personality traits [31], health and well-being rankings by state, and socioeconomic status, correlated, ordered from strongest to weakest Spearman correlations with caloric ratio.
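
The two statistics named in this caption, a Spearman rank correlation and a Benjamini-Hochberg q-value [34], can be illustrated with a short sketch. The Python below is a minimal, self-contained illustration and not the authors' code: the function names `average_ranks`, `spearman`, and `benjamini_hochberg` are hypothetical, tied values are assumed to receive average ranks, and the numbers in the usage lines are made-up inputs rather than values from the study.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values (step-up adjusted p-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Made-up inputs for illustration only.
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Sorting the measures by the magnitude of the first coefficient column would then reproduce the ordering convention described in the caption.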
Measure                                                 $\rho_s$ for Cdiff  q-value          $\rho_s$ for Cin   q-value          $\rho_s$ for Cout   q-value
9.  % adult obesity [25]                                -0.72               3.70 × 10^-08    0.53               2.26 × 10^-04    -0.59               2.94 × 10^-05
16. Smoking rate [27]                                   -0.6                2.14 × 10^-05    0.51               5.19 × 10^-04    -0.48               1.08 × 10^-03
23. US Census Gini index score [30] (lower is better)   -0.42               4.99 × 10^-03    -0.03              8.45 × 10^-01    -0.5                5.55 × 10^-04
28. Farmers markets per 100,000 in pop. [28]            0.33                2.82 × 10^-02    0.06               7.17 × 10^-01    0.42                5.05 × 10^-03
37. Agreeableness Big Five personality trait [31]       -0.01               9.42 × 10^-01    0.22               1.50 × 10^-01    0.08                6.47 × 10^-01

TABLE S2. Identical to Tab. I but using a caloric difference rather than caloric ratio. Spearman correlation coefficients, $\rho_s$, and Benjamini-Hochberg q-values for caloric input $C_{\rm in}$, caloric output $C_{\rm out}$, and caloric difference $C_{\rm diff}(\alpha) = \alpha C_{\rm out} + (1 - \alpha) C_{\rm in}$ and demographic data related to food and physical activity, Big Five personality traits [31], health and well-being rankings by state, and socioeconomic status, correlated, ordered from strongest to weakest Spearman correlations with caloric ratio. We chose $\alpha$ so that the average of $C_{\rm out}$ matched the average of $C_{\rm in}$.
Measure                                                 $\rho_s$ for Cdiff  q-value          $\rho_s$ for Cin   q-value          $\rho_s$ for Cout   q-value
9.  % adult obesity [25]                                -0.69               3.40 × 10^-07    0.52               4.11 × 10^-04    -0.59               3.56 × 10^-05
16. Smoking rate [27]                                   -0.59               3.77 × 10^-05    0.47               1.60 × 10^-03    -0.48               1.24 × 10^-03
22. US Census Gini index score [30] (lower is better)   -0.44               3.41 × 10^-03    0.11               5.12 × 10^-01    -0.5                6.22 × 10^-04
27. Farmers markets per 100,000 in pop. [28]            0.33                2.88 × 10^-02    -0.01              9.59 × 10^-01    0.42                5.41 × 10^-03
37. Agreeableness Big Five personality trait [31]       0                   9.85 × 10^-01    0.24               1.26 × 10^-01    0.08                6.41 × 10^-01

TABLE S3. Identical to Tab. I but including liquids and using a caloric difference rather than caloric ratio. Spearman correlation coefficients, $\rho_s$, and Benjamini-Hochberg q-values for caloric input $C_{\rm in}$, caloric output $C_{\rm out}$, and caloric difference $C_{\rm diff}(\alpha) = \alpha C_{\rm out} + (1 - \alpha) C_{\rm in}$ and demographic data related to food and physical activity, Big Five personality traits [31], health and well-being rankings by state, and socioeconomic status, correlated, ordered from strongest to weakest Spearman correlations with caloric ratio. We chose $\alpha$ so that the average of $C_{\rm out}$ matched the average of $C_{\rm in}$.
