Professional Documents
Culture Documents
Jonathan Chang and Itamar Rosenn and Lars Backstrom and Cameron Marlow
Facebook
1601 S California Ave
Palo Alto, California 94304
{jonchang, itamar, lars, cameron}@facebook.com
KL divergence
η. We set the α proportional to the ethnic breakdown of the
population at large, and η so as to have a weak, symmetric
Dirichlet prior on βrl . This model is depicted graphically in 100%
Asian/Pacific Islander
Figure 1. Black
To determine the values of the hidden variables of the 80%
Hispanic
White
model — the individual ethnicities (zn ), the ethnicity of the
aggregate population (θ), and the ethnic breakdown of first 60%
f l
p(θ, β1:R , z1:N |f1:N , l1:N , α, η, β1:R ). 20%
our data set is small, the amount of information that can be 2% Black
Hispanic
learned about each first name is small; we hypothesize that
0% White
for larger data sets where more can be learned about each
first name, the boost in accuracy of fullname.model over jan−2006 jan−2007 jan−2008 jan−2009
5% 5% 5%
0% 0% 0%
Asian/Pacific−Islander Black Hispanic White Asian/Pacific−Islander Black Hispanic White Asian/Pacific−Islander Black Hispanic White
200%
of each individual in a population allows us to understand
how different ethnicities relate to one another, as described
150% in Equation 1. Here, we analyze two kinds of Facebook rela-
tionships: romantic and friendship. We do so by examining
the number of pairs of people in each relationship, broken
100%
down by ethnicity.
Let R denote the set of pairs engaged in the relationship
∆ 1
50% Asian/Pacific−Islander of interest. Then let Q̂ = |R| (r1 ,r2 )∈R Qr1 ,r2 be a R × R
Black
Hispanic
matrix whose entries denote the estimated proportion of rela-
0% White tionship being of a particular pair of ethnicities. Because the
jan−2006 jan−2007 jan−2008 jan−2009
entries of this matrix will be biased towards highly frequent
ethnicities, it is helpful to divide through by the expected
Figure 6: Relative saturation of ethnicities on Facebook. As the
value of each matrix entry if relationships were formed with-
∆
lines converge towards 100% (center), the makeup of U.S. Facebook out regard to ethnicity, Q∗ = θ ∗ θ ∗ T , θr∗ = |{(r1 , r2 )|r1 =
converges towards that of the addressable Internet population. r}|. The entries of the normalized matrix characterizes the
dependence of the relationship on that ethnicity pair.
The dashed lines show the breakdown of U.S. Internet house- The normalized ethnic breakdown of romantic relation-
holds. (Because White users are a large majority, we have ships is shown in Figure 7(a). Dark tiles indicate ethnic pairs
left them out of this plot.) Although Asian/Pacific-Islanders who relate more frequently than expected by chance, while
are still overrepresented, other minorities have nearly reached lighter tiles indicate ethnic pairs who relate less frequently.
parity with the addressable Internet population. The dark tiles along the diagonal indicate that minority eth-
Another approach to visualizing this data is to look at the nicities are assortative in their relationship preferences. The
relative saturation of each race. This is the fraction of users lighter tiles on the off-diagonal indicate that ethnicities have
on Facebook compared to the fraction we would expect from a dispreference towards romantic relationships with other
the U.S. Internet population at that time. For instance, if ethnicities.
Facebook had 100M users, and Asian Americans made up The normalized ethnic breakdown of friendships using
4.4% of the U.S. Internet population, we would expect to find both normalizations is shown in Figure 7(b). The trends in
4.4M Asian users on Facebook. If instead we observe 5M romantic relationships are echoed in the friendships.
then the relative saturation would be roughly 114%. Figure 7(c) shows, for each pair of ethnicities, the frac-
tion of friendships initiated by a user of the column ethnicity.
Figure 6 shows Facebook saturation by ethnic and racial Darker tiles indicate a prevalence towards initiation. Thus
groups. Since 2005, Asian/Pacific Islanders have been much Asian/Pacific-Islanders are less likely to initiate friendships
more likely to be on Facebook than Whites, which remains with those outside their ethnicity, while Whites are most
true to this day. While Hispanics were once only 40 percent likely to do so. The inclination or disinclination to seek
as likely as Whites to be on the site, this likelihood has been friendships with those outside one’s own ethnicity helps ex-
steadily climbing since early 2007, and is currently at 80 plain the insularity pattern observed in Figure 7(a) and (b).
percent. The graph also shows that Black users are now The previous figures are suggestive that individuals’ eth-
about equally as likely to be on the site as White users. nicity can be predicted through their social ties. We explore
this possibility by estimating each user’s ethnicity using the
48%
49%
White White White
50%
51%
52%
recipient
Black Black Black
Asian/Pacific−Islander Black Hispanic White Asian/Pacific−Islander Black Hispanic White Asian/Pacific−Islander Black Hispanic White
initiator
muslim
10 −0.5 jewish
Asian/Pacific−Islander
Black
10−1 christian
Hispanic
−1.5
10 atheist White
10−2
RMSE
2 4 6 8 10
10−2.5 Value relative to overall mean
NOT targeted
nl_NL Photos shared
zh_HK
Notes created
ru_RU
ja_JP Items liked
ko_KR
Groups created
ar_AR
fa_IR
Events created
vi_VN
it_IT Wall posts made
de_DE
Pokes made
zh_TW
targeted
pt_BR Messages sent
zh_CN
Gifts sent
tr_TR
fr_FR
Comments Made
es_ES
0.8 1.0 1.2 1.4
en_PI Value relative to overall mean
es_LA
Figure 12: Usage characteristics of U.S. Facebook users by ethnicity.
en_GB
Numbers are relative to the total number of U.S. Facebook users
id_ID
with that affiliation.
en_US
5 10
Value relative to overall mean
15 Figure 13 shows the geographical distribution of different
ethnicities. Each point is a spatially-binned collection of
Figure 10: The locale of U.S. Facebook users by ethnicity. Numbers users, sized and colored by the relative proportion of users of
are relative to the total number of U.S. Facebook users with that
that ethnicity. Larger/redder points denote areas where that
locale.
ethnicity is more prevalent than one would expect by chance.
For Blacks, highly prevalent areas are concentrated in the
Asian/Pacific−Islander
South, while for Hispanics, these areas are in the Southwest,
Other
Black California, and Florida. Asian/Pacific-Islanders, on the other
Hispanic
White
hand, are more prevalent in coastal urban centers, especially
Libertarian
the San Francisco Bay Area.
Apathetic
5. Discussion and Related Work
Very Conservative There has been much discussion on the issue of race and class
in the context of the Internet, and social media in particular.
Conservative While most of the early dialog focused on access (e.g., the
“Digital Divide”), researchers have more recently shifted their
Moderate
focus to the differentiation in scope and use of the Internet
for varying purposes (DiMaggio et al. 2004). Recently, some
Liberal
research has suggested that online social network member-
ship is becoming increasingly assortative (Boyd 2009), and
Very liberal
that usage of online social media is becoming differentiated
by socio-economic status (Hargittai and Walejko 2008).
Not specified
We propose an approach to determine the ethnic break-
0.4 0.6 0.8 1.0
Value relative to overall mean
1.2 1.4 down of a population based solely on people’s names and
data provided by the U.S. Census Bureau. We demonstrate
Figure 11: The political affiliation of U.S. Facebook users by eth- that our approach is able to predict the ethnicities of individu-
nicity. Numbers are relative to the total number of U.S. Facebook
users with that affiliation.
als as well as the ethnicity of an entire population better than
natural alternatives. We apply our technique to the population
of U.S. Facebook users and discover that while Facebook has
Figure 12 shows how users of different ethnicities use the
always been diverse, this diversity has increased over time
site. Asian/Pacific-Islanders are the most engaged, with an
leading to a population that today looks very similar to the
unexpectedly high number of wall, video, note, gift, com-
overall U.S. population.
ment, and group sharing actions. Blacks and Hispanics share
less than the site average; photos are a notable exception Caveats The observations made in this paper are based on
for Hispanics, while Blacks tend to update their status more a fairly noisy feature, namely people’s names, and while the
often and send more private messages. results are significant the interpretation should come with a
White Black
45
40
35
30
25
Asian/Pacific−Islander Hispanic
45
40
35
30
25
−120 −110 −100 −90 −80 −70 −120 −110 −100 −90 −80 −70
Figure 13: The geographical distribution of U.S. Facebook users by ethnicity. Each point is a spatially-binned collection of users, sized and
colored by the relative proportion of users of that ethnicity.
few caveats. First, we have empirically evaluated our model On smoothing and inference for topic models. In Uncertainty in
in Section 3., but have not yet theoretically modeled error Artificial Intelligence.
throughout our calculations Second, we have left out a signif- Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet
icant portion of the population through smaller ethnic groups. allocation. Journal of Machine Learning Research.
These groups should be included with appropriate interpreta- Boyd, D. 2009. The Not-So-Hidden Politics of Class Online.
tion for a complete analysis. Finally, and most importantly, Buechley, R. W. 1976. Generally useful ethnic search system,
while ethnicity is an important factor in understanding user GUESS. In Annual Meeting of the American Names Society.
behavior, it is often only a proxy for other variables, such Coldman, A. J.; Braun, T.; and Gallagher, R. P. 1988. The
as socioeconomic status, or education. A complete analy- classification of ethnic status using name information. Journal of
sis should control for all such factors to understand which Epidemiology and Community Health 42(4):390–395.
inferred features have the most significant impact. DiMaggio, P.; Hargittai, E.; Celeste, C.; and Shafer, S. 2004.
From unequal access to differentiated use: A literature review and
Future Work The approach taken in this paper suggests a agenda for research on digital inequality. New York, NY: Russell
framework for understanding user behavior in terms of demo- Sage Foundation. 355–400.
graphic features determined through unsupervised modeling.
Erosheva, E.; Fienberg, S.; and Lafferty, J. 2004. Mixed-
A number of extensions could extend the observations made membership models of scientific publications. Proceedings of
in this paper. First, the findings of assortative mixing are pos- the National Academy of Sciences.
sibly the most important piece of behavioral analysis. While Fiscella, K., and Fremont, A. M. 2006. Use of geocoding and
we have only presented a snapshot in time, this analysis could surname analysis to estimate race and ethnicity. Health Services
very easily be extended over a period of time to understand Research 41(4):1482–1500.
how relationships evolve as Facebook grows and friendships Hargittai, E., and Walejko, G. 2008. The participation divide:
grow over time. Second, relationships can also be an impor- Content creation and sharing in the digital age. Information Com-
tant source of information for modeling ethnicity, and the munication and Society 11(2):239.
results of Figure 8 are suggestive that this information can Kali, J. C.; Bethel, J.; Burke, J.; Morganstein, D.; and Westat, S. H.
improve predictions dramatically. Third, the data provided by Using names to check accuracy of race and gender coding in naep.
the census goes beyond surnames, and includes information ASA Section on Survey Research Methods.
about location, professions and other features disclosed by Lauderdale, D. S., and Kestenbaum, B. 2000. Asian american
social network users. These data could be easily incorporated ethnic identification by surname. Population Research and Policy
to improve the predictive power, as shown in Figure 13. Review 19:283–300.
Steyvers, M., and Griffiths, T. 2007. Probabilistic topic models.
References Handbook of Latent Semantic Analysis.
Ambekar, A.; Ward, C.; Mohammed, J.; Male, S.; and Skiena, S. Tucker, D. K. 2005. The cultural-ethnic language group technique
2009. Name-ethnicity classification from open sources. In KDD. as used in the Dictionary of American Family Names (DAFN).
Onomastica Canadiana 87(2).
Asuncion, A.; Welling, M.; Smyth, P.; and Teh, Y. W. 2009.