ePluribus : Ethnicity on Social Networks
Facebook 1601 S California AvePalo Alto, California 94304
jonchang, itamar, lars, cameron
We propose an approach to determine the ethnic break-
down of a population based solely on people’s names and
data provided by the U.S. Census Bureau. We demon-strate that our approach is able to predict the ethnicitiesof individuals as well as the ethnicity of an entire pop-ulation better than natural alternatives. We apply our
technique to the population of U.S. Facebook users and
uncover the demographic characteristics of ethnicitiesand how they relate. We also discover that while Face-book has always been diverse, diversity has increasedover time leading to a population that today looks verysimilar to the overall U.S. population. We also ﬁnd thatdifferent ethnic groups relate to one another in an as-sortative manner, and that these groups have differentproﬁles across demographics, beliefs, and usage of site
of a user base is an important demographicindicator that can be used for marketing, compliance, and
analytics as well as a scientiﬁc tool for understanding social
behavior and increasing diversity through outreach efforts.Unfortunately, ethnic information is often unavailable for
practical, legal, or political reasons.
In this paper, we propose a technique that combines cen-sus information with user features such as surname to inferthe overall ethnic breakdown of a population as well as theethnicity of individual users. Our technique leverages themachinery of mixture modeling and we demonstrate that it
achieves better predictive accuracy than other natural alterna-
We then apply our technique to Facebook
, a social net-
working site that does not explicitly collect ethnicity. Using
the inferred ethnicity of the U.S. user base, we answer the
How diverse are Facebook users?
How do users of different ethnicities use Facebook?
2010, Association for the Advancement of Artiﬁcial
Intelligence (www.aaai.org). All rights reserved.
Throughout this paper we use the term “ethnicity” to refer to
any racial or ethnic census category.
What are the demographic characteristics of each ethnicity
How do ethnicities interact on Facebook?
We report that Facebook ethnicities are assortative in their
interactions, which is consistent with other studies. We also
estimate that Facebook has always been diverse and that its
diversity has increased signiﬁcantly over the past year, to thepoint where the diversity of Facebook users in the U.S. nearly
mirrors that of the overall population of the country.
In this section, we describe a probabilistic, Bayesian ap-proach to estimating the distribution of ethnicities of a pop-ulation given only their names. Our model takes a list of users’ names and applies both census name statistics and
unsupervised learning techniques to estimate the ethnicities
of members of the population.
We emphasize that our approach is only lightly supervisedusing census name statistics. Because representative groundtruth is impossible to obtain, traditional supervised discrimi-native methods such as logistic regression and support vector
machines cannot be directly applied to this problem.
The U.S. Census Bureau’s Genealogy Project publishes a
data set containing the frequency of popular surnames alongwith a breakdown by race and ethnicity.
These data provide
the rank in the population, the total count of people withthe name, their proportion per 100,000 Americans, and thepercent for various ethnicities: White, Black, Asian/PaciﬁcIslander (API), American-Indian/Alaskan Native (AI/AN),
two or more races and Hispanic.
This data set allows us to predict a person’s ethnicity based
solely on their surname. Table 1 shows the top three sur-
names within the top 1,000 ordered by the percent in a given
group. It shows that some ethnicities have distinctive sur-
names while others do not. For instance, 98.1% of individuals
with the name Yoder are White, while the most predictive
name for American Indian / Alaskan Native individuals only
While there are many preferences for describing people’s eth-nicity, we have chosen to use the terms used in the U.S. Census to
be consistent with our data.