JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms
Washington Academy of Sciences is collaborating with JSTOR to digitize, preserve and extend
access to Journal of the Washington Academy of Sciences
Abstract
The availability and usability of massive data sets have added to the increasing
popularity of big data research. However, common mechanisms of big data
collection (e.g., social media, open source platforms, and other online user data)
can be problematic. Sampling issues, especially selection bias, associated with
these data sources can have far reaching implications for analysis and
interpretation. This paper examines the types of sampling issues that arise in big
data projects, how and why biases occur, and their implications. It concludes by
providing strategies for dealing with sampling and selection bias in big data
projects.
Fall 2015
This content downloaded from
196.191.53.146 on Wed, 20 Jul 2022 06:00:08 UTC
All use subject to https://about.jstor.org/terms
30
Figure 1. Classic Case of Selection Bias: Chicago Tribune Declares “Dewey Defeats
Truman” (photo credit: Associated Press).
the conversation), and are therefore different from those hashtags that were
unsuccessful in generating conversation.
There are several ways researchers may account for selection bias in big
data analysis. Traditionally, selection biases are controlled for through
statistical modeling. Researchers should examine the overall demographics
of their data source and attempt to control for confounding variables.
Certain models and statistical methods are also better equipped to handle
such biases: examples include regression discontinuity designs (Taylor 2014),
difference-in-differences models with matching as suggested by
Heckman (1979), and non-linear Tobit models, used by both Heckman
(1979) and Berk (1983).
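To make one of these approaches concrete, the core of a difference-in-differences estimate can be sketched from group means alone. The numbers below are entirely hypothetical; they only illustrate the arithmetic by which the design subtracts away time-invariant selection differences between groups.

```python
# Minimal difference-in-differences sketch with hypothetical outcome means.
# Groups: treated vs. control, observed before and after an intervention.
means = {
    ("control", "pre"): 10.0,
    ("control", "post"): 12.0,   # control group's trend over time: +2.0
    ("treated", "pre"): 11.0,
    ("treated", "post"): 16.0,   # treated group's change over time: +5.0
}

# DiD subtracts the control group's trend from the treated group's change,
# netting out selection differences that are constant over time.
did = (means[("treated", "post")] - means[("treated", "pre")]) \
    - (means[("control", "post")] - means[("control", "pre")])

print(did)  # 3.0: the estimated effect, under the parallel-trends assumption
```

The estimate is only credible if the two groups would have followed parallel trends absent the intervention; when that assumption is doubtful, the matching step Heckman (1979) suggests is used to construct a more comparable control group first.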
Additionally, not all selection bias is problematic. In some cases, notably
market research, selection bias enables greater targeting of advertising. For
example, retailers use “just-in-time” coupon delivery to target specific
buyers. Entertainment services like Netflix and Amazon use similar
methods to make suggestions to viewers. Likewise, in research projects
where motivated participants are desired, the act of participating in a
hashtag conversation may itself be noteworthy.
In the example of self-selection in hashtag analysis using Twitter,
researchers can better account for the confounding variable issues present
in a self-selected sample by going beyond exploratory, correlational
analyses. Twitter datasets should not be considered random or
representative; rather, these data should be recognized as self-selected, and
missing data should be treated as "missing not at random", that is, missing
due to unobserved or unknown variables. In these analyses, it is also worthwhile
for researchers to examine the cultural and social contexts of these “trending
topics” and interpret their findings appropriately (Tufekci 2014; Meiman
and Freund 2012).
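The distortion introduced by self-selection can be illustrated with a small simulation using entirely hypothetical numbers: if the probability that a user joins a hashtag conversation rises with the strength of their opinion, the observed sample mean will overstate the population mean.

```python
import random

random.seed(42)

# Hypothetical population: each person's "opinion strength" is uniform on [0, 1].
population = [random.random() for _ in range(100_000)]

# Self-selection: the chance of posting equals the opinion strength, so
# strong opinions are over-represented among observed posts.
sample = [x for x in population if random.random() < x]

true_mean = sum(population) / len(population)   # roughly 0.5
naive_mean = sum(sample) / len(sample)          # roughly 2/3, biased upward

print(round(true_mean, 2), round(naive_mean, 2))
```

The naive estimate is biased not because any individual observation is wrong, but because the inclusion mechanism depends on the very quantity being measured, which is exactly the "missing not at random" situation described above.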
Additionally, big data research projects can be strengthened by pulling
dependent variables from external, validated sources. For example, a study
examining political attitudes using Twitter or Facebook content data as
independent or control variables could be strengthened by using voting
behavior or voting registration data from the U.S. Census as a dependent
variable. This strategy, in particular, is useful in guarding against “selecting
on the dependent variable”, as discussed by Tufekci (2014). Alternatively,
researchers may use outside, reliable sources, like the U.S. Census, to
benchmark their findings, providing a sort of "gut check". It
may also be worthwhile for big data researchers to explore mixed-methods
approaches to their studies. By complementing big data analyses with
surveys, interviews, and other data collection methods, researchers can
better understand the larger context of their data and provide more robust,
representative findings.
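One simple form of the benchmarking described above is post-stratification: reweighting a self-selected sample so that its demographic composition matches an external source such as the Census. The age groups, shares, and outcome means below are hypothetical placeholders, not figures from any real survey.

```python
# Hypothetical demographic shares: the sample over-represents young users,
# as social media samples often do relative to Census benchmarks.
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}  # external benchmark
sample_share     = {"18-29": 0.50, "30-49": 0.35, "50+": 0.15}  # observed sample

# Hypothetical group means of some outcome measured within the sample.
group_mean = {"18-29": 0.70, "30-49": 0.55, "50+": 0.40}

# Naive estimate: weights each group by its (biased) share of the sample.
naive = sum(sample_share[g] * group_mean[g] for g in group_mean)

# Post-stratified estimate: reweights each group to its population share.
weighted = sum(population_share[g] * group_mean[g] for g in group_mean)

print(naive, weighted)  # the reweighted estimate is pulled back toward older groups
```

Reweighting of this kind corrects only for selection on the observed demographic variables; selection driven by unobserved factors, as in the hashtag example, is untouched by it.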
Big data research seems poised to revolutionize data analytics as we
know it. By amassing such large amounts of data, researchers can observe
correlations that may not manifest in smaller samples and can analyze large,
near-real-time streams of data from numerous sources. While detecting
previously missed correlations can spur new research questions and new
understandings of processes in fields like health and social science, it is not
a substitute for established theory, hypothesis development grounded in the
extant social science literature, and statistical modeling. As shown
throughout this paper, these data may also be subject to selection biases,
which can skew the findings and implications of the research. By incorporating
more “small data” methods and techniques into research, such as mixed-
method studies, alternate non-linear models, and benchmarking, big data
analysts can strengthen their studies and findings and advance big data
research.
References
Anderson, Chris. 2008. "The End of Theory: The Data Deluge Makes the Scientific
Method Obsolete." Wired 16-07.
Berk, Richard A. 1983. “An Introduction to Sample Selection Bias in Sociological Data”.
American Sociological Review, 48(3), 386-398.
Bollier, David, and C. M. Firestone. 2010. The Promise and Peril of Big Data.
Washington, D.C.: Aspen Institute, Communications and Society Program.
De Mauro, Andrea, Marco Greco, and Michele Grimaldi. 2015. “What is Big Data? A
Consensual Definition and a Review of Key Research Topics”. In AIP Conference
Proceedings (Vol. 1644, pp. 97-104).
Duggan, Maeve, Nicole B. Ellison, Cliff Lampe, Amanda Lenhart, and Mary Madden.
2015. “Social Media Update 2014,” Pew Research Center. [Available at:
http://www.pewinternet.org/2015/01/09/social-media-update-2014/ (Accessed 10
September 2015)]
Fan, Jianqing, Fang Han, and Han Liu. 2014. “Challenges of Big Data Analysis”.
National Science Review, 1(2), 293-314.
Geddes, Barbara. 1990. "How the Cases You Choose Affect the Answers You Get:
Selection Bias in Comparative Politics." Political Analysis 2: 131-150.
Harford, Tim. 2014. “Big data: A big mistake?” Financial Times, 11(5), 14-19.
Heckman, James J. 1979. “Sample Selection Bias as a Specification Error”.
Econometrica 47(1) 153-161.
Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd. 1998.
“Characterizing Selection Bias Using Experimental Data”. Econometrica, 66(5),
1017-1098.
LaValle, Steve, Eric Lesser, Rebecca Shockley, Michael S. Hopkins, and Nina
Kruschwitz. 2013. “Big data, Analytics and the Path from Insights to Value”. MIT
Sloan Management Review, 21.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The
Parable of Google Flu: Traps in Big Data Analysis”. Science 343(6176), 1203-1205.
Lohr, Steve. 2012. “The Age of Big Data”. New York Times, Feb. 11 2012.
Meiman, Jon, and Jeff E. Freund. 2012. "Large Data Sets in Primary Care Research." The
Annals of Family Medicine 10(5) 473-474.
Pew Research Center. 2015. “The Smartphone Difference” [Available at:
http://www.pewinternet.org/2015/04/01/us-smartphone-use-in-2015/ (Accessed 10
September 2015)]
Price, Megan, and Patrick Ball. 2014. “Big Data, Selection Bias, and the Statistical
Patterns of Mortality in Conflict”. SAIS Review of International Affairs, 34(1), 9-20.
Raghupathi, Wullianallur, and Viju Raghupathi. 2014. “Big Data Analytics in Healthcare:
Promise and Potential”. Health Information Science and Systems, 2(1)
[http://www.hissjournal.com/content/2/1/3 (accessed 10 September 2015)]
Sagiroglu, Seref, and Duygu Sinanc. 2013. “Big Data: A Review”. Proceedings for the
Institute of Electrical and Electronics Engineers 2013 Annual Conference. 42-47.
Taylor, Eric. 2014. “Spending More of the School Day in Math Class: Evidence from a
Regression Discontinuity in Middle School”. Journal of Public Economics, 117,
162-181.
Tufekci, Zeynep. 2014. “Big Questions for Social Media Big Data: Representativeness,
Validity and Other Methodological Pitfalls”. In ICWSM ’14: Proceedings of the 8th
International AAAI Conference on Weblogs and Social Media. 504-514.
Yang, Qiang, and Xindong Wu. 2006. “10 Challenging Problems in Data Mining
Research”. International Journal of Information Technology & Decision Making,
5(04), 597-604.
Bios