
Growing Doubts About Big Data

There's quite a kerfuffle going on in the world of big data, with a range of prominent articles in the
past month suggesting it's not the analytical holy grail it's been made out to be. Taken together,
these pieces suggest the start of a serious rethink of what big data can and can't actually do.

Perhaps most prominent is a piece in the journal Science on March 14. It builds from an article in
Nature last year reporting that Google Flu Trends (GFT), after a promising start, flopped in 2013,
drastically overestimating peak flu levels. Science now reports that GFT overestimated flu
prevalence in 100 of 108 weeks from August 2011 on, in some cases with estimates that were double
the CDC's prevalence data.

As well as picking apart GFT's problems (inconsistent data source, possibly inconsistent
measurement terms), the authors blame "big data hubris," which they define as "the often implicit
assumption that big data are a substitute for, rather than a supplement to, traditional data collection
and analysis." Fundamentally, they add: "The core challenge is that most big data that have received
popular attention are not the output of instruments designed to produce valid and reliable data
amenable for scientific analysis."

While "enormous scientific possibilities" remain, the authors say, "quantity of data does not mean
that one can ignore foundational issues of measurement." They also point (as we have in the past) to
the vulnerability of some big data sources (including Twitter and Facebook) to intentional
manipulation.

A prominent research statistician and author, Kaiser Fung, followed with a pretty sharply worded
blog post, not only calling the GFT flu estimates an "epic fail" but saying it's emblematic of a broader
problem in the big data world: "Data validity is being consistently overstated."

As to GFT itself, he added, "Google owes us an explanation as to whether it published doctored data
without disclosure, or if its highly-touted predictive model is so inaccurate that the search terms
found to be the most predictive a few years ago are no longer predictive. If companies want to
participate in science, they need to behave like scientists."

Both Fung and the Science piece were quoted in an op-ed in this Sunday's New York Times, in
which a pair of New York University professors take their turn, pointing out, for instance, that large
datasets can produce large numbers of correlations that are merely spurious, that "many tools that
are based on big data can be easily gamed" and that analytical tools can create an "echo-chamber
effect," for example when Google Translate pieces together translation patterns on the basis of
articles that have been produced using... Google Translate.
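The professors' point about spurious correlations is easy to demonstrate. The sketch below (our illustration, not from the op-ed; the variable names and thresholds are ours) generates a couple hundred series of pure random noise and shows that many pairs nonetheless appear "strongly" correlated by chance alone:

```python
import numpy as np

# Illustrative sketch: with enough variables, some pairs of pure noise
# will correlate strongly by chance alone.
rng = np.random.default_rng(0)
n_vars, n_obs = 200, 30          # 200 unrelated variables, 30 observations each
data = rng.standard_normal((n_vars, n_obs))

corr = np.corrcoef(data)                      # 200 x 200 correlation matrix
upper = corr[np.triu_indices(n_vars, k=1)]    # unique off-diagonal pairs
strong = np.abs(upper) > 0.5                  # "strong" correlations among noise

print(f"{len(upper)} pairs tested, {int(strong.sum())} look strongly correlated")
```

With 200 variables there are 19,900 pairs to test, so dozens of them clear the 0.5 bar even though every series is random — exactly the trap a big dataset sets for an analyst hunting correlations without a theory of their cause.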

There's more: A piece in the Financial Times on March 28, "Big data: are we making a big mistake?"
presents another pointed look at the shortcomings in big-data analysis, suggesting that reliance on
correlations in the absence of a theory of their cause is "inevitably fragile." The size and inherent
messiness of big data, the piece adds, can conceal misleading bias within. It includes this comment
from David Spiegelhalter, a professor at Cambridge University: "There are a lot of small data
problems that occur in big data. They don't disappear because you've got lots of stuff. They get
worse."

Finally, there's a paper prepared for a
conference of the Association for the Advancement of Artificial Intelligence by Zeynep Tufekci, an
assistant professor at the University of North Carolina, Chapel Hill, entitled, "Big Questions for
Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls." She lays out
a range of difficulties in drawing meaning from social media data, spanning sampling and
analytical challenges alike, including many discussed in our own briefing paper on social media,
first released in August 2012.

None of these pieces suggests that the concept of big data is dead. Rather they represent a pullback
from the heady notion that very large datasets can somehow allow researchers to set aside the
niceties of sampling, theory and attention to measurement error. More sober days may follow.

http://abcnews.go.com/blogs/politics/2014/04/growing-doubts-about-big-data/
