Opinion
http://www.wired.com/opinion/2013/02/big-data-means-big-errors-people/

Beware the Big Errors of Big Data


By Nassim N. Taleb
02.08.13 9:30 AM

Photo: Kam2y/Flickr
We're more fooled by noise than ever before, and it's because of a nasty phenomenon called big
data. With big data, researchers have brought cherry-picking to an industrial level.
Modernity provides too many variables, but too little data per variable. So the spurious
relationships grow much, much faster than real information.
In other words: Big data may mean more information, but it also means more false information.


Nassim Taleb
Nassim N. Taleb is the author of Antifragile, from which this piece is adapted. He is a former derivatives trader who became a
scholar and philosophical essayist. Taleb is currently a Distinguished Professor of Risk Engineering at New York University's
Polytechnic Institute. His works focus on decision making under uncertainty; in other words, what to do in a world we don't
understand. His other book is The Black Swan: The Impact of the Highly Improbable.

Just like bankers who own a free option, where they make the profits and transfer losses to
others, researchers have the ability to pick whatever statistics confirm their beliefs (or show
good results) and then ditch the rest.
Big-data researchers have the option to stop doing their research once they have the right result.
In options language: The researcher gets the upside and truth gets the downside. It makes
him antifragile, that is, capable of benefiting from complexity and uncertainty, at the
expense of others.
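
This free option is easy to demonstrate in simulation: a researcher who studies pure noise but peeks at the result after every new batch of data, stopping as soon as it looks significant, turns a nominal 5 percent error rate into something far larger. A minimal sketch (the setup and numbers are illustrative, not Taleb's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_researcher(max_batches=20, batch_size=25):
    """Study a pure-noise effect, peeking after every batch and stopping on 'success'."""
    data = np.empty(0)
    for _ in range(max_batches):
        data = np.append(data, rng.standard_normal(batch_size))  # more noise arrives
        t, p = stats.ttest_1samp(data, 0.0)  # is the mean "significantly" nonzero?
        if p < 0.05:
            return True  # the researcher stops here and reports the "finding"
    return False  # honest null result after all the data is in

trials = 2_000
false_positives = sum(peeking_researcher() for _ in range(trials))
print(f"false-positive rate with optional stopping: {false_positives / trials:.1%}")
# A single fixed-sample test is wrong about 5% of the time under the null;
# peeking after every batch pushes that several times higher.
```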
But beyond that, big data means anyone can find fake statistical relationships, since the spurious
rises to the surface. This is because in large data sets, large deviations are vastly more
attributable to variance (or noise) than to information (or signal). It's a property of sampling: In
real life there is no cherry-picking, but on the researcher's computer, there is. Large deviations
are likely to be bogus.
We used to have protections in place for this kind of thing, but big data makes spurious claims
even more tempting. And fewer and fewer papers today have results that replicate: Not only is it
hard to get funding for repeat studies, but this kind of research doesn't make anyone a hero.
Despite claims to advance knowledge, you can hardly trust statistically oriented sciences or
empirical studies these days.
This is not all bad news, though: If such studies cannot be used to confirm, they can be effectively
used to debunk, to tell us what's wrong with a theory, not whether a theory is right.
Another issue with big data is the distinction between real life and libraries. Because of excess
data as compared to real signals, someone looking at history from the vantage point of a library
will necessarily find many more spurious relationships than one who sees matters in the making;
he will be duped by more epiphenomena. Even experiments can be marred with bias, especially
when researchers hide failed attempts or formulate a hypothesis after the results, thus fitting
the hypothesis to the experiment (though the bias is smaller there).


This is the tragedy of big data: The more variables, the more correlations that can show
significance. Falsity also grows faster than information; it is nonlinear (convex) with respect to
data (this convexity in fact resembles that of a financial option payoff). Noise is antifragile.
[Figure: falsity grows convexly with data. Source: N.N. Taleb]
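
The convexity is easy to see with a back-of-the-envelope count: with p variables there are p(p - 1)/2 pairwise correlations to test, so the number of chances for a spurious hit grows quadratically while the underlying information grows, at best, linearly. A minimal sketch (the threshold and variable counts are illustrative, not Taleb's):

```python
from math import comb

alpha = 0.05  # conventional per-test significance threshold
for p in (10, 50, 100, 200, 500):
    pairs = comb(p, 2)  # distinct pairwise correlations available to test
    print(f"{p:>3} variables -> {pairs:>7,} pairs -> "
          f"~{alpha * pairs:,.0f} spurious hits expected under pure noise")
```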
The problem with big data, in fact, is not unlike the problem with observational studies in
medical research. In observational studies, statistical relationships are examined on the
researcher's computer. In double-blind cohort experiments, however, information is extracted in
a way that mimics real life. The former produces all manner of results that tend to be
spurious, as last computed by John Ioannidis, more than eight times out of 10.
Yet these observational studies get reported in the media and in some scientific journals.
(Thankfully, they're not accepted by the Food and Drug Administration.) Stan Young, an activist
against spurious statistics, and I found a genetics-based study claiming significance from
statistical data, even in the reputable New England Journal of Medicine, where the results,
according to us, were no better than random.
Big data can tell us what's wrong, not what's right.
And speaking of genetics, why haven't we found much of significance in the dozen or so years
since we've decoded the human genome?
Well, if I generate (by simulation) a set of 200 variables, completely random and totally
unrelated to each other, with about 1,000 data points for each, then it would be near
impossible not to find in it a certain number of significant correlations of sorts. But these
correlations would be entirely spurious. And while there are techniques to control the cherry-
picking (such as the Bonferroni adjustment), they don't catch the culprits, much as regulation
didn't stop insiders from gaming the system. You can't really police researchers, particularly
when they are free agents toying with the large data available on the web.
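
Taleb's thought experiment takes only a few lines to reproduce. A minimal sketch (assuming standard normal variables and Pearson correlations; the exact setup is mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_vars, n_obs = 200, 1000
X = rng.standard_normal((n_obs, n_vars))  # 200 unrelated variables, 1,000 points each

hits, n_tests = 0, n_vars * (n_vars - 1) // 2
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(X[:, i], X[:, j])
        if p < 0.05:
            hits += 1  # "significant" at the conventional threshold

print(f"{hits} 'significant' correlations out of {n_tests:,} tests ({hits / n_tests:.1%})")
# All of them are spurious by construction. A Bonferroni-adjusted threshold
# (0.05 / n_tests) removes nearly all of them, but only for the comparisons
# the researcher admits to having run.
```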
I am not saying here that there is no information in big data. There is plenty of information. The
problem, the central issue, is that the needle comes in an increasingly larger haystack.




Comments (49)

Nicolas 16 days ago


I think Taleb is confused. Big data is about adding rows of data more than variables per
observation (columns). Overfitting becomes less of a problem the more data (rows) you have; it is much
easier to overfit the less data (rows) you have.

None of the examples Taleb discusses are actually Big Data. They are quite the opposite!

Anthony Boyles > Nicolas 16 days ago


Taleb is dumbing it down to make his point, which is ironically isomorphic to the very
problem he's discussing. By picking on one particular pitfall of Big Data analytics, he's choosing
the result he wants people to be wary of (false positive results) at the expense of the rest of the
picture (Type II errors, identifiable signal, rejected noise).

That said, I think it's just as simplistic to say Big Data is "about" more observations. To me, Big
Data means both: more variables, and WAY more observations.

Nicolas > Anthony Boyles 16 days ago


I agree with you. I should have emphasized the "more than" part.

WTF > Nicolas 16 days ago


I see it as both more rows and columns. I agree from the article that Taleb might be
dealing with limited assumptions, but that does not exclude the fact that publications still look for
stupid statistical significance, and procedures like Bonferroni corrections are still uncommon.
End result: someone takes data and analyzes it for everything without looking at prior probabilities.
I guess we could call this as much a problem of laissez-faire research as "big data".

We constantly see this in genetics, but everyone is getting wiser quickly...



Solomon Messing 14 days ago


This article is misleading. When the media/public talk about big data, they almost always mean
big N data. Taleb is talking about data where P is "big" (i.e., many, many columns but relatively few rows,
like genetic microarray data where you observe P = millions of genes for about N = 100 people), but he
makes it sound like the issues he discusses apply to big N data as well. Big N data has the OPPOSITE
properties of big P data: spurious correlations due to random noise are LESS likely with big N. Of
course, the more important issue of causation versus correlation is an important problem when
analyzing big data, but one that was not discussed in this article.
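
The big-N point raised in the thread is easy to check numerically: hold the number of columns fixed and grow the number of rows, and the largest spurious correlation shrinks toward zero; it is adding columns that multiplies the chances of finding one. A minimal sketch (the setup and sizes are mine, not the commenters'):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars = 50  # hold the number of columns (P) fixed

for n_obs in (50, 1_000, 20_000):  # grow the number of rows (N)
    X = rng.standard_normal((n_obs, n_vars))  # pure noise, no real relationships
    r = np.corrcoef(X, rowvar=False)          # all pairwise correlations
    np.fill_diagonal(r, 0.0)                  # ignore each variable with itself
    print(f"N = {n_obs:>6}: largest spurious |r| = {np.abs(r).max():.3f}")
```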

