



Most published scientific results are false.
That's the thesis argued, convincingly, by John Ioannidis in his landmark 2005 paper.
Here's one reason for this.
Imagine that you're living in an alternate past. The year is 1850. It's the height of the California gold rush. Except, in this timeline, you're living in a more scientifically literate society, so researchers set out to prove that there are ways to more efficiently strike it rich. An agenda I can get behind.
These researchers try whatever they can think of, testing, to name a few: the divine power of prayer, sacrificing groundhogs, and dowsing. None of these pan out (heh).
Then, one otherwise uneventful Tuesday, a paper is published, "Human
echolocation of gold veins: evidence of an adaptive mechanism for aural mineral detection." The paper describes an experiment where researchers split
1000 gold-hungry migrants into two groups: a traditional control and an echolocation group. The echolocation group was instructed to walk around clacking their tongues in pursuit of the shiny yellow stuff.
Here's the crazy part: the echolocation people found a statistically significant amount of excess gold. The authors speculate that humans have a heretofore unknown sensory organ that picks up mineral densities. This, they say,
was useful in the ancestral environment for tasks like finding salt, copper,
and primo soon-to-be farmland.
Everyone reads the paper. It's flawless, compelling. I'd call it a slam dunk, but basketball won't be invented for another 40 years. Soon, only fools search for gold without clacking their tongues madly, harnessing this power of human echolocation.

Except, you know, homo sapiens (unfortunately) have no particular knack
for echomining. So what went wrong? Why is this paper false, even though the
researchers executed the scientific process the "right" way?

Here's one possible answer: those researchers weren't the only ones who tested echolocation. There was a lot of excitement (gold to be had!). Anything and everything was tried.
For the researchers who found nothing but a null result, no paper was published. "Of course humans can't detect gold with tongue clacking. This is not publication-worthy science."
But, in a universe where enough people test something, someone is
bound to obtain a significant result. The laws of probability dictate it.
If 20 teams tested human echolocation, the odds of at least one team reporting a false positive are not 5%, like you might expect with a p-value threshold of .05, but about 64% (1 - 0.95^20 ≈ 0.64).
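You can check that number yourself. Here's a quick plain-Python sketch (the parameters are the story's, not real data): 20 teams each test a nonexistent effect at a .05 threshold, and we count how often at least one of them gets "lucky."

```python
import random

random.seed(0)

TEAMS = 20       # independent teams, all testing a true-null effect
ALPHA = 0.05     # significance threshold
TRIALS = 20_000  # simulated "universes"

# Analytically: P(at least one false positive) = 1 - (1 - alpha)^TEAMS
analytic = 1 - (1 - ALPHA) ** TEAMS

# By simulation: each team hits significance with probability alpha
# purely by chance, since the effect doesn't exist.
hits = sum(
    any(random.random() < ALPHA for _ in range(TEAMS))
    for _ in range(TRIALS)
)
simulated = hits / TRIALS

print(round(analytic, 3))   # 0.642
print(round(simulated, 3))  # close to 0.642
```

The individual teams did nothing wrong; the false positive is a property of the whole ensemble of attempts.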

This concept is called the "file drawer problem," because studies that find
expected, business-as-usual results are relegated to the metaphorical file
drawer, away from the light of the fair sun.
This is a real problem, with real consequences. The antidepressant reboxetine, for instance, was approved for use after clinical trials found it effective.
One problem: the stuff doesn't work. It was publication bias on the part of the manufacturer, Pfizer, which ran many studies but published only the positive ones.
Something similar, but less nefarious, happens with the study of extrasensory perception: you know, psychics, mind control, that kind of thing.
When a study finds that people can't read minds (duh), the paper doesn't get written up, let alone published.
But when such a study finds a positive result, everyone wants to read it.
There is a whole field like this, parapsychology, and many of the papers it encompasses are at least as fanciful as my echolocation example.
Like JB Hasted's paper, "Paranormal Metal Bending," where he reports that participants could bend metal with their minds. It contains a photo of a glass globe filled with paranormally-scrunched paper clips. Notably, there is a hole in the globe, but Hasted explains, "We have found it necessary that a small orifice be left in the glass globes in which wires are bent."

Amusingly, academic finance has the opposite problem. If you do discover an anomaly in, say, the stock market, you don't publish it. You arbitrage
the opportunity away, making millions in the process.

This is one of the reasons why most published research is false. In the
coming pages, I'll detail 5 more common mistakes, and what you can do to
avoid them.

Mistakes I've Made

These are all errors that I can recall making myself at one point or another. Mistakes I'm sometimes embarrassed to admit ever happened, but they did and are, in the case of my GitHub repositories, a matter of public record.
Yeah, I've been fooled a lot of times. To name a few:
The depletion model of willpower. I read this book by Roy Baumeister, and then later saw him speak. It all made sense to me: when you do something hard, your brain "uses up" some willpower, and that needs to be replenished, maybe via a sugary drink.
But this has failed to replicate a couple of times. The most damning result is a study by Veronika Job, Carol Dweck, and Greg Walton, which found that the depletion model of willpower only holds for those who believe in it. It's "all in your head."
Happiness tipping points. Okay, this one is embarrassing, because I really should have known better. Barbara Fredrickson and Marcial Losada came out with this pretty neat paper in 2005, called "Positive Affect and the Complex Dynamics of Human Flourishing."
Basically, it argues that above a certain ratio of positive to negative emotions, there exists a tipping point where humans flourish. So, if you have 2 good moments for every bad moment, you're kinda miserable. But if you have 3 good moments, you're thriving, like the Dalai Lama or something.
So, the red flag that I should have spotted in this paper is that the authors peg that tipping point at 2.9013.
You're never going to get a value that precise in psychology.

And I say that this is a red flag because the paper was later retracted. One critique of it said that there is "no theoretical or empirical justification for the use of differential equations drawn from fluid dynamics, a subfield of physics, to describe changes in human emotions over time."
Whoops again.
Stereotype threat. One more, I can't resist. How I have been wrong! Let
me count the ways.
The basic idea is that priming negative stereotypes can hinder performance. So if your son is in ballet, and about to compete, and you bring up
something about how strange it is for a boy to be doing ballet, well, that
might mess him up.
Actually, the usual example is even weaker than that. Say you're a girl
and about to take a math test, and you've heard the "girls are bad at math"
meme. If, as part of the exam, you're required to indicate your gender, you'll
perform worse.
On reflection, this is not the most believable thing in the world. It sounds
like, "Well, I guess that could be true, but it seems sorta forced."
And probably it's not true, or at least not very true: having people indicate their gender or whatever before an exam has a very small effect, if any.
The paper that torpedoed it for me is "An Examination of Stereotype Threat Effects on Girls' Mathematics Performance," which found substantial publication bias, the same problem our echolocating gold miners had. Studies that found no evidence of stereotype threat tended not to be published.
So this theory is not true, or at least not true enough to matter.
We can do better.

Being wrong sucks.
That's why I'm writing this.
It sucks not in a general for-the-good-of-all-men way, but for selfish reasons.
You don't want to be wrong about how the world works, because it will be embarrassing if you get into an argument.

You don't want to be wrong because if you are mistaken and you act on that mistaken belief when pursuing a goal, it's probably not going to work out.
And, finally, if you assume a wrong thing to be true and build off of that knowledge, then when you discover your mistake, you'll also have to throw out all the beliefs connected to it.
If that's not enough (it should be), consider that the main reason we know the name Johannes Kepler today is his fanatical devotion to creating planetary models consistent with observational data.
He really didn't want to be wrong.
We could all benefit from being a little more like Kepler. In the upcoming
sections, I'll cover 5 common beginner stats mistakes and how to avoid them.



I can't see the full electromagnetic spectrum and no matter how hard I squint,
I've yet to produce a single laser beam.
But, other than these two minor bugs, I'm satisfied with the human visual
system. It's powerful. At least as far as human systems go. (The mantis
shrimp has us beat. By a lot.)
And it makes sense when you think about it. Eyes are an example of convergent evolution. Nature stumbled on them somewhere between 50 and 100 times. They're that good. If you want to survive, you kinda need them.
Our best estimates put the first appearance of eyes on Earth at about
600 million years ago. That much time equals a great deal of selective pressure. How many thousands of generations of prehominids had to die so that
you can now make out the text on your iPhone's retina display?
This should give us some measure of confidence that they're well-optimized.
The human visual system is powerful.
Powerful enough that I've underestimated it before.
One of the major insights I had while memorizing more than 10,000 virtual flashcards was that graphical depictions of concepts are much more useful than verbal ones. They're retained better, are more flexible, and are easier to reason with.
Andrew Drucker has written one of my favorite papers of all time. In it, he describes how he multiplied two 10-digit numbers in his head, with one external aid: Flickr. The process took him 7 hours.
How did he do it?
By exploiting the power of the human visual system, specifically the bit that recognizes something you've seen before.
The idea is this: normally, when multiplying two large numbers, you'd use a piece of paper to write down the intermediate calculations. Instead, Drucker took a numbered set of images and, when he needed to store an intermediate result, he would familiarize himself with the corresponding image.

For instance, the blue numbers in his pictures are indexes. If he wanted to store the number 0 in position 12, he would familiarize himself with the 120th picture in his Flickr stream.
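The exact indexing scheme isn't spelled out here, but one plausible reading of that example (my guess, not a detail from the paper) is simply position times ten plus digit:

```python
def image_index(position: int, digit: int) -> int:
    """Hypothetical mapping: picture number = position * 10 + digit,
    so storing digit 0 at position 12 means studying picture 120."""
    return position * 10 + digit

print(image_index(12, 0))  # 120
```

The trick is that recalling "have I seen this picture before?" is far easier than recalling an arbitrary digit string.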
This is just one delightful application of the human visual system.
In a more straightforward application, Lionel Standing had participants look at 10,000 photographs over the course of 5 days, viewing each image for 5 seconds. He later showed them a group of images and asked which they'd seen before. The participants could identify the correct images with an 83% success rate, implying that they'd memorized about 6,600 images.
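The jump from 83% correct to "about 6,600 memorized" makes sense if the recognition test was a two-alternative forced choice, where blind guessing already scores 50%. Correcting for guessing:

```python
SHOWN = 10_000
ACCURACY = 0.83  # two-alternative forced choice: guessing alone gives 0.50

# If a fraction m of images were truly memorized, accuracy is
#   m + (1 - m) * 0.5   (remembered ones right, half the rest guessed).
# Solving for m gives m = 2 * accuracy - 1.
memorized_fraction = 2 * ACCURACY - 1
memorized = SHOWN * memorized_fraction
print(round(memorized))  # 6600
```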
Swelling to a crescendo here, then, the point is: visualize your data! There's no quicker, easier way to insight than graphing, dot-plotting, or 4d-squiggle-charting your data set.
The most important part of any data analysis is getting a feel for your data. Without understanding its structure, it's impossible to have any kind of confidence in your analysis.
Indeed, depending on your project, sometimes plotting the data is sufficient. If a trend is visible to the naked eye, most of your work is done for you
and, when it's not, often this means that there is no such trend at all.

How to Fix It
The fix here is trivial. Learn about different types of visualizations and
then experiment with them. Try typical graphs, dotplots, heatmaps, whatever.
Learn how to create these with your software system of choice.
This is my favorite book on the subject, by the way.
I saw a video a while back where one guy won a Kaggle competition because he opened the data set in Excel and colored the numbers according to
their severity. This gave him an intuitive understanding of the data which allowed him to build a better model and win.
Plot your data.
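Anscombe's quartet is the classic demonstration of why plotting beats summary statistics. Here are two of its four data sets: nearly identical mean, variance, and correlation, yet one is a noisy line and the other is a perfect curve. Only a plot tells them apart.

```python
from statistics import mean, variance

# Two of Anscombe's four famous data sets (Anscombe, 1973).
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def corr(xs, ys):
    # Pearson correlation, computed by hand to stay stdlib-only.
    mx, my = mean(xs), mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs)
           * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

# Identical to two decimal places -- yet y1 is a noisy line and
# y2 is a parabola. The numbers can't tell you that; a plot can.
for ys in (y1, y2):
    print(round(mean(ys), 2), round(variance(ys), 2), round(corr(x1, ys), 2))
```

Feed the same lists to any plotting tool and the difference is obvious at a glance.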


Overfitting is a subset of that too-human pastime: seeing patterns where none exist.
Here's an example.
When Nazi Germany bombed London during World War II, the British came up with a lot of theories about when and where the Germans were bombing, trying to figure out their schedule. On Sundays, one part of London. Other days, another.
They were pretty confident that they'd figured out at least part of it.
Only problem: later, more rigorous analysis revealed that there were no
such patterns. No schedules, nothing. The distribution of bombs was random.
This is also the way that baseball players are announced: "Next up to bat, Ricky Example. Mr. Example is distinguished by the highest number of home runs on Sunday afternoons. Ladies and gentlemen, let's see what he's got for us today!"
See, with overfitting, you can describe anyone as the best at something,
as long as it's specific enough. (This is one reason you should never believe
those lists of best cities, best colleges, or best jobs. With the right metrics and weightings, which are often mostly arbitrary, anything can be #1.)
More generally, overfitting is when you fit a model too tightly to your
data. Like, once when working on a model for predicting book ratings, I ran
some automated feature selection algorithms, but I got stuff out like, "If a
book is published in December with more than 40 citations and between 1
and 5 reviews on Amazon, the user will give this book 5 stars."
That sort of thing is not going to generalize to new data; it's an irrelevant characteristic of the input data set.

Why Overfitting Happens

What's going on? Why does overfitting happen?

It's all in the name, sort of. Overfitting occurs when you choose a model that's too specific to your training data set and doesn't generalize to other data, which is the whole reason you want a model in the first place. You want transfer.
Imagine that you coach people on mountain climbing and one day you're hired by a client, Bob Overfit. He wants to learn to climb mountains, so you both drive out to Eagle Mountain, and you show him the ropes, literally and figuratively.
A few weeks pass, and you get a call from Bob. Dude is irate. He tried to
go out to Mount Falcon with his pals, but kept stumbling and getting lost.
You were supposed to train him to climb mountains!
So you ask him a few questions, attempting to debug the problem. Finally, you ask him, "Bob, what exactly was your strategy for climbing Mount Falcon?"
"Well, I closed my eyes and followed the path that I'd memorized for climbing Eagle Mountain."
That's overfitting. Memorizing to climb one specific mountain isn't going
to help you climb all mountains, just like building a model too tightly around
one data set won't give you good performance on all data sets.
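Bob's strategy translates directly into code. In this made-up setup (all numbers arbitrary), a 1-nearest-neighbor "memorizer" competes with an ordinary least-squares line on noisy y = 2x data: the memorizer is flawless on the training set and noticeably worse on fresh data.

```python
import random

random.seed(1)

def make_data(n):
    # True relationship: y = 2x + noise
    return [(x, 2 * x + random.gauss(0, 1))
            for x in (random.uniform(0, 10) for _ in range(n))]

train, test = make_data(30), make_data(200)

# Bob's model: memorize the training set, then predict with the
# single nearest memorized point (1-nearest-neighbor).
def bob(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# A sane model: the least-squares line through the training data.
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
def line(x):
    return slope * x + intercept

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Bob aces the mountain he memorized and stumbles on new ones.
print("bob  train/test MSE:", mse(bob, train), mse(bob, test))
print("line train/test MSE:", mse(line, train), mse(line, test))
```

Bob's training error is exactly zero, which is itself a warning sign: a model that never misses on its own training data has probably memorized it.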

How to Prevent Overfitting

Once you understand the phenomenon of overfitting, preventing it from
happening is not too difficult.
The most straightforward way to do this is to have two sets of data: a training set and a test set. You build your model on the first set and then benchmark its performance with the second one.
Back when I was first teaching myself calculus, this is the exact strategy I
used. I only had two tests, and I didn't want to get into a situation where I'd,
through repetition, accidentally memorized all the answers. I wanted to be
sure that I knew the techniques.

To ensure that this happened, I'd practice with only one of the tests, and
then I'd use the other one to benchmark my performance, being sure not to
do that too often.
This ended up working well.
However, this is a pretty painful solution if you have a small data set (say
10,000 samples or less), because leaving out any data gives you less to build
an accurate model with.
To get around this, you can cross-validate your model.
In k-fold cross-validation, for instance, a data set is split into k subsets. One of those subsets is held out as a test set, while the model is trained on the rest. This is then repeated, rotating which subset is the holdout. By averaging the results, the hope is that you'll get a more accurate idea of how the model will perform in the wild.
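That loop is short enough to write from scratch. Here's a minimal sketch in plain Python; to keep it self-contained, the "model" is just "predict the training mean," and the data are made up.

```python
import random
from statistics import mean

random.seed(2)

def k_fold_cv(data, k, fit, error):
    """Shuffle the data, split it into k folds, and let each fold
    take one turn as the held-out test set. Returns the average
    held-out error across all k rounds."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(training)
        scores.append(mean(error(model, p) for p in held_out))
    return mean(scores)

# Toy example: predict y with the training mean, score by squared error.
data = [(x, 3.0 + random.gauss(0, 0.5)) for x in range(50)]
cv_mse = k_fold_cv(data, k=5,
                   fit=lambda tr: mean(y for _, y in tr),
                   error=lambda m, p: (m - p[1]) ** 2)
print(cv_mse)
```

Swap in your own `fit` and `error` functions and the same loop works for any model.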
The best thing you can do, though, really, is just avoid building models
that don't make any sense. If you've "discovered" that babies born on the
first Sunday of every month are more likely to be criminals, but you have
no compelling reason why, you've probably done something wrong.
This applies more broadly, too. When you hear about a new scientific study in the news that says something like, "People who wash their hands after eating lose more weight," be skeptical.
Use your brain. Don't overfit.




Let's say that you walk into your house or apartment, and it's a balmy 50
degrees. You'd rather have something around 70. If you turn your thermostat
to 90 degrees, will it get hotter faster than if you set it to 70?
Take a minute to think of your answer.
Okay, here's the answer: if you answered that it would get hotter faster, or you weren't sure, you don't understand how your heating works. To increase the temperature, your furnace kicks on. It burns at one fixed rate and, once the house is at the right temperature, kicks off. When the temperature falls, it kicks back on.
This is, by the way, the same way an oven works.
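If you want to convince yourself, a thermostat is a few lines of simulation. This sketch assumes an idealized bang-bang furnace with a fixed heating rate; only the dial setting changes.

```python
def minutes_to_reach(start, target, setpoint, rate=1.0):
    """Bang-bang control: the furnace is either fully on or off,
    heating at one fixed rate whenever temp is below the setpoint."""
    temp, minutes = start, 0
    while temp < target:
        if temp < setpoint:  # furnace on
            temp += rate
        else:                # furnace off; temp won't rise further
            break
        minutes += 1
    return minutes

# Cranking the dial to 90 doesn't reach 70 any faster:
print(minutes_to_reach(50, 70, setpoint=70))  # 20
print(minutes_to_reach(50, 70, setpoint=90))  # 20
```

The setpoint only decides when the furnace shuts off, never how hard it burns.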
You can go your entire life without understanding how the heating in
your house works. If your furnace breaks down, you can hire a professional to
fix it.
The same cannot be said of statistics. If you write an analysis and you don't understand how the tools work, you will fuck up everything. Remember in the beginning when I talked about Barbara Fredrickson being incorrect about a happiness tipping point?
She didn't understand how the math worked, or that it wasn't applicable in this case.
This is the exact problem that I'm talking about. Before I had a firm
enough grasp of fitting a polynomial to a data set in R, I made some very
dumb mistakes. Don't do this.
For instance, even when it comes to one of the most common statistical tools, p-values, studies have found that 80% of methodology *instructors* get their definition wrong, and 100% of psychology students do.

How to Fix This

If you're uncertain about what p-values actually mean, or feel like statistics is some kind of magic, sit down and figure out how your models work. This is not as difficult as you might expect, and it's well worth it.


I once had a question about whether or not Student's t-test was the right test for me to be using. I spent some time on Google and even asked a few people. This proved to be wholly unnecessary. Sitting down for 15 minutes and grokking the mathematical definition was enough to convince me that it was the right tool and that I should continue.
Do the same thing. Sit down, figure it out, and move on with confidence.
If you're worried about understanding, use the Feynman method, which I'll
cover in a future email.
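One concrete way to grok what a p-value is, without any distribution tables, is a permutation test: shuffle the group labels many times and see how often chance alone matches your observed difference. (The measurements below are made up for illustration; this isn't Student's t-test itself, but it computes the same kind of quantity from first principles.)

```python
import random
from statistics import mean

random.seed(3)

def perm_test(a, b, n_perm=10_000):
    """The p-value, concretely: the fraction of random label
    shuffles that produce a difference in group means at least
    as large as the one actually observed."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical measurements for two groups (made-up numbers).
group_a = [2.1, 2.5, 2.3, 2.7, 2.4]
group_b = [2.0, 2.2, 2.1, 2.3, 1.9]
print(perm_test(group_a, group_b))
```

Once you've written this yourself, "p < .05" stops being an incantation: it's just "fewer than 5% of label shuffles beat what we saw."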



In the introduction, I mentioned the mysterious case of gold echolocation.
The culprit was that many people were testing something, but only those that
had fluke positive results were reported.
Data mining works by the same mechanism but, instead of many testing
one thing, one person tests many things.
This is accurately summed up by xkcd's "Significant" comic, the one where scientists test 20 colors of jelly beans and the newspaper trumpets the single green-jelly-bean fluke.


See, the issue is when you mix up exploratory analysis and actually testing something. In general, your process should be along the lines of:
1. Hm, I wonder if x is true. Here's how I could test it.
2. Collect relevant data.
3. Test it.
Don't test a ton of things on one data set and then only report that which
is statistically significant. This isn't how it works. Just by chance alone, you'll
end up reporting false things.

How To Fix This

Exploration is a necessary part of figuring out what to study. If you want ideas, you need strong priors about the world, and reading broadly and exploring data are the way to acquire them.
So, how do we reconcile this with the fact that you can't run a bunch of
tests on one data set and expect true results with standard techniques?
You have a couple of options.
You can decide ahead of time what to test.
You can explore one data set and test on another.
You can correct for multiple tests with statistical techniques such as the
Bonferroni correction.
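A quick simulation shows what the Bonferroni correction buys you. Run 20 tests per analysis on pure noise: uncorrected, most analyses produce a bogus "discovery"; with the corrected threshold of .05/20, the family-wise false-positive rate drops back to about 5%.

```python
import random

random.seed(4)

TESTS = 20       # hypotheses tested on the same data set
ALPHA = 0.05
TRIALS = 20_000  # simulated analyses, all on pure-noise data

# Under a true null (continuous statistic), p-values are uniform
# on [0, 1], so we can simulate them directly.
naive = corrected = 0
for _ in range(TRIALS):
    ps = [random.random() for _ in range(TESTS)]
    if min(ps) < ALPHA:
        naive += 1         # at least one bogus "discovery"
    if min(ps) < ALPHA / TESTS:
        corrected += 1     # Bonferroni threshold: 0.05 / 20 = 0.0025
print("uncorrected false-positive rate:", naive / TRIALS)
print("Bonferroni false-positive rate: ", corrected / TRIALS)
```

The price is power: the corrected threshold also makes real effects harder to detect, which is why deciding your hypothesis ahead of time is the cleaner option.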



The point of statistics is inference. You want to take a sample of, say, 35
people, and be able to learn something about everyone.
This means you need to take a representative sample. If you don't, your
results will be worthless (or even harmful).
My favorite example of this is Saturday Night Live's "Weekend Update," one of those fusions of news and humor that is quickly becoming too common. Anyway, one night they had this to say: "A recent poll showed that 58% of Americans now favor the legalization of marijuana. Of course, this was 58% of the people who were at home in the middle of the day to answer a telephone."
The implication: those who answered the poll weren't a random sample and, thus, can't reflect the opinions of Americans as a whole.
This problem is not limited to Gallup polls.
Psychological science reeks of it. So much so that we have an acronym, WEIRD, because participants are overwhelmingly from Western, educated, industrialized, rich, and democratic countries.
Indeed, it's even worse, because the sample is overwhelmingly college kids who are participating for extra credit.
A similar problem plagues businesses. I had a chance to hear a presentation by one of FlightCar's founders, and he mentioned the problem with surveys: only those who love you and those who hate you leave them. If you want responses from people in the middle, you'll have to do something else.
So, okay. Those are unrepresentative samples. If you fuck up your sample, your results will be useless. Garbage in, garbage out.
So how do you prevent this?

How To Fix This

There are entire books covering sampling, and I can't claim to be an expert. Use your common sense, though. Think about the population you want to sample and ask yourself, "If, in 6 months, I've discovered that my sampling was wrong, what will have been the problem?" (This is sometimes called a pre-mortem.)
A pretty typical problem with calling people is that there are certain subsets who just won't answer. So you have to come up with some way of getting
in touch with this subset. Keep calling, letters, email, knocking on doors. I
don't know. But you'll have to figure it out.
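The SNL joke is easy to turn into a simulation. In this made-up population, true support is 50%, but supporters are three times as likely to be home answering the phone, and the naive phone poll comes back wildly wrong.

```python
import random

random.seed(5)

# Hypothetical population: true support for a policy is 50%, but
# supporters answer the phone 30% of the time vs. 10% for others.
N = 100_000
population = []
for _ in range(N):
    supports = random.random() < 0.50
    answers_phone = random.random() < (0.30 if supports else 0.10)
    population.append((supports, answers_phone))

true_rate = sum(s for s, _ in population) / N
reached = [s for s, a in population if a]
sampled_rate = sum(reached) / len(reached)

print("true support:   ", round(true_rate, 3))     # ~0.50
print("phone-poll says:", round(sampled_rate, 3))  # ~0.75
```

No amount of extra calling fixes this; a bigger biased sample just gives you a more precise wrong answer.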



Here's the executive summary then.
Not plotting the data. The human visual system rocks. Use it. Learn
how to chart data. Read this book, absorb it, love it, and visualize.
Overfitting. Overfitting occurs when a model is trained too closely to the training set, such that its performance doesn't generalize to new data. It's like learning to climb mountains by memorizing the route up one mountain. To prevent this, test your model against a holdout set or use cross-validation.
Not understanding your tools. If you use a statistical model without understanding how it functions, you will fuck it up. If you're confused, sit down and figure it out. Use the Feynman method, which I'll cover in a future email.
Data mining instead of hypothesis testing. If you run many experiments on one data set, you need to correct with something like a Bonferroni correction. Better: pick one hypothesis and test it. Don't go exploring.
Taking an unrepresentative sample. If you want to say something about all Americans, you need to take a sample representing all Americans, not people taking a psychology class on your campus or similar. Fix this by thinking long and hard about potential issues with your sampling mechanism.
That's all. Go forth and produce error-free statistics! Find gold.