TACTICS USED IN THE
Scientists and other theory-driven data analysts focus on eliminating bias and maximizing accuracy so they can find trends and patterns in their data. That’s necessary for any type of data analysis. For statisticians, though, the real enemy in the battle to discover knowledge is not so much a lack of accuracy as a lack of precision. Call it lack of precision, variability, uncertainty, dispersion, scatter, spread, noise, or error, it’s all the same adversary.

One caution before advancing further.
There’s a subtle difference between data mining and data dredging. Data mining is the process of finding patterns in large data sets using computer algorithms, statistics, and just about anything else you can think of short of voodoo. Data dredging involves substantial voodoo, usually in the form of overfitting models to small data sets that aren’t really as representative of a population as you might think. Overfitting doesn’t happen all at once; it’s more of a siege process. It takes a lot of work. Even so, it’s easy to lose track of where you are and what you’re doing when you’re focused exclusively on a goal. The upshot of all this is that you may be creating false intelligence. Take the high ground. Don’t interpret statistical tests and probabilities too rigidly. Don’t trust diagnostic statistics with your professional life. Any statistician with a blood alcohol level under 0.2 will make you look silly.

Here are ten tactics you can use to try to control and reduce variability in a statistical analysis.
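To make the overfitting caution above concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the straight-line "population," the noise level, the sample size of eight, and the helper names `draw`, `rmse`, and `fit_and_score` are all hypothetical. It fits a simple polynomial and an overly flexible one to the same small sample, then scores both on fresh data from the same population.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def draw(n):
    """Draw n points from a hypothetical population: a straight line plus noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.5, n)
    return x, y

def rmse(model, x, y):
    """Root-mean-square error of a fitted model on the points (x, y)."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

def fit_and_score(degree, train, fresh):
    """Fit a polynomial of the given degree to train; return (train RMSE, fresh RMSE)."""
    model = Polynomial.fit(train[0], train[1], degree)
    return rmse(model, *train), rmse(model, *fresh)

small_sample = draw(8)     # the small data set being dredged
fresh_data = draw(1000)    # new data from the same population

simple = fit_and_score(1, small_sample, fresh_data)
overfit = fit_and_score(7, small_sample, fresh_data)

print("degree 1: train RMSE %.3f, fresh RMSE %.3f" % simple)
print("degree 7: train RMSE %.3f, fresh RMSE %.3f" % overfit)
```

The degree-7 model has as many coefficients as there are data points, so it should look nearly perfect on the eight points it was fit to and noticeably worse on new data. That gap is the false intelligence the caution warns about.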
Know Your Enemy
There’s an old saying: six months collecting data will save you a week in the library. Be smart about your data. Figure out where the variability might be hiding before you launch an attack. Focus at first on three types of variability: sampling, measurement, and environmental (http://statswithcats.wordpress.com/2010/08/01/there%E2%80%99s-something-about-variance/). Sampling variability consists of the differences between a sample and the population that are attributable to how uncharacteristic (non-representative) the sample is of the population. Measurement variability consists of the differences between a sample and the population that are attributable to how data were measured or otherwise generated. Environmental variability consists of the differences between a sample and the population that are attributable to extraneous factors. So there are three places to hunt for errors: how you select the samples, how you measure their attributes, and everything else you do. OK, I didn’t say it was going to be easy.
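A small simulation can show how the three sources stack up. The sketch below is a toy model, not a general result: the population, the noise levels, and the choice to treat the environmental effect as one shift shared by every measurement in a study (say, the day the data were collected) are all assumptions, and the name `sample_mean_errors` is hypothetical. It simulates many studies and measures how far the sample mean lands from the true population mean as each source of variability is switched on.

```python
import numpy as np

rng = np.random.default_rng(42)

# All numbers below are hypothetical, chosen only for illustration.
POP_MEAN, POP_SD = 100.0, 10.0   # the true population
MEAS_SD = 4.0                    # instrument noise added to each measurement
ENV_SD = 3.0                     # extraneous shift shared by a whole study
N, TRIALS = 25, 20000            # sample size per study, number of simulated studies

def sample_mean_errors(measurement=False, environment=False):
    """Return (sample mean - population mean) for each simulated study."""
    # Sampling variability: each study draws its own N members of the population.
    values = rng.normal(POP_MEAN, POP_SD, size=(TRIALS, N))
    if measurement:
        # Measurement variability: independent noise on every observation.
        values = values + rng.normal(0.0, MEAS_SD, size=(TRIALS, N))
    if environment:
        # Environmental variability: one shift per study, applied to all N values.
        values = values + rng.normal(0.0, ENV_SD, size=(TRIALS, 1))
    return values.mean(axis=1) - POP_MEAN

sd_sampling = sample_mean_errors().std()
sd_with_meas = sample_mean_errors(measurement=True).std()
sd_all = sample_mean_errors(measurement=True, environment=True).std()

print("sampling only:            %.2f" % sd_sampling)
print("+ measurement:            %.2f" % sd_with_meas)
print("+ measurement + environ.: %.2f" % sd_all)
```

In this toy model the sampling and measurement contributions shrink as N grows, while the shared environmental shift does not. That is one reason to hunt for each source separately instead of simply collecting more data.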
Start with Diplomacy
Start by figuring out what you can do to tame the error before things get messy. Consider how you can use the concepts of reference, replication, and randomization. The concept behind using a reference in data generation is that there is some ideal, background, baseline, norm, benchmark, or at least, generally accepted standard that can be compared to all similar data operations or results. If you can’t take advantage of a reference point to help control variability,
A lone sentry sits on watch in the night.